Dataset-Driven Research to Support Learning and Knowledge Analytics


In various research areas, the availability of open datasets is considered key for research and application purposes. These datasets are used as benchmarks to develop new algorithms and to compare them to other algorithms in given settings. Finding such datasets for experimentation can be a challenging task in technology enhanced learning, as there are various sources of data that have not been identified and documented exhaustively. In this paper, we provide an analysis of datasets that can be used for research on learning and knowledge analytics. First, we present a framework for the analysis of educational datasets. Then, we analyze existing datasets along the dimensions of this framework and outline future challenges for the collection and sharing of educational datasets.

"Introduction. The need for better measurement, collection, analysis and reporting of data about learners has been identified by several researchers in the Technology Enhanced Learning (TEL) field (Siemens 2010; Romero et al. 2007; Duval 2011). This need has been translated into an emerging strand of research on learning and knowledge analytics (LAK), as reflected by a number of conferences and special issues in recent years (Siemens & Gasevic, 2011). Among others, the analysis of learner data and identification of patterns within these data are researched to predict learning outcomes, to suggest relevant resources and to detect error patterns or affects of learners. These objectives are researched to act upon needs of a variety of stakeholders, including learners, teachers and organizations. This is what drives major initiatives such as the US-based Learning Registry (http://www.learningregistry.org) to collect data and make them publicly available for research and application purposes. Siemens (2010) defines learning analytics as “the use of intelligent data, learner-produced data, and analysis models to discover information and social connections, and to predict and advise on learning.” Contributions to the first conference on learning analytics and knowledge in 2011 indicate that information visualization, social network analysis and educational data mining techniques offer interesting perspectives for this emerging field. Whereas the specific techniques differ depending on context and the intended goals, the main objective of the approaches is to identify needs of target users and to support these needs using intelligent and adaptive systems. Despite the recognition of the importance of LAK, the literature related to this topic is rather limited. Research on web analytics, search engines and recommender systems are excellent examples of how data gathering during an analytics cycle can be used to refine offerings to users (Elias, 2011). Whereas several recommender systems for learning (Manouselis et al., 2011), intelligent tutoring systems (Romero et al. 2007) and visual analytics systems (Govaerts et al. 2010) have been implemented for use in learning scenarios in recent years, many of these intelligent tools often stay in researcher hands and rarely go beyond the prototype stage (Reffay & Betbeder, 2009). Among others, researchers have argued that the time needed by social science to validate prototypes is too long compared to the rate of technology innovation. An important component to facilitate research in this area is the existence of extensive overviews of the available datasets that will provide researchers with a wide array of potential data sources to experiment with, as well as with an analysis of their properties that will help researchers decide about their appropriateness for their experiments. Such an overview is missing from LAK today, since only initial attempts have been made to document and study existing datasets (Drachsler et al., 2010). In this article, we extend our initial analysis (Verbert et al., 2011) of the datasets collected by the dataTEL Theme Team of the European Network of Excellence STELLAR (http://www.teleurope.eu/pg/groups/9405/datatel/), in order to provide a more comprehensive overview of datasets for LAK research. The article makes four primary research contributions: - First, we present related initiatives that are collecting datasets and the needs and opportunities to make educational datasets available for LAK research. 
- Second, we present a framework for the analysis of educational datasets. In particular, we present properties of educational datasets and LAK objectives that can benefit from the availability of such data.
- Third, we analyze existing datasets along the dimensions of our educational dataset framework. We also discuss existing research that has used these datasets for LAK related research.
- Finally, we present future challenges to enable the sharing and reuse of datasets among researchers in this field.

Background

In an increasing number of scientific disciplines, large data collections are emerging as important community resources (Chervenak et al., 2000). These datasets are used as benchmarks to develop new algorithms and to compare them to other algorithms in given settings (Manouselis et al., 2010b). In datasets that are used for recommendation algorithms, such data can for instance be explicit (ratings) or implicit (downloads and tags) relevance indicators. These indicators are then used, for instance, to find users with similar interests as a basis to suggest items to a user.

To collect TEL datasets, the first dataTEL Challenge was launched as part of a workshop on Recommender Systems for TEL (RecSysTEL, Manouselis et al., 2010a) that was jointly organized by the 4th ACM Conference on Recommender Systems and the 5th European Conference on Technology Enhanced Learning in September 2010. In this call, research groups were invited to submit existing datasets from TEL applications. A special dataTEL Cafe event took place during the RecSysTEL workshop in Barcelona to discuss the submitted datasets and to facilitate dataset sharing in the TEL community.

Related work is carried out at the Pittsburgh Science of Learning Center (PSLC). The PSLC DataShop (Stamper et al., 2010) is a data repository that provides access to a large number of educational datasets derived from intelligent tutoring systems. Currently, more than 270 datasets are stored that record 58 million learner actions. Several researchers of the educational data mining community have used these datasets to predict learner performance.

The Mulce project (Reffay & Betbeder, 2009) is also collecting and sharing contextualized interaction data of learners. A platform is available to share, browse and analyze shared datasets. At the time of writing, 34 datasets are available on the portal, including a dataset of the Virtual Math Teams (VMT) project. This project investigated the use of online collaborative environments to support K-12 mathematics learning. These datasets have been used extensively by the Computer Supported Collaborative Learning (CSCL) community (Stahl, 2009).

Other efforts have been driven by fields studying child language acquisition. The CHILDES system (MacWhinney, 1996, 2007) helped realize much advancement in this field through the sharing of language-learning data. TalkBank (http://talkbank.org) is a follow-up project that is researching guidelines for the ethical sharing of data, metadata and infrastructure for identifying available data, and the education of researchers about the existence of shared data, tools, standards and best practices.

LinkedEducation.org is another initiative that provides an open platform to promote the use of data for educational purposes. At the time of writing, five organizations have contributed datasets. Available datasets describe the structure of organizations and institutions, the structure of courses, learning resources and interrelationships between people.
In addition, various schemas and vocabularies are provided to describe the internal structure of an academic institution, discourse relationships, activity streams in social networks and educational resources. Such schemas and vocabularies offer interesting perspectives for the sharing and reuse of educational interaction data that is relevant for LAK research.

Several other initiatives focus on providing the means to share datasets among researchers in a more generic way. DataCite.org is an organization that enables users to register research datasets and to assign persistent identifiers to them, so that datasets can be handled as citable scientific objects. The Dataverse Network (King, 2007) is an open-source application for publishing, citing and discovering research data. The network was established at Harvard University and aims to increase scholarly recognition for data contributions. Fact sheets of datasets are gathered from organizations, and researchers are encouraged to make the data publicly available if possible. The Australian National Data Service (Treloar & Wilkinson, 2008) is a similar initiative in Australia that works on services to help researchers persistently identify and describe data.

In this paper, we analyze educational datasets that have been collected by dataTEL and related initiatives. We focus specifically on datasets that contain interaction/usage data of learners and that can be used for analytics research. In the next section, we present a framework that can be used to describe and analyze educational datasets. In addition, we discuss how the work of related initiatives fits within this framework. Then, we analyze available datasets along the dimensions of this framework.

A Framework for Educational Datasets

In this section, we present a framework for the analysis of educational datasets. The framework is intended to address questions researchers might have about the potential usefulness of a dataset for their research purposes. As illustrated in Figure 1, the framework consists of three parts. Dataset properties describe the overall dataset, such as the application and the educational setting from which the data were collected. Data properties define, at a finer-grained level, which data elements are available, including action types such as downloads or selects and information about the learner and other entities involved. The third part of the framework defines a list of objectives of LAK research. These objectives are mapped to dataset and data properties in the next section to determine the potential usefulness of a dataset for LAK research purposes.

Figure 1. Educational dataset framework.

Dataset properties

We recently (Drachsler et al., 2010) presented a specification of datasets that was used for the first dataTEL challenge. Among others, the specification includes information about the application in which the dataset has been collected, the educational setting, contact person, availability (open access or legal protection rules that describe how and when the dataset can be used), dataset collection method, dataset statistics, and pre-processing steps that have been applied to the data. Related initiatives have also defined formats to package and describe datasets. As illustrated in Figure 2, a Mulce dataset comprises the following components:

- The instantiation component includes all interaction data, as well as user information.
- The learning design component describes the educational scenario.
- The research protocol describes the methodology of research with the dataset.
- The license component specifies dataset provider and user rights.
- The analyses component contains research outputs.

Figure 2. Mulce format (Reffay & Betbeder, 2009).
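To make these specification elements more concrete, the sketch below shows how a single dataset description combining fields of this kind could be represented. It is illustrative only: the field names and values are assumptions chosen for the example and are not part of the dataTEL specification, the Mulce format or any other format discussed here.

    # Illustrative only: a dataset description record combining fields inspired by
    # the dataTEL specification and the Mulce components. Field names and values
    # are assumptions made for this example, not part of either specification.
    dataset_description = {
        "name": "example-lms-dataset",                 # hypothetical dataset
        "application": "learning management system",
        "educational_setting": "university course",
        "availability": "on request, usage agreement required",
        "collection_method": "server-side event logging",
        "collection_period": ("2010-09-01", "2011-06-30"),
        "statistics": {"learners": 350, "actions": 120000},
        "pre_processing": ["anonymization", "removal of test accounts"],
        "components": {                                # Mulce-style packaging
            "instantiation": "interaction data and user information",
            "learning_design": "description of the educational scenario",
            "research_protocol": "methodology of research with the dataset",
            "license": "provider and user rights",
            "analyses": "research outputs",
        },
    }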
The PSLC DataShop project defines a specification for describing datasets that are derived from intelligent tutoring systems. The specification includes the project name, principal investigator, curriculum, collection dates, domain, application, description, hypothesis (e.g., “people who are required to use the tutor show less error on quizzes”), school, statistics and knowledge models of interactions. We discuss such interaction models in the next section. In addition, research papers that used the datasets are referenced.

As illustrated in Table 1, there are many similarities between the specifications. Explicit information is indicated by “+” signs. This information constitutes explicitly articulated elements of the specifications. Implicit information is indicated by “(+)” signs and represents information that is implied or expressed as part of other elements. For instance, in the dataTEL specification, information about the domain or users can be described as part of the description of the application or environment, but no specific fields are provided for these elements. To date, the Mulce format provides the most comprehensive format for describing datasets. In addition to interaction data, the datasets incorporate a detailed description of the educational scenario in a learning design component and the results of various analyses. Therefore, this specification provides the most interesting perspectives for describing educational datasets in a generic way.

Table 1. Overview dataset properties.

Data properties

In addition to a format for describing datasets, there is a need to identify at a finer level of granularity which data elements are stored. Such information is essential to identify for which research purposes a dataset is useful. As outlined by Romero et al. (2007), the TEL field differs from the e-commerce analytics field in several ways. In e-commerce, the data used are often simple web server access logs or ratings of users on items. In TEL, many researchers use more information about a learner interaction (Pahl & Donnellan, 2002). The user model and the objectives of the systems are also different in the two application domains (Drachsler et al., 2009).

A survey of existing TEL interaction data models has been presented in (Butoianu et al., 2010). Such models capture actions of users on resources, such as open/close, select/unselect or write actions. In addition, the context in which an action occurred can be captured, such as the application the user is currently working with or his or her current task. The Atom activity stream RDF mapping of the LinkedEducation.org initiative provides such a model for actions of users in social networks. Vocabularies for actions, actors and objects involved and related contextual information are defined.

In addition to interaction models, learner models have been elaborated that describe several characteristics of learners. Brusilovsky and Millan (2007) identified the following categories based on an extensive analysis of the literature: knowledge levels, goals and tasks, interests, background, and learning and cognitive styles. In addition, several models, standards and specifications have been elaborated to describe learning resources. The IEEE LOM and Dublin Core metadata standards are prominently used to describe learning resources, including general characteristics, such as title, author and keywords, technical and educational characteristics, and relations between learning resources.

Figure 3. Learner Action Model.

We integrated the various data categories and elements in Figure 3. We use this model in the remainder of this article to identify data elements in existing datasets. The model has been developed by synthesizing the existing work on interaction data and context variables in the TEL field that was outlined above. It could be further refined by studying relevant theoretical frameworks, such as Activity Theory (Kaptelinin et al., 1995), which could help reorganize the various categories and elements. Future research work in this area is discussed in the last section.
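A record in the spirit of such a learner action model can be sketched as a small data structure. The following sketch is illustrative only; the field names are choices made for this example rather than elements taken from Figure 3 or from any of the specifications above.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    # Illustrative sketch of a learner action record: an actor performs an action
    # on a resource, in some context, with an optional result. Field names are
    # assumptions made for this example.
    @dataclass
    class LearnerAction:
        learner_id: str                       # anonymized learner identifier
        action: str                           # e.g. "select", "rate", "attempt", "write"
        resource_id: str                      # learning resource or tutor activity
        timestamp: datetime
        duration_seconds: Optional[float] = None    # time interval, if logged
        result: Optional[str] = None          # e.g. "correct", "incorrect", a rating value
        context: dict = field(default_factory=dict) # e.g. {"application": "LMS", "course": "C1"}

    # Example event: a learner rates a resource in an LMS.
    event = LearnerAction("u042", "rate", "res-17", datetime(2011, 3, 2, 10, 15),
                          result="4", context={"application": "LMS"})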
Learning and knowledge analytics objectives

In order to provide guidance on the relevance of datasets for LAK research, we identify a set of objectives that are relevant for LAK applications. We also outline existing research work in related research communities that, when interconnected, can provide substantial synergies to advance the emerging LAK field.

- Predicting learner performance and modeling learners. The prediction of learner performance and the modeling of learners have been researched extensively by the educational data mining, educational user modeling and educational adaptive hypermedia communities. The objective is to estimate the unknown value of a variable that describes the learner, such as performance, knowledge, scores or grades (Romero & Ventura, 2007). Such predictions are for instance used by intelligent tutoring systems to provide advice or hints when a learner is solving a problem. Dynamic learner models are also researched to support adaptation in educational hypermedia systems (Brusilovsky & Millan, 2007).
- Suggesting relevant learning resources. Recommender systems for learning have gained increased interest in recent years. A recent survey of TEL recommender systems has been elaborated by Manouselis et al. (2011). These systems typically analyze learner data to suggest relevant learning resources, peer learners or learning paths.
- Increasing reflection and awareness. Several researchers are focusing on the analysis and visualization of different learning indicators to foster awareness of and reflection on learning processes. These indicators include resource accesses, time spent and knowledge level indicators (Mazza & Milani, 2005).
- Enhancing social learning environments. Analysis and visualization of social interactions is researched to make people aware of their social context and to enable them to explore this context (Heer & boyd, 2005). In TEL, this is particularly, but not only, relevant for Computer Supported Collaborative Learning (CSCL) (Stahl, 2009), where interactions with peer learners are a core aspect of how learning is organized. In CSCL, much research has focused on the analysis of networks of learners, typically with a social network analysis approach (Reffay & Chanier, 2003).
- Detecting undesirable learner behaviors. The objective of detecting undesirable learner behavior is to discover learners who have some type of problem or unusual behavior, such as erroneous actions, misuse, cheating, dropping out or academic failure (Romero & Ventura, 2007).
- Detecting affects of learners. Researchers in TEL often refer to the affective states defined by D’Mello et al. (2007). These states are classified as boredom, confusion, frustration, eureka and flow/engagement, versus neutral. Among others, the detection of affects is researched to adjust pedagogical strategies during the learning of complex material.
The objectives are highly interrelated. For instance, whereas research on affect and on awareness and reflection has traditionally focused on an individual perspective, these objectives are also increasingly researched to enhance social learning environments.

Datasets for Learning and Knowledge Analytics

In this section, we present an analysis of datasets that can be used for a wide variety of LAK research purposes. We analyze the datasets along the dimensions of the dataset framework that we presented in the previous section.

Table 2. Overview dataset properties.

Dataset properties

Table 2 provides an overview of the characteristics of available educational datasets, including the application from which data were collected, the collection period, statistics and the educational context or domain. The full description of the datasets is available on the portals that provide access to these datasets, including the dataTEL (http://www.teleurope.eu/pg/pages/view/50630/), DataShop (https://pslcdatashop.web.cmu.edu/) and Mulce (http://mulce.univ-bpclermont.fr:8080/PlateFormeMulce/) portals.

Several dataTEL datasets have been collected from learning management systems (LMS). The UC3M dataset also collects data from a virtual machine that was used in a C programming course. The particularity of this dataset is that it records actions from several tools that learners are using. This approach makes it possible to collect a more comprehensive picture of learner activities, such as a learner searching for additional resources on the web. Such an approach is also researched under the prism of personal learning environments (PLEs), where data are tracked from learning environments that assemble relevant tools for course activities. Many other dataTEL datasets were collected from web portals that provide access to large collections of learning resources.

Several other datasets are collected from intelligent tutoring systems (ITS), including a large number of datasets from the PSLC DataShop initiative. We include in this analysis the “Algebra 2008-2009” and “Bridge to Algebra” datasets that were used for the KDD Cup 2010 on educational data mining (https://pslcdatashop.web.cmu.edu/KDDCup/). In addition, the recommended datasets of the DataShop are analyzed. At the time of writing, 64 datasets are publicly available. Finally, many of the Mulce datasets contain data that were captured from forums, chat and email conversations between learners in collaborative learning settings.

The collection period varies from 10 days to 6 years. Several of the Mulce datasets capture data of group work during a specific learning activity. Datasets derived from learning management systems and web portals often capture data during a longer period of time, ranging from a couple of months to several years. Although a few of the available datasets capture data of a large number of users, many others are more limited in size. Some datasets collect data of 1,000 to 7,000 users. Several other datasets capture data of only a few learners. These datasets are in some cases only a sample that the organization made available or, in other cases, datasets of a small number of collaborating users, such as the VMT and mce-copeas datasets.

Several datasets are openly accessible. For other datasets, legal protection rules apply. We obtained these datasets by sending a statement of our intended research purposes to the organization and then signing an agreement on the use of these data. All datasets contain data that have been anonymized, so that they can no longer be linked to an individual.
Data properties

Table 3 presents a more detailed overview of the data elements that are included in the datasets. The datasets contain a diverse set of actions of users. These actions include attempts of learners on quizzes, search actions, and the selection, annotation, rating, creation or editing of resources. The PSLC datasets derived from intelligent tutoring systems all include attempt actions on activities provided by the tutor. In some datasets, help requests are stored. The input provided by learners is sometimes further specified into select, write or create actions. The Mulce datasets capture social interactions; in most cases these constitute send and receive actions.

Explicit information about learners (or teachers) is stored in only a few datasets. The data are in most cases anonymized and little additional information about learners or teachers is stored. Some dataTEL datasets contain information about the language, interests, knowledge level or country of the user. Some DataShop datasets describe the gender and knowledge level of the learner, including her past grades. The mce-copeas dataset divides learners into three groups according to their knowledge level (beginner, medium, expert). Information about country, age, language and gender is often provided in Mulce datasets.

Information about resources is available in more datasets. The information ranges from an identifier of the resource to detailed descriptions that include educational characteristics such as duration, minimum age, maximum age and resource type, technical characteristics, and annotations such as tags and comments. Such metadata are often provided in dataTEL datasets that were captured from learning repository portals. In the DataShop datasets, educational information such as average duration and required skills is sometimes provided. In addition, compositional relations are provided that define a hierarchy of units and sections. Social relations between collaborating learners are stored in the Mulce and some dataTEL datasets.

Additional context information is also stored. Several datasets provide timestamp information. The duration of an action is stored explicitly as a time interval in the DataShop and some LMS datasets. Such information is valuable to calculate the difference between the estimated durations described in resource metadata and the time the learner needed in practice to complete an activity. Other contextual information is not often available. In datasets that contain data of multiple tools and services, information about the application from which an action was triggered is included.

Table 3. Overview data properties.

Finally, the result of actions, such as correct or incorrect attempts, rating values or error messages, is stored. In addition, some datasets contain the grade a learner obtained for an activity or course. We elaborate in the next section how such data can be used for LAK research.
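As a small illustration of the point about durations above, the following sketch compares the time a learner actually spent on an activity (aggregated from logged time intervals) with the estimated duration given in resource metadata. All identifiers and values are invented for the example and do not come from any of the datasets discussed here.

    # Illustrative sketch: compare logged time-on-task with the estimated duration
    # from resource metadata. All identifiers and values are invented.
    estimated_duration = {"res-17": 600, "res-23": 900}   # seconds, from metadata

    logged_actions = [                                    # (learner, resource, seconds spent)
        ("u042", "res-17", 450),
        ("u042", "res-17", 300),
        ("u051", "res-23", 1400),
    ]

    time_spent = {}
    for learner, resource, seconds in logged_actions:
        time_spent[(learner, resource)] = time_spent.get((learner, resource), 0) + seconds

    for (learner, resource), actual in sorted(time_spent.items()):
        difference = actual - estimated_duration[resource]
        print(f"{learner} on {resource}: {actual}s spent, "
              f"{difference:+d}s relative to the estimated duration")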
Usefulness of available datasets for LAK-related research objectives

Prediction of learner performance and discovering learner models

Several datasets are available that can support research on the prediction of learner performance and the discovery of learner models. Among others, such predictions are researched to provide advice when a learner is solving a problem (Romero et al., 2007). Datasets from intelligent tutoring systems that capture attempts of learners provide a rich source of data to estimate the knowledge level of a learner. Some datasets derived from LMSs contain data on the number of attempts and the total time spent on assignments, forums and quizzes. Romero et al. (2008) compared different data mining techniques to classify learners based on such LMS data and the final grade obtained for courses. Datasets that have been captured from PLEs offer interesting perspectives to extend such research to open learning environments. In addition, many datasets are suitable to identify interests of users based on resource accesses.

Several researchers have already experimented with the datasets outlined above to predict learner attributes. The “Algebra 2008-2009” and “Bridge to Algebra” datasets were used in the KDD Cup 2010. Participants were asked to learn a model from past learner behavior and to predict future performance. The winners of this competition combined several educational data mining techniques. Cen et al. (2007) performed a learning curve analysis with the “Geometry Area” dataset. They noticed that learners were required to over-practice some easy target skills, while they under-practiced harder skills. Based on this observation, they created a new version of the tutor by resetting the parameters that determine how often skills are practiced. References to other studies with these datasets are available on the DataShop portal.

Suggesting learning resources

Several dataTEL datasets contain explicit relevance indicators in the form of ratings that are relevant for research on recommendation algorithms for learning. In addition, implicit relevance indicators, such as downloads, search terms and annotations, are available that can be used for such research. If time interval data are available, the data might be suitable to extract reading times in order to determine the relevance of a resource. In addition, such datasets are useful to analyze information about sequences of resources as a basis to suggest learning paths.

Manouselis et al. (2010b) used the Travel well dataset to evaluate recommendation algorithms for learning. Similar experiments have been reported in (Verbert et al., 2011). In this study, the Mendeley and MACE datasets were also used. Although still preliminary, some conclusions were drawn about successful parameterizations of collaborative filtering algorithms for learning. Outcomes suggest that implicit relevance indicators, such as downloads, tags and read actions, are useful to suggest learning resources.
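To illustrate how such relevance indicators can drive recommendations, the sketch below implements a minimal user-based collaborative filtering step over a small rating matrix: users with similar rating behavior are found and their ratings are used to score unseen resources. It shows the general technique only, not the algorithms or parameterizations evaluated in the studies cited above, and all data are invented.

    from math import sqrt

    # Invented example data: ratings of learners on learning resources (1-5).
    ratings = {
        "u1": {"r1": 5, "r2": 3, "r3": 4},
        "u2": {"r1": 4, "r2": 2, "r3": 5, "r4": 4},
        "u3": {"r2": 5, "r4": 1},
    }

    def cosine_similarity(a, b):
        """Cosine similarity between two users, with missing ratings treated as zero."""
        common = set(a) & set(b)
        if not common:
            return 0.0
        dot = sum(a[r] * b[r] for r in common)
        norm_a = sqrt(sum(v * v for v in a.values()))
        norm_b = sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    def recommend(user, k=2):
        """Score resources the user has not rated, weighted by neighbour similarity."""
        neighbours = sorted(
            ((cosine_similarity(ratings[user], ratings[other]), other)
             for other in ratings if other != user),
            reverse=True)[:k]
        scores = {}
        for sim, other in neighbours:
            for resource, value in ratings[other].items():
                if resource not in ratings[user]:
                    scores[resource] = scores.get(resource, 0.0) + sim * value
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(recommend("u1"))   # suggests "r4", the resource rated highly by the most similar user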
Increasing reflection and awareness

Several datasets are useful for the analysis and visualization of different learning indicators to foster awareness of and reflection on learning processes. In addition to indicators about the knowledge level of learners, several datasets contain indicators of the time learners spend on learning activities, such as the PSLC DataShop datasets. Other datasets contain timestamp information that can be used to derive indicators of the time users were active. dataTEL datasets were for instance used to obtain such indicators as a basis to support awareness for teachers (Govaerts et al., 2010). A visualization of these indicators applied to the ROLE dataset is illustrated in Figure 4. Evaluation results indicate that the perceived usefulness for teachers is high.

The MACE dataset has been used for research on reflection on and awareness of resource accesses. The Zeitgeist application (Schmitz et al., 2009) gives users insight into which learning resources they accessed, how they found them and which topics have been of interest to them (see Figure 5).

Figure 4. Visualization of time indicators (Govaerts et al., 2010).
Figure 5. MACE Zeitgeist (Schmitz et al., 2009).

Enhancing social learning environments

Several Mulce datasets are useful for research on collaborative learning. The datasets have been captured from chat tools, forums or email clients. Such data can be analyzed to predict and advise on learning in group work. Datasets that have been captured from LMSs often capture messages within course forums. Some of the PSLC DataShop datasets capture collaborative activities with intelligent tutoring systems, including the “Electric Fields – Pitt” dataset.

Several datasets have already been used to support research on enhancing social learning environments. Research with the “Electric Fields – Pitt” dataset suggests that asking learners to solve problems collaboratively with an intelligent tutoring system is a productive way to enhance learning from an ITS (Hausmann et al., 2008). Several Mulce datasets have been used for research on collaborative learning (Stahl, 2009). Among others, the datasets have been used to understand mathematical ideas and reasoning in chat by learners, the interaction mechanisms used by online groups to sustain knowledge building over time, and the measurement of cohesion in collaborative distance learning. Evaluation studies showed that such analyses, when embodied in visualization tools (see Figure 6), can efficiently assist the teacher in following the group collaboration (Reffay & Chanier, 2003). These analyses were used to highlight isolated people, active sub-groups and the various roles of the members in group communication. The mce-copeas dataset has been used to research the influence of synchronous communication during online collaborative writing activities (Ciekanski & Chanier, 2008). Several other studies are documented on the Mulce portal.

Figure 6. Matrix and graphical representation of e-mail exchange (Reffay & Chanier, 2003).

Detecting undesirable learner behaviors

Datasets derived from PLEs provide a rich source of data to detect unusual behavior, as these datasets record actions of learners with several tools they were using during classes. Data from LMSs can be used to detect potential dropouts when learners are no longer active. ITS datasets are also suitable for research on unusual behavior. Baker et al. (2006) found that learners who were “gaming the system” (i.e., making fast and repeated requests for help to avoid thinking) showed the largest correlation with poor learning outcomes.

Figure 7. Pattern visualization of the UC3M dataset (Scheffel et al., 2011).

Scheffel et al. (2011) used the UC3M dataset to identify key actions from observed learning behavior. The authors employed data mining techniques to extract frequent patterns of actions. These patterns were visualized to support teaching activities. For instance, the pattern illustrated in Figure 7 points to development flows in which, for each compilation, students opened a file and closed it again before compiling. According to the teaching staff, such actions translate into a significant increase in development time and should be corrected.
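As a simplified illustration of this kind of pattern extraction (not the method used by Scheffel et al.), the following sketch counts frequent short sequences of actions in per-learner event logs. The action names and data are invented for the example.

    from collections import Counter

    # Invented per-learner action sequences, ordered in time.
    sequences = {
        "u042": ["open_file", "close_file", "compile", "open_file", "close_file", "compile"],
        "u051": ["open_file", "edit", "compile", "compile"],
    }

    def frequent_ngrams(sequences, n=3, min_count=2):
        """Count action n-grams across all learners and keep the frequent ones."""
        counts = Counter()
        for actions in sequences.values():
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1
        return [(pattern, c) for pattern, c in counts.most_common() if c >= min_count]

    # ('open_file', 'close_file', 'compile') occurring twice points to the
    # open-close-before-compile flow discussed above.
    print(frequent_ngrams(sequences))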
Detecting affects of learners

Some datasets are suitable for research on the detection of affects and motivational aspects. For instance, PSLC DataShop datasets can be used to extract motivational aspects by comparing the time a learner spends on a learning activity in an ITS with the expected time or the average time of other learners. The use of emoticons and affective words is researched in social interaction datasets (Reffay et al., 2011). Prominent research on building a user model of affect from real data has been conducted by Conati and Maclaren (2005). Ongoing research with the UC3M dataset is focused on the detection of affects of learners, such as frustration and (dis-)engagement. Based on an analysis of sequences of actions, such as a sequence of error messages of a debugger and successful compilations, information is deduced about potential engagement or frustration.

Conclusion and future challenges

In this article, we have presented an overview of datasets that can be used for exploratory research on LAK. Several datasets have been identified and analyzed along the dimensions of the presented framework.
