Dynamic Cognitive Tracing: Towards Unified Discovery of Student and Cognitive Models

InProceedings

"This work describes a unified approach to two problems previously addressed separately in Intelligent Tutoring Systems: (i) Cognitive Modeling, which factorizes problem solving steps into the latent set of skills required to perform them [7]; and (ii) Student Modeling, which infers students’ learning by observing student performance [9]. The practical importance of improving understanding of how students learn is to build better intelligent tutors [8]. The expected advantages of our integrated approach include (i) more accurate prediction of a student’s future performance, and (ii) clustering items into skills automatically, without expensive manual expert knowledge annotation. We introduce a unified model, Dynamic Cognitive Tracing, to explain student learning in terms of skill mastery over time, by learning the Cognitive Model and the Student Model jointly. We formulate our approach as a graphical model, and we validate it using sixty different synthetic datasets. Dynamic Cognitive Tracing significantly outperforms single-skill Knowledge Tracing on predicting future student performance."

"1. INTRODUCTION. We propose Dynamic Cognitive Tracing as a method that estimates from performance data: 1. A Student model. The estimate of a student’s knowledge of a skill in a given time. 2. A Cognitive Model. The skills a students require to solve a problem step. Let’s illustrate the student modeling problem with an example. Suppose we are interested in modeling data from a reading tutor that listens to children read aloud. Figure 1 shows sample data in this scenario. We follow the convention of referring to the scorable steps in an intelligent tutor task as “items” [27]. The input variable is the item id t , which in this case is the word read by a student at time step t. The target variable pt is the performance of the student– in this case whether the tutor accepted the word read. The student reads the words “smile because it” correctly, but misreads the word “happened”. The student modeling problem is to predict future student performance. Existing student modeling techniques require cognitive models, assignments of items to skills [9]. This is a very expensive requirement, since it often depends on expert domain knowledge [4]. Figure 1: Reading tutor example of student modeling. For example, in our reading tutor scenario, it is not a trivial endeavor to cluster a dictionary of words into the set of skills needed to read them. Unfortunately, the success of existing methods for automatic construction of cognitive models has been limited [11]. Current methods for discovering cognitive models are restricted in that they cannot handle longitudinal data, or that they are not fully automatic. For example, Principal Component Analysis, Non-Negative Matrix Factorization [27] and the Q-Matrix Method [2] ignore the temporal dimension of the data. On the other hand, Learning Factors Analysis [7] is designed for temporal data, but it requires an expert’s cognitive model. Our main contribution is a fully automatic approach to discover a cognitive model of longitudinal student data. Our goal is discovering student models, while simultaneously clustering similar items together. The rest of this document is organized as follows. Section 2 reviews related prior work. Section 3 describes our approach, Dynamic Cognitive Tracing, to jointly learn a student model jointly with a factorization of items into skills. Section 4 evaluates performance using synthetic data. Section 5 provides some concluding remarks. 2. RELATION TO PRIOR WORK. In this section we study Dynamic Cognitive Tracing’s relation with prior work. Section 2.1 surveys previous approaches to learn student models. Section 2.2 summarizes automatic approaches for cognitive model discovery. 2.1 Student Modeling. Corbett and Anderson [9]’s seminal paper introduced Knowledge Tracing as a way to model students’ changing knowledge during skill acquisition. It uses (a) a cognitive model that maps a problem solving item to the skills required, and (b) logs of students’ correct and incorrect answers as evidence of their knowledge on a particular skill. Reye [22] showed that there is an equivalent formulation of Knowledge Tracing as a Bayesian Network. Knowledge Tracing has enabled significantly faster teaching by Intelligent Tutors, while achieving the same performance on evaluations [8]. Knowledge Tracing, as well as Dynamic Cognitive Tracing, are non-convex problems. This means that the optimizer that estimates the parameters of the models might get stuck in local optima far away from the global optimum. 
Moreover, these formulations are also non-identifiable: there may exist many student models that explain the observed data equally well. In Knowledge Tracing, the main source of non-identifiability is the trade-off between the probability of a student's initial knowledge and the probability of learning the skill [5]. To mitigate non-identifiability, recent work has proposed using Bayesian priors [5] or contextual clues to estimate whether a student has guessed [1].

Other approaches to student modeling include Performance Factor Analysis [19, 14], which predicts student performance based on item difficulty and the student's historical performance. Alternatively, Learning Decomposition [6] uses non-linear regression to determine how to weight different types of practice opportunities relative to each other. More recently, Tensor Factorization [25] has been applied to the student modeling problem, using recommender system techniques to learn student models. None of these techniques aim to discover cognitive models. Thai-Nghe et al. [25] make use of latent variables, but they argue that it is not possible to interpret their semantics. Their formulation is tied to specific students, and it is not clear how to generalize their approach to students unseen in the training set, or to students who encounter only a very sparse set of items. We designed Dynamic Cognitive Tracing to discover latent factors that have the interpretation of Cognitive and Student Models.

Desmarais [11] argues that constructing a cognitive model from data is highly desirable, not only to avoid the labor-intensive task of specifying which skills are involved in which task, but also because a data-driven approach might outperform human judgment. In the next subsection we review such approaches.

2.2 Automatic Discovery of Cognitive Models.

Winters et al. [27] surveyed methods for automatic construction of cognitive models, including matrix factorization techniques such as Principal Component Analysis (PCA) and Non-Negative Matrix Factorization (NNMF). The theoretical relationships between different matrix factorization techniques have been studied in detail [24]. The Q-matrix algorithm [2, 3] is a hill-climbing method that creates a cognitive model linking skills and items directly from student response data. An alternative approach, Learning Factors Analysis [7], performs combinatorial search to evaluate and improve on existing cognitive models. None of the techniques reviewed in this section take into account the temporal dimension of the data without human intervention. To the extent of our knowledge, we are the first to estimate a cognitive model completely automatically from data collected over time.

3. DYNAMIC COGNITIVE TRACING.

We now describe Dynamic Cognitive Tracing. Subsection 3.1 details our approach. Subsection 3.2 provides pointers on the training and inference algorithms used. Subsection 3.3 shows how Dynamic Cognitive Tracing relates two common techniques used in student modeling and in automatic generation of cognitive models.

3.1 Model.

We formulate Dynamic Cognitive Tracing as a Bayesian Network. Bayesian Networks [20] are a popular framework for reasoning with noisy information. They are directed acyclic graphical models in which the nodes are variables and the edges specify statistical dependencies between variables. Bayesian Networks are often described using plate diagram notation to show the statistical relationships between their random variables.
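Before walking through the plate diagram, the unrolled dependency structure can be previewed in code. The sketch below builds a node-to-parents map for S = 2 skills and T = 3 time steps, matching the variable definitions given next; it is an illustrative encoding, not the representation our implementation uses.

# Unrolled DAG of Dynamic Cognitive Tracing as a node -> parents map.
# id_t: observed item; q_t: its skill; K_{s,t}: latent knowledge of skill s;
# k_{s,t}: deterministic "skill s is known and required"; p_t: performance.
S, T = 2, 3
parents = {}
for t in range(T):
    parents[f"q_{t}"] = [f"id_{t}"]  # skill drawn from row id_t of Q
    for s in range(S):
        parents[f"K_{s},{t}"] = [f"K_{s},{t-1}", f"q_{t-1}"] if t > 0 else []
        parents[f"k_{s},{t}"] = [f"K_{s},{t}", f"q_{t}"]  # deterministic AND
    parents[f"p_{t}"] = [f"k_{s},{t}" for s in range(S)]  # emission
for node in sorted(parents):
    print(node, "<-", parents[node])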
The plate diagram of Dynamic Cognitive Tracing is shown in Figure 2(a). Instead of drawing a variable multiple times, we follow the convention of using a plate to group repeated variables. As an example, we unroll Dynamic Cognitive Tracing using two skills in Figure 2(b). The generative story of the variables is described in Figure 3. We follow the convention of using dark gray to color variables that are observable during both training and testing. Variables visible during training only are colored in light gray. Latent variables, which are never observed, are denoted by white circles. A double line around a variable indicates that its value is computed deterministically from its parents. The variables in Dynamic Cognitive Tracing are:

• S is the number of skills in the model.
• Ids is the number of items that the student can practice with the tutor. For example, in the case of a reading tutor, Ids is the vocabulary size. If the tutor is creating items on the fly, Ids is the number of templates from which items are generated.
• Q is an Ids × S matrix that maps items to skills. Each row Q_id is modeled as a multinomial representing the skills required for item id. For example, if Q_{id_t} = [0.5, 0.5, 0, 0], we interpret item id_t to be a mixture of skills 1 and 2; in this example id_t does not require skills 3 and 4. Q need not be hidden: if Q is in fact known, we can clamp the parameters to their known values.
• q_t is the skill for item id_t. For example, q_t = 1 iff skill 1 is required for item id_t, q_t = 2 iff skill 2 is required, and so on. q_t is chosen according to row id_t of Q.
• K_{s,t} indicates whether the student knows skill s at time t. Notice that there is a Markovian dependency across time steps: if skill s is known at time t − 1, it is likely to be known at time t. Therefore, we also need to know which skills were active at the previous time step (i.e., K_{s,t} depends on q_{t−1}). For simplicity, in this work we treat each K as a binary variable (whether the skill is known or not).
• k_{s,t} is a binary variable that represents whether skill s is both known and required by item id_t. Its value is computed deterministically by applying a dot product to its parents: k_{s,t} is true iff skill s is required (q_t = s) and the student has learned the skill (K_{s,t} = 1).
• p_t is the target variable that models performance. It is only observed during training.
  – For discrete grades (i.e., right or wrong), a Binomial distribution or logistic regression can be used. The use of logistic regression in Bayesian Networks has been studied in the context of mixtures of experts [16], and more recently for the multiple-subskill problem in student modeling [28]. In this paper we use the Binomial approach.
  – For continuous grades (e.g., 0 to 100), linear regression can be used.

Our main contribution is unsupervised estimation of the cognitive model Q from longitudinal data, while simultaneously estimating the student model parameters. In the next subsection we discuss how to learn the parameters of Dynamic Cognitive Tracing, and how to perform inference on it.

Figure 2: Dynamic Cognitive Tracing as a graphical model.

3.2 Training and Inference.

Dynamic Cognitive Tracing is formulated as a directed graphical model (Bayesian Network). We leverage existing technologies to quickly implement a prototype of Dynamic Cognitive Tracing: we used the Bayesian Network Toolkit (BNT) for Matlab [18].
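As a concrete illustration of the deterministic gating and Binomial emission just described, consider the following sketch. It assumes one active skill per item and per-skill guess and slip parameters; the names and values are illustrative.

# One time step of the emission model (illustrative sketch).
def emission_prob(K, q_t, p_guess, p_slip):
    # k_{s,t} is true iff skill s is required (q_t == s) and known (K[s] == 1).
    k = [int(q_t == s and K[s] == 1) for s in range(len(K))]
    # P(correct) is 1 - slip if the required skill is known, else guess.
    return 1.0 - p_slip[q_t] if any(k) else p_guess[q_t]

# A student who knows skill 0 but not skill 1:
print(emission_prob(K=[1, 0], q_t=0, p_guess=[0.2, 0.2], p_slip=[0.1, 0.1]))  # 0.9
print(emission_prob(K=[1, 0], q_t=1, p_guess=[0.2, 0.2], p_slip=[0.1, 0.1]))  # 0.2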
As described in the previous subsection, the knowledge of a skill depends on its value at the previous time step. This kind of dependency is called a Markov chain. Therefore, in Dynamic Cognitive Tracing, the student's knowledge of S skills is modeled using S layers of Markov chains. Unfortunately, this is not scalable, because exact inference in layers of Markov chains that produce a single output is intractable: the runtime complexity grows exponentially with the number of layers [12] (with S binary skills, the joint knowledge state can take 2^S values at each time step). Hence, we limit our study to a small number of skills. In future work we will implement inference techniques that scale better, such as Gibbs sampling.

Figure 3: Generative story of Dynamic Cognitive Tracing.

The name Bayesian Network is a misnomer, in that it does not require Bayesian estimation; in fact, we use Maximum Likelihood Estimation and perform exact inference. BNT implements the Junction Tree algorithm [15], an inference algorithm that generalizes the Forward-Backward algorithm used in Knowledge Tracing and Hidden Markov Models [21]. To estimate the parameters of the model, we use the Expectation-Maximization (E-M) algorithm [10]. Like all non-convex optimizers, E-M is not guaranteed to find the globally optimal solution.

3.3 Unifying Perspective.

We now discuss how Dynamic Cognitive Tracing generalizes two common techniques for cognitive and student modeling.

Figure 4: Two-skill models with one time step.
Figure 5: Unrolled graphical model representation of one-skill student models.

Cognitive models have been built with matrix factorization techniques [27]. Probabilistic Principal Component Analysis (PPCA) [26] is one such technique: a formulation of the Principal Component Analysis algorithm using graphical models. The main advantages of this approach over conventional PCA are that it can handle missing data and that it provides a probabilistic interpretation of the underlying factors. In Figure 4(a) we show the graphical model representation of PPCA when explicitly formulated to handle missing data. If the variable p is continuous, it is modeled with a Gaussian; if it is discrete, it is modeled with a Binomial, using a logistic link function. Discrete PCA is also known in the literature as Logistic PCA [23]. Figure 4(b) shows the simplified Dynamic Cognitive Tracing with two skills when there is no temporal information available. The structure of both graphical models is very similar: in both cases, performance is explained by latent variables that represent the skills. The main difference is that Dynamic Cognitive Tracing takes into account the knowledge of the skill estimated from the student model: performance is explained by the latent knowledge of the skills. We hypothesize that the advantage of our approach lies in the fact that it is not limited to a single time step, as PPCA is. We expect item-performance data to be very noisy, and temporal information to be useful for modeling skill acquisition.

Figure 5(a) shows the graphical model representation of Knowledge Tracing with a single-skill model, which is just a Hidden Markov Model. Figure 5(b) shows the unrolled single-skill Dynamic Cognitive Tracing (S = 1) counterpart. In this case the structure of Dynamic Cognitive Tracing is equivalent to Knowledge Tracing.

4. EMPIRICAL EVALUATION.

In this section, we report results of using Dynamic Cognitive Tracing to predict future student performance on synthetically generated datasets.
In the context of this paper, we decouple the problem of discovering the assignments of items to skills from the problem of discovering the number of skills. For our experiments, we assume the number of skills is known. In a real scenario, where the number of skills is unknown, it could be estimated by cross-validation on a held-out set. We report our results using Dynamic Cognitive Tracing with the true number of skills.

Dynamic Cognitive Tracing aims to discover the skills automatically, without supervision. We test whether the cognitive model estimated by Dynamic Cognitive Tracing outperforms a cognitive model that assigns all of the items to a single skill. Therefore, as a baseline, we compare against Knowledge Tracing using a single skill. In all comparisons between Knowledge Tracing and Dynamic Cognitive Tracing, the parameters of both models are estimated on the same training set. The training and test sets share no students.

4.1 Experimental setup.

In this section, we describe the criteria for generating the synthetic datasets and the evaluation metrics. To generate the synthetic datasets, we use the generative story described in Figure 3, having each student encounter 25 items during training (sequence length = 25). In preliminary experiments, we noticed that by the 25th time step, most synthetic students had learned the skills. To obtain a more balanced test set with roughly the same number of correct and incorrect answers, the sequence length of the test set is sampled randomly.

We want the synthetic data to be plausible; for example, the probability of answering an item correctly by guessing should be lower than the probability of answering it correctly due to knowledge. Therefore, the synthetic datasets follow these constraints:
• The learning probability, the probability of transitioning from not knowing a skill to knowing it, lies in [0.01, 0.45].
• The guess probability, the probability of answering correctly given that the student does not know the skill, lies in [0.01, 0.30].
• The slip probability, the probability of answering incorrectly given that the student knows the skill, lies in [0.01, 0.30].
These constraints are only used to generate the data; none of our models makes use of this prior knowledge. For simplicity, in this paper we limit our study to cognitive models that have only one skill active per item, but Dynamic Cognitive Tracing does not make use of this information. We constrain the models not to learn the "forget" probability (i.e., the transition probability from "knowing" to "not knowing" is fixed at zero).

Knowledge Tracing can sometimes produce bad parameter estimates. Beck and Chang [5] argued that when Knowledge Tracing performs badly, it is often because of incorrect estimation of the students' initial knowledge (initial probabilities). We want to make sure that our results are better than Knowledge Tracing's because of the strengths of Dynamic Cognitive Tracing, not because Knowledge Tracing got stuck in an "unlucky" local optimum. Therefore, in our experiments we constrain all of the students to have no initial knowledge.

E-M is used to learn the parameters of the models. Knowledge Tracing and Dynamic Cognitive Tracing are initialized with random parameters; however, the emission probabilities (slip and guess probabilities) of Dynamic Cognitive Tracing are initialized using a single-skill model. We run E-M with five different random initializations.
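The following sketch illustrates this generation process, assuming a hard item-to-skill assignment, no forgetting, and no initial knowledge, as in our setup; the details of the paper's actual sampler may differ.

import random

# Sample one synthetic student following the generative story of Figure 3.
def sample_student(item_to_skill, p_learn, p_guess, p_slip, seq_len=25):
    S = len(p_learn)
    known = [False] * S              # no initial knowledge
    items, answers = [], []
    for _ in range(seq_len):
        item = random.randrange(len(item_to_skill))
        skill = item_to_skill[item]
        p_correct = (1.0 - p_slip[skill]) if known[skill] else p_guess[skill]
        items.append(item)
        answers.append(random.random() < p_correct)
        # Learning transition for the practiced skill; forgetting is disabled.
        if not known[skill] and random.random() < p_learn[skill]:
            known[skill] = True
    return items, answers

# Four item types mapped to two skills (items 0,1 -> skill 0; 2,3 -> skill 1).
items, answers = sample_student([0, 0, 1, 1], p_learn=[0.2, 0.1],
                                p_guess=[0.2, 0.25], p_slip=[0.1, 0.05])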
Unless noted otherwise, each dataset is divided into three parts: (i) a training set with 200 students, (ii) a development set with 50 students, used to choose the best of five random initializations of the E-M algorithm, and (iii) a test set with 50 students. Students do not overlap among the sets. We report the performance of our models using two metrics:

• Average Per-item Likelihood. Likelihood is a common metric to evaluate models that find latent structure [12]. It measures how likely a model is to predict the test set, penalizing high-confidence incorrect predictions more heavily. More formally, let I be the number of students in the test set, let \hat{p}_{i,t} be the estimated performance of student i at time t, let p_{i,t} be the real performance of the student, and let T_i be the number of time steps for student i. Then we compute the per-item likelihood as:

\[ \text{Likelihood} = \frac{\sum_{i=1}^{I} \sum_{t=1}^{T_i} \left[ \hat{p}_{i,t}\, p_{i,t} + (1 - \hat{p}_{i,t})(1 - p_{i,t}) \right]}{\sum_{i=1}^{I} T_i} \]

• Classification Accuracy. Classification accuracy measures how often the predicted performance matches the actual performance. Formally, let δ(·) be the indicator function that returns 1 iff its argument is true, and 0 otherwise. We compute the accuracy as:

\[ \text{Accuracy} = \frac{\sum_{i=1}^{I} \sum_{t=1}^{T_i} \delta\!\left( \operatorname{round}(\hat{p}_{i,t}) = p_{i,t} \right)}{\sum_{i=1}^{I} T_i} \]

In the next section, we report all of the parameter combinations we experimented with. We did not perform any additional tuning beyond what is reported there.

4.2 Results.

We create a total of 60 random synthetic datasets using the constraints explained in Section 4.1. All of them have four types of items (Ids = 4). We created twenty datasets each with 2, 3, and 4 skills (S = 2, 3, 4). In Figure 6, the horizontal axis denotes the Likelihood of single-skill Knowledge Tracing, and the vertical axis the Likelihood of Dynamic Cognitive Tracing. The solid line divides the datasets into those in which Dynamic Cognitive Tracing performed better than Knowledge Tracing (upper left corner) and those in which it performed worse (lower right corner). The dotted lines represent the confidence interval for the mean Likelihood of Knowledge Tracing. Dynamic Cognitive Tracing performs as well as or better than the baseline in 52 (87%) of the datasets.

Figure 6: Average Likelihood of Dynamic Cognitive Tracing and single-skill Knowledge Tracing on 60 different datasets.
Table 1: Dynamic Cognitive Tracing's worst performing dataset (highlighted in Figure 6).

Is estimating a cognitive model with Dynamic Cognitive Tracing better than assuming a single-skill model? We compare the mean Likelihood of Dynamic Cognitive Tracing (x̄_DCT = 62.34, s_DCT = 5.13) with the mean Likelihood of single-skill Knowledge Tracing (x̄_KT = 59.97, s_KT = 5.18). The null hypothesis is that the mean Likelihood of both models is the same (H_0: μ_DCT = μ_KT). We perform a two-tailed t-test, pairing on the datasets (n = 60). We reject the null hypothesis H_0 at p < 0.05, and conclude that Dynamic Cognitive Tracing outperforms Knowledge Tracing under the single-skill assumption.

In Figure 6 the arrow points to the dataset that performs worst relative to the single-skill Knowledge Tracing baseline. The Likelihood of the true model is 65%, of Dynamic Cognitive Tracing 57%, and of single-skill Knowledge Tracing 61%. We now investigate why Knowledge Tracing outperforms Dynamic Cognitive Tracing on this specific dataset. Table 1 shows the parameters of the student model. We notice that both skills' learning and slip probabilities are very similar.
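For reference, both evaluation metrics defined in Section 4.1 can be computed with a few lines of code. This sketch assumes preds[i][t] holds the model's estimated probability of a correct answer and truth[i][t] the observed 0/1 outcome; the names are illustrative.

def per_item_likelihood(preds, truth):
    # Average probability the model assigns to the observed outcomes.
    total = sum(len(ys) for ys in truth)
    s = sum(p if y == 1 else 1.0 - p
            for ps, ys in zip(preds, truth) for p, y in zip(ps, ys))
    return s / total

def classification_accuracy(preds, truth):
    # Fraction of items where the thresholded prediction matches the outcome.
    total = sum(len(ys) for ys in truth)
    hits = sum(int((p >= 0.5) == (y == 1))
               for ps, ys in zip(preds, truth) for p, y in zip(ps, ys))
    return hits / total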
We run the E-M algorithm with 100 different random initializations for both Dynamic Cognitive Tracing and Knowledge Tracing. We use the same training set used for the highlighted dataset of Figure 6. To ensure more reliable results, we use a larger test set of 200 students (instead of 50). Figure 7 shows the Cumulative Distribution Function of the Likelihood over the 100 random initializations: for a given Likelihood on the horizontal axis, the vertical axis is the percentage of initializations whose Likelihood is less than or equal to that value. Figure 7 shows that the Likelihood of the true model is 62.6%. The best Likelihood of Dynamic Cognitive Tracing is 61.1%, and of single-skill Knowledge Tracing 59.7%. Knowledge Tracing gets stuck in local optima in fewer than 5% of the restarts. On the other hand, for this dataset, Dynamic Cognitive Tracing gets stuck in local optima 99% of the time. While there is a Dynamic Cognitive Tracing solution that outperforms Knowledge Tracing, the E-M algorithm found it in only 4% of the initializations.

Figure 7: Cumulative Distribution Function of the Likelihood over 100 restarts (using the dataset highlighted in Figure 6).

In Table 2, we aggregate the results of Figure 6. We report the mean performance of the parameters that generated the 60 synthetic datasets (True Model), of Dynamic Cognitive Tracing, of single-skill Knowledge Tracing (KT), and of the classifier that always predicts the majority class (Majority). We present the mean Classification Accuracy and the mean Likelihood. Dynamic Cognitive Tracing has a Likelihood and Classification Accuracy similar to the True Model's, and dominates Knowledge Tracing.

Table 2: Model Comparison Over Number of Skills.

Let us study a sample cognitive model estimated using Dynamic Cognitive Tracing. Here Q∗ is the True Model's cognitive model from which the synthetic data was generated. An estimate Q, learned from data using our approach, is:

FORMULA_3

The estimated cognitive model has some uncertainty, but if we round Q to integer values, it matches Q∗. In future work, we are interested in using Bayesian priors to encourage sparse entries in Q [13]. Bayesian estimation is not currently supported by the BNT toolkit in which we implemented our model.

In Figure 8 we show how long it took to perform a single restart of Dynamic Cognitive Tracing and of Knowledge Tracing. Although Dynamic Cognitive Tracing achieves better accuracy, its exact-inference implementation does not scale well with the number of skills.

Figure 8: Time required (in minutes) to train a single restart.

We now simulate the effect of different amounts of training data, experimenting with 50, 100, 200 and 400 students. We observed that in the PSLC DataShop [17], a repository of student datasets, it is common for smaller datasets to have data from at least 50 students. We assess the performance of our approach using ten synthetic training sets for each number of students. For all experiments here, we used four different types of items (Ids = 4) and two skills (S = 2). In Figure 9, the "True model" line represents the classification accuracy of the model using the parameters from which the synthetic data was generated. The Knowledge Tracing line shows the performance of that approach using a single skill. The results suggest that the approaches compared can achieve good performance even on smaller datasets.

Figure 9: Classification accuracy using different training set sizes.
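Relatedly, the rounding of the learned Q used above to compare it with Q∗ amounts to a row-wise argmax. In this sketch the matrix entries are purely illustrative, not the estimates from our experiments.

# Harden a learned item-to-skill mixture matrix into a 0/1 cognitive model.
def harden(Q):
    return [[1 if j == row.index(max(row)) else 0 for j in range(len(row))]
            for row in Q]

Q_hat = [[0.93, 0.07], [0.88, 0.12], [0.04, 0.96], [0.09, 0.91]]  # illustrative
print(harden(Q_hat))  # [[1, 0], [1, 0], [0, 1], [0, 1]]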
Since we are actually clustering similar items into skills, the number of distinct items (Ids) may have an impact on the performance of our approach. We create ten datasets each with 4, 8 and 16 item types (Ids = 4, 8, 16); all of them have two skills (S = 2). In Table 3, we summarize the Likelihood and the Classification Accuracy of the different models. The true model's parameters achieve the highest Likelihood, followed by our approach, which dominates Knowledge Tracing.

Table 3: Model Comparison Over Number of Items.

5. CONCLUSION.

We propose Dynamic Cognitive Tracing as a novel unified approach to two problems previously addressed separately in Intelligent Tutoring Systems: (i) Student Modeling, which infers students' learning by observing student performance [9], and (ii) Cognitive Modeling, which factorizes problem solving steps into the latent set of skills required to perform them [7]. We provide empirical results on synthetic data supporting that our unsupervised approach is better than assuming that all items come from the same skill: Dynamic Cognitive Tracing significantly outperforms Knowledge Tracing under a single-skill assumption.

We used the Bayesian Network Toolkit to quickly prototype our approach. However, our prototype is limited in that (i) the inference algorithm used by the toolkit has complexity exponential in the number of skills, and (ii) the optimization algorithm gets stuck in local optima. We recommend implementing Dynamic Cognitive Tracing with approximate inference as future work.

For simplicity, in this paper we limited our study to synthetic data of items that require a single skill. However, our formulation is capable of discovering items that require multiple skills; how well Dynamic Cognitive Tracing performs in that setting is an empirical question we leave for future work. We are also interested in comparing Dynamic Cognitive Tracing to other automatic methods that produce cognitive models from data, such as matrix factorization techniques [27]. An interesting alternative we leave unexplored is finding a cognitive model by first clustering items into skills, and then using Knowledge Tracing with the discovered cognitive model. However, it is not clear how to learn the skill clustering from data that arrives at different points in time; for example, it is not obvious how PCA could be applied to temporal data. To our knowledge, we are the first to propose a fully unsupervised method that combines student modeling with discovery of a cognitive model.

Acknowledgements.

This work was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080628 to Carnegie Mellon University. José was partially supported by the Costa Rican Ministry of Science and Technology (MICIT). The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education. We thank the educators, students, and LISTENers who helped generate, collect, and analyze our data, and the reviewers for their helpful comments.
