Desperately Seeking Subscripts: Towards Automated Model Parameterization

InProceedings

This paper addresses the laborious task of specifying parameters within a given model of student learning. For example, should the model treat the probability of forgetting a skill as a theory-determined constant? As a single empirical parameter to fit to data? As a separate parameter for each student, or for each skill? We propose a generic framework to represent and mechanize this decision process as a heuristic search through a space of alternative parameterizations. Even partial automation of this search could ease researchers’ burden of developing models by hand. To test the framework’s generality, we apply it to two modeling formalisms – a dynamic Bayes net and learning decomposition – and compare how well they model the growth of children’s oral reading fluency.

"1. INTRODUCTION. This paper addresses the problem of defining parameters, more precisely how specific to make them. For example, the parameters in a knowledge tracing model are the probabilities of already knowing a skill, learning it from a practice opportunity, guessing an answer without knowing the skill, or answering incorrectly despite knowing the skill. But how specifically should these parameters be defined? Should we use a different parameter for every skill? For every student? For every pair? The last option would generate too many parameters to fit from the available data. Corbett et al. [Corbett and Anderson, 1995] decided to make the knowledge parameters (probabilities of knowing already or learning) skill-specific, and the performance parameters (probabilities of guessing or slipping) student-specific. They judged that the knowledge probabilities vary more by skill than by student, whereas the performance probabilities vary more by student than by skill. Such decisions – how specific to make a given parameter in order to predict unseen data – are the focus of this paper. This subtle but crucial modeling decision is typically made by hand, often by trial and error. The researcher explores various alternatives, trading off theoretical plausibility, computational tractability, model fit, statistical reliability, interpretability, and informativeness with respect to the research questions of interest. This problem falls in the domain of model selection but differs from prior work on selecting structure [e.g., Madigan and Raftery, 1994] or variables [e.g., Negrin et al., 2010] in that we focus on selecting a specific parameterization of the given variables. We propose a generic framework to represent and mechanize this process. To test its generality, we apply it to two types of student learning models (dynamic Bayes nets and learning decomposition), which we train and test on children’s oral reading fluency data. 2. A HEURISTIC SEARCH SPACE OF MODEL PARAMETERIZATIONS. The title of this paper refers to model development as a search for subscripts because subscripts indicate the specificity of the parameters they index. To formalize this search space, we represent each state in the space as a vector with an element for each parameter in the model. For example, consider a dynamic Bayes net model of Knowledge Tracing (KT), with probabilities for guess, slip, forget, learn, and already know. We represent a parameterization of this model as a vector of 5 elements, each of which specifies how the corresponding parameter is subscripted. For readability, we write the value of each element as a phrase describing how the parameter is indexed, e.g. ‘by student’, ‘by skill’, ‘by student level’. Formally, we define a parameterization of a model with m parameters p1, p2, …, pm as a vector of m split functions (F1, F2,…, Fm), each of which specifies how to index the corresponding parameter over a set of size N, which we call the size of the split. For example, to fit the guess, slip, and learn parameters of a KT model separately for each student, we use the ‘by student’ function to split them into separate parameters guessj, slipj, and learnj for each student j, so its size is the number of students. Likewise, to fit the already know parameter separately to the data for each skill, we use the ‘by skill’ function to split it into separate already knowi parameters for each skill i, so its size is the number of skills. 
To set a parameter to a single value for all of the data, we use a function named 'by NULL' to leave the parameter as is, with no subscripts or splits. We may estimate its empirical value by fitting the data, or supply a theoretical constant. For example, for a KT model, we apply the 'by NULL' function to the forget parameter and set its value to zero based on the theoretical assumption of no forgetting.

We define the size of a parameterization as the summed sizes of its split functions. Intuitively, this quantity is simply the total number of subscripted parameters. The example parameterization above indexes three parameters by student and one by skill, so its size is 3 * #students + 1 * #skills.

Given m parameters p_1, p_2, ..., p_m and a set F of split functions, the cross product F^m generates a search space of |F|^m possible model parameterizations to consider. One simple but inefficient search strategy is brute force, searching for the best model over all expressible splits. Alternatively, one heuristic strategy is to search the space of parameterizations in order of increasing size, fitting the resulting parameterized model to the data, computing some measure of its (complexity-penalized) model fit, and halting when we reach a local maximum. Note that the size of the parameterization is only a crude measure of model complexity.

3. TWO DIFFERENT MODEL FORMALISMS

Dynamic Bayes nets (DBNs) provide a powerful way to infer a student's changing knowledge over time from observed student behavior. We extended a previous DBN model of children's fluency growth [Beck et al., 2008] by adding an observable "Distributed Practice" node whose value is 1 for the student's first encounter of the day for a given word and 0 otherwise. The resulting model (shown in Fig. 1) has 17 parameters, too many to list here. For example, the parameter "learn | distributed practice, help" models the probability P(K_n = true | K_{n-1} = false, D_{n-1} = true, H_{n-1} = true). We used BNT-SM [Chang et al., 2006] to express different parameterizations of the model and fit them to data.

Fig. 1. Architecture of a Bayes Net Model of Children's Growth in Oral Reading Fluency.

Learning decomposition (LD) estimates the relative impact on performance of different types of practice, such as wide vs. repeated reading and distributed vs. massed practice [Beck, 2006]. Using this approach, we developed the following model to predict a child's latency prior to reading a word aloud in text:

latency = E + L * word_length + A * e^(-b * (h*m*HM + h*HD + m*NHM + NHD))

Here E represents minimum latency, L scales latency as a linear function of word length, A reflects the latency at the first encounter of a word, and b represents the learning rate. The coefficient h represents the impact of a tutor-assisted encounter relative to an unassisted encounter. The coefficient m represents the impact of a massed encounter (i.e. of a word seen earlier that day) relative to a distributed encounter (i.e. of a word seen for the first time that day). The variable HM counts the number of assisted, massed encounters; HD counts assisted, distributed encounters; NHM counts unassisted, massed encounters; and NHD counts unassisted, distributed encounters. To fit different parameterizations of this model to data, we used MATLAB's non-linear regression function (Ver. 7.6.0.324).
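The paper fits this model with MATLAB's non-linear regression; purely as an illustration, the sketch below does an analogous fit in Python with scipy.optimize.curve_fit on synthetic data. The functional form in ld_latency is our reading of the description above (a minimum latency, a word-length term, and an exponentially decaying practice term whose exponent weights each encounter type by its relative impact), and all variable names and numbers are made up.

import numpy as np
from scipy.optimize import curve_fit

def ld_latency(X, E, L, A, b, h, m):
    # Learning-decomposition latency model (our reconstruction of the form above).
    # X rows: word_length and the HM, HD, NHM, NHD encounter counts.
    word_len, HM, HD, NHM, NHD = X
    practice = h * m * HM + h * HD + m * NHM + NHD   # weighted prior practice
    return E + L * word_len + A * np.exp(-b * practice)

# Synthetic data: rows are word_length, HM, HD, NHM, NHD; y is latency in seconds.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([
    rng.integers(2, 10, n),   # word length
    rng.integers(0, 3, n),    # assisted, massed encounters
    rng.integers(0, 3, n),    # assisted, distributed encounters
    rng.integers(0, 3, n),    # unassisted, massed encounters
    rng.integers(0, 5, n),    # unassisted, distributed encounters
]).astype(float)
y = ld_latency(X, 0.2, 0.05, 1.0, 0.3, 1.5, 0.5) + rng.normal(0, 0.05, n)

# Fit one set of coefficients; splitting 'by student' etc. would repeat this fit
# on each subset of the data and pool the resulting parameters.
p0 = [0.1, 0.05, 1.0, 0.1, 1.0, 1.0]
params, _ = curve_fit(ld_latency, X, y, p0=p0, maxfev=10000)
print(dict(zip(['E', 'L', 'A', 'b', 'h', 'm'], params.round(3))))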
4. EVALUATION

4.1 Data

The oral reading fluency data for this paper comes from a random sample of 40 children, stratified by gender and reading level, from the students who used Project LISTEN's Reading Tutor [Mostow and Aist, 2001] during the 2005-2006 school year, with a median usage of 5.7 hours. In total they attempted to read 5,078 distinct word types ranging in difficulty from grade 1 to grade 11. The data includes each student's unique user id, gender, reading level (from grade K to 6), and performance on each word encounter, which we define as fluent if the Reading Tutor accepted it as read correctly without help or hesitation.

To partition the data into training and test sets, we ordered the distinct word types encountered by each student by the number of encounters. We assigned all the student's encounters of odd-numbered word types to the training set, and all encounters of even-numbered word types to the test set, so as to be able to train and test models on all of a student's encounters of a given word.

Given the information in the data set, one set of possible splits is {'by student', 'by student level', 'by gender', 'by word', 'by word level', 'by student and word level', 'by student level and word', 'by student level and word level', 'by gender and word', 'by gender and word level'}. We omitted the split 'by student and word' because its student-word pairs did not overlap between the training and test sets.

4.2 Results

Table 1 compares different parameterizations of the DBN and LD models, ordered by size. The DBN models treat fluency as a binary variable, so we show the percentage accuracy of their predictions, both overall and within-class; the test data is unbalanced, with 72% of it in the positive (fluent) class. The LD models predict real-valued latencies, so we use Root Mean Squared Error (RMSE) to measure their accuracy. Since the models make different types of predictions, their accuracies are not comparable.

Given the maximized value L of the likelihood function for the estimated model, the number k of parameters, and the number n of data points in the training set, we compute AIC (Akaike Information Criterion) as 2k - 2 ln L and BIC (Bayesian Information Criterion) as k ln n - 2 ln L. DBN and LD models use different likelihood functions. The likelihood function for DBN models is a probability, so their AIC and BIC scores are positive. In contrast, the likelihood function for a linear regression is a product of Gaussian probability density functions, so AIC and BIC scores for LD models can be positive or negative.

Table 1. Accuracy and complexity of DBN and LD models on unseen test data for children's oral reading fluency. The best value(s) in each column are underlined.

Which models are best? None of the DBN models substantially beats the majority class accuracy of 72%. The five simplest models have almost perfect recall (accuracy on positive examples) but very low accuracy on negative examples. Note that AIC and BIC do not vary smoothly with the size of the parameterization. For example, splitting by student level has size 136 and gives the worst AIC and BIC scores, while splitting by word level has size 170 but yields the best BIC score and a near-best AIC score.

The 'by student and word level' LD model has the lowest AIC and BIC scores. This fact suggests that students at the same estimated student level differ enough to model individually, possibly due to inaccurate estimates of student level. In contrast, word level apparently captures adequate information about word difficulty.
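To make the AIC and BIC comparison concrete, here is a small sketch (ours, with toy numbers) using the standard formulas AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. It shows why a Bernoulli likelihood (DBN-style binary predictions) always yields positive scores, while a Gaussian likelihood (LD-style latency predictions) can yield negative ones when residuals are small.

import numpy as np

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * np.log(n) - 2 * log_lik

# DBN-style: the likelihood is a probability, so the log-likelihood is <= 0
# and AIC/BIC are positive.
p = np.array([0.9, 0.8, 0.7, 0.95])           # predicted P(fluent)
y = np.array([1, 1, 0, 1])                    # observed fluency
ll_dbn = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(aic(ll_dbn, k=5), bic(ll_dbn, k=5, n=len(y)))

# LD-style: the likelihood is a product of Gaussian densities; if the residual
# standard deviation is small (below about 0.4), each density exceeds 1, the
# log-likelihood is positive, and AIC/BIC can go negative.
resid = np.array([0.05, -0.02, 0.03, -0.04])  # latency residuals in seconds
sigma2 = np.mean(resid ** 2)                  # maximum-likelihood residual variance
n = len(resid)
ll_ld = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
print(aic(ll_ld, k=6), bic(ll_ld, k=6, n=n))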
This model also achieves the best accuracy on unseen test data (RMSE = 0.18 sec). However, the second best accuracy is achieved by the 'by word' model, which has some of the worst AIC and BIC scores even though its size is not enormously larger (1518 vs. 1512). This disparity implies that AIC and BIC can be poor predictors of performance on unseen data. One problem we faced was that when a split produced a very small subset of data (e.g. fewer than 4 data points), we failed to fit the LD model. We excluded these subsets, so the size of the parameterization is smaller than it should be in some of the models.

Although the DBN and LD models have different formalisms and outputs, and so are not directly comparable, we can still compare their performance profiles over the same space of parameterizations. In particular, is the same parameterization best for both models? No. For the LD models, the 'by student and word level' parameterization achieves by far the best AIC and BIC scores. For the DBN models, this parameterization achieves close to the best AIC score, which belongs to the 'by student level and word level' model, but so do the 'by word level' and 'by gender and word level' models. Moreover, its BIC score is mediocre.

5. CONCLUSION

This paper defines the problem of parameterization selection and formalizes it in terms of a space of parameterizations induced by split functions. It proposes a simple strategy to search this space in order of size, hill-climbing on complexity-penalized model fit. We implemented a prototype of this strategy, restricted to using the same split function for every parameter to accommodate a limitation of BNT-SM. We demonstrated its generality by applying it to both DBN and LD models and evaluating the resulting parameterizations on the same data set. Future work includes expanding the search space to relax this restriction, and devising search heuristics that go beyond size and complexity-penalized model fit to address the additional criteria discussed in the Introduction. This work will succeed if it helps clarify, accelerate, or automate the discovery of good models.
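The search strategy recapped in the conclusion (enumerate parameterizations in order of increasing size and hill-climb on complexity-penalized fit) can be sketched as follows. This is an illustration under assumed interfaces, not the authors' prototype; fit_and_score stands in for whatever model fitting and scoring (e.g. BIC) the modeler supplies.

from typing import Callable, List, Optional, Tuple, TypeVar

P = TypeVar('P')   # a model parameterization, e.g. a dict of split functions

def search_by_size(candidates: List[P],
                   size: Callable[[P], int],
                   fit_and_score: Callable[[P], float]) -> Tuple[Optional[P], float]:
    # Greedy search over parameterizations in order of increasing size.
    # size: the summed sizes of a parameterization's split functions.
    # fit_and_score: fits the parameterized model to training data and returns
    #   a complexity-penalized fit score (lower is better, e.g. BIC).
    # Halts at the first point where the score stops improving (a local optimum).
    best, best_score = None, float('inf')
    for candidate in sorted(candidates, key=size):
        score = fit_and_score(candidate)
        if score >= best_score:        # fit stopped improving: stop searching
            break
        best, best_score = candidate, score
    return best, best_score

# Toy usage with made-up sizes and scores:
toy = [('by NULL', 5, 120.0), ('by skill', 20, 95.0), ('by student', 40, 99.0)]
print(search_by_size(toy, size=lambda t: t[1], fit_and_score=lambda t: t[2]))
# -> (('by skill', 20, 95.0), 95.0)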
