Adaptive Test Design with a Naive Bayes Framework

InProceedings

Bayesian graphical models are commonly used to build student models from data. A number of standard algorithms are available to train Bayesian models from student skills assessment data. These models can assess student knowledge and skills from a few observations. They are useful for Computer Adaptive Testing (CAT), for example, where test items can be administered in an order that maximizes the information they provide. In practice, such data often contain missing values and, under some circumstances, missing values far outnumber observed values. However, when collecting data from test results, one can often choose which values will be present or missing through deliberate test design. We study how to optimize the choice of test items for collecting the data that will be used for training a Bayesian CAT model, so as to maximize the predictive performance of the model. We explore the use of a simple heuristic for test item choice based on the level of uncertainty. The uncertainty of an item is derived from its initial probability of success and, thus, from its difficulty. The results show that this choice does affect model performance and that the heuristic can lead to better performance. Although the study's results are more exploratory than conclusive, they suggest interesting research avenues.

Three sampling schemes determine which values are missing in the training data:

1. Uniform: uniform random sampling of missing values.

2. Most uncertain: higher sampling rate of missing values for uncertain items (favoring the choice of average-difficulty items).

3. Least uncertain: lower sampling rate of missing values for uncertain items (favoring the choice of difficult and easy items alike).

Uncertain items are the items whose initial probability of success is closest to 0.5. For the most uncertain and least uncertain conditions, the probability of sampling is based on the x = [0, 2.5] segment of a normal (Gaussian) distribution, as reported in Figure 1. The probability of an item being sampled therefore varies from 0.40 to 0.0175 as a function of its rank, from the most to the least uncertain item on that scale. Items are first ranked according to their uncertainty and are then attributed a probability of being sampled following this distribution, as illustrated in the sketch at the end of section 4.1. The distributions are the same for conditions (2) and (3), but the ranking is reversed between the two. For the uniform condition (1), all items have an equal probability of being sampled.

Figure 1: Sampling probability distribution of items used for the most uncertain and least uncertain sampling schemes.

Ten samples are created according to the three sampling schemes above. They are used to validate the effect of the sampling scheme by performing CAT simulations and measuring the predictive power of the models trained under each scheme.

4.1 Simulation process

The experiment consists of simulating the question answering process with the real subjects. An item is chosen and the outcome of the answer, success or failure, is fed to the inference engine (POKS). An updated probability of success is computed given this new evidence. All items for which the probability is above 0.5 are considered mastered and all others are considered non-mastered. We then compare these predictions with the real answers to obtain a measure of their accuracy. The process is repeated from 0 items administered until all items are "observed". Observed items are bound to their true value, so that after all items are administered, the score always converges to 100%.

The simulations replicate a context of computer adaptive testing (CAT) where the system chooses the question items in order to optimize skills assessment. The choice of item relies on a measure of the most informative question to administer to the examinee. This can be achieved in a number of ways, and the results are often relatively close. We use a heuristic that our exploratory results have shown to approach the performance of the information gain approach (see [5]), but which is computationally much faster. It consists of choosing the item i that has a high entropy and is highly connected to other nodes:

    i* = argmax_i [ E(i) / E(0.5) · links(i) ]

where E(i) is the entropy of item i and links(i) is the number of incoming and outgoing links. The maximal entropy of an item, E(0.5), is a normalizing factor that ensures the weights of the entropy term and of the number of links are comparable. Once the outcome of the answer to a question item is obtained, the probability of success of every other question is recalculated according to the POKS framework described above.

Simulations consist of ten-fold cross-validation runs. Each run uses a different random sampling for test design (the choice of items according to the three schemes described in section 4) and a different random split of the examinees used for training and testing. We report the average results of the 10 simulations for each experimental condition.
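To make the sampling schemes and the simulation procedure concrete, here is a minimal Python sketch. sampling_weights reproduces the rank-to-Gaussian-density mapping of Figure 1; next_item implements the entropy-and-connectivity heuristic in the form reconstructed above (the exact equation did not survive extraction, so the product of the E(0.5)-normalized entropy and the link count is an assumption consistent with the surrounding description); simulate_cat mirrors the simulation loop, with engine.posterior standing in for the POKS inference engine, whose implementation is not shown. All identifiers are illustrative, not from the paper.

import numpy as np

def sampling_weights(p_success, scheme="most_uncertain"):
    """Probability of each item being selected as missing in the training
    data. Items are ranked by uncertainty (initial success probability
    closest to 0.5); ranks are mapped onto the x = [0, 2.5] segment of the
    standard normal density, so raw weights run from phi(0) ~ 0.40 down to
    phi(2.5) ~ 0.0175, matching the range quoted above."""
    p = np.asarray(p_success, dtype=float)
    n = p.size
    if scheme == "uniform":
        return np.full(n, 1.0 / n)
    order = np.argsort(np.abs(p - 0.5))   # rank 0 = most uncertain item
    if scheme == "least_uncertain":
        order = order[::-1]               # same distribution, reversed ranking
    x = np.linspace(0.0, 2.5, n)          # rank -> position on the segment
    phi = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
    weights = np.empty(n)
    weights[order] = phi                  # most uncertain gets the largest weight
    return weights / weights.sum()        # normalize to a distribution

def entropy(p):
    """Binary entropy in bits; E(0.5) = 1 is the maximal entropy."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def next_item(p_success, links, administered):
    """Select the next question: high entropy and highly connected.
    The product form is an assumed reading of the selection formula."""
    score = entropy(np.asarray(p_success)) / entropy(0.5) * links
    score[list(administered)] = -np.inf   # never re-administer an item
    return int(np.argmax(score))

def simulate_cat(engine, true_answers, links):
    """Replicate the simulation loop: administer items one by one, update
    P(success), classify items as mastered when P > 0.5, and record the
    prediction accuracy at each step. engine.posterior is an assumed
    interface mapping {item: outcome} evidence to per-item probabilities."""
    n = len(true_answers)
    evidence = {}                         # item index -> observed outcome
    curve = []
    for step in range(n + 1):
        p = engine.posterior(evidence)    # updated P(success) per item
        predicted = p > 0.5               # mastered vs. non-mastered
        for item, outcome in evidence.items():
            predicted[item] = outcome     # observed items keep their true value
        curve.append(float(np.mean(predicted == true_answers)))
        if step < n:
            item = next_item(p, links, administered=evidence.keys())
            evidence[item] = bool(true_answers[item])
    return curve                          # curve[-1] is always 1.0

With n = 60 items, for instance, sampling_weights gives the most uncertain item a raw weight of about 0.40 and the least uncertain about 0.0175 before normalization, as in the description of Figure 1.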
4.2 Data sets

Four data sets are used for the simulations. They are based on real data from tests in four different domains:

1. College math: a 60-item test covering different topics in mathematics, from general high school math to college-level geometry, linear algebra, and calculus. The test was administered to 426 candidates newly admitted to an engineering school.

2. UNIX: a 34-item test covering knowledge of UNIX shell commands, from the basic "change directory" (cd) to advanced data manipulation commands with awk. The test was administered to 48 individuals with a wide variety of knowledge about UNIX.

3. Arithmetic: a 20-item test on basic fraction arithmetic. The test was administered to 149 pupils in grades 10 to 12. More details can be found in [13].

4. French: a 160-item test of general French grammar, reading, and comprehension, administered to 42 adults.

Table 1 reports general statistics on these data sets, as well as the sizes of the training and testing samples used for the simulations. The proportion of missing values inserted in each training set is half of the data; the testing data sets contain no missing values.

Table 1: Data sets.

5 Results

The results of the simulation experiments are reported in Figures 2(a) to 2(d). The Y axis represents the proportion of correct predictions, while the X axis reports the number of items administered. As mentioned, administered items are considered correctly classified, and thus, after all items are administered, the score reaches 100%. Given that the items are initialized to their unconditional probabilities, the prediction score generally starts above 70%, which indicates that more than two thirds of the items are already correctly classified initially. A 90% confidence interval over the 10 simulations is reported around each data point.

Each figure contains four curves. For comparison purposes, we report the Full condition, which corresponds to the results for the full data set, without any missing values. The other three conditions are described in section 4. The results show no significant differences for the Arithmetic and College math data sets. However, more substantial differences are observed for the other two data sets (UNIX and French), and they follow a regular pattern: the least uncertain condition systematically outperforms the most uncertain condition, which, in turn, performs systematically worse than the uniform condition. As expected, the full data set is systematically better than, or equal to, the data sets that contain missing values.

Figure 2: Results for the four data sets. 90% confidence intervals are displayed around each data point.
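The paper does not state how the 90% intervals are computed from the 10 simulation runs; a minimal sketch using a standard Student-t interval, the usual choice for so few runs, follows as an assumption. The function name and the example scores are illustrative only.

import numpy as np
from scipy import stats

def confidence_interval(scores, level=0.90):
    """Two-sided t-based confidence interval for the mean accuracy
    of a small set of simulation runs (here, n = 10)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    half = stats.t.ppf((1 + level) / 2, df=n - 1) * sem
    return mean - half, mean + half

# Illustrative values only, not results from the paper:
low, high = confidence_interval([0.81, 0.79, 0.83, 0.80, 0.82,
                                 0.78, 0.84, 0.80, 0.81, 0.79])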
6 Discussion

These results suggest that a higher sampling rate for the least uncertain items generally brings higher predictive performance than the most uncertain or the uniform choice, although this gain is not systematic. These results remain exploratory and a number of questions are left open. For one, how should we explain the pattern of differences found between the least uncertain and the most uncertain conditions? We initially hypothesized that the most uncertain items are the ones that would benefit the most from a higher sampling frequency: these items are generally the ones that bring the most information, and it seems reasonable to gather more data for them to correctly establish their relations to other items.

However, and contrary to these expectations, higher sampling of uncertain items yields the models with the poorest performance. One plausible explanation is that the estimation of the model's conditional probabilities is more subject to noise and to miscalibration for probabilities close to 0 or 1 than for mid-range probabilities. As a consequence, a higher sampling rate for these items is required. This hypothesis also explains why we observe a larger difference between least uncertain and most uncertain for the small data sets (UNIX and French) than for the larger ones (College math and Arithmetic): larger data sets are not as subject to sampling noise as smaller ones are.

Another open question is whether these results apply to other domains and to other Bayesian frameworks for student modeling. For example, would the results be the same if we used a more general Bayesian Network approach that captures independence relations, such as in [13]? A potentially interesting hypothesis to investigate is that the most informative items are the ones that are central and highly connected in a Bayesian Network, that is, the nodes that are likely to influence the greatest number of other nodes in the network. These nodes could benefit from a higher sampling rate. Moreover, given an initial topology of a Bayesian Network, we could guide the sampling beyond individual nodes, toward pairs or n-tuples of nodes that are deemed more critical. However, the topology of a Bayesian Network might not be reliably established with small sample sizes, in contrast to the heuristic we used, which is based on estimating the individual items' unconditional probabilities and therefore requires relatively small samples. It is feasible to design the tests with an initial sample of a few tens of data records, and then collect a larger sample for estimating the conditional, joint probabilities. Whether this can be done effectively for a Bayesian Network remains open.

Further analysis and investigation are obviously required to fully understand these results. Nevertheless, this study shows that we can influence the predictive performance of a Naive Bayes framework trained on partial data when we have the opportunity to select which values are missing. It opens interesting questions and can prove valuable in some contexts of application.
