Using Item-type Performance Covariance to Improve the Skill Model of an Existing Tutor

InProceedings

Hao Cen

Kenneth R. Koedinger

Lili Wu

P. Pavlik Jr

Proceedings of Educational Data Mining, 2008

Using data from an existing pre-algebra computer-based tutor, we analyzed the covariance of item-types with the goal of describing a more effective way to assign skill labels to item-types. Analyzing covariance is important because it allows us to place the skills in a related network in which we can identify the role each skill plays in learning the overall domain. This placement allows more effective and automatic assignment of skills to item-types. To analyze covariance, we used POKS (partial order knowledge structures) to analyze item-type outcome relationships and Pearson correlation to capture item-type duration relationships. Hierarchical agglomerative clustering of these item-types was also performed using both outcome and duration covariance patterns. These analyses allowed us to propose improved skill labeling that removes irrelevant item-types, clusters related types, and clarifies the optimal temporal ordering of these clusters during practice.

$P(B \mid A)$: the probability of getting item-type B right, given A was right
$P(\neg A \mid \neg B)$: the probability of getting item-type A wrong, given B was wrong
$p_c$: the minimum probability that $P(B \mid A)$ and $P(\neg A \mid \neg B)$ need to hold
$\alpha_c$: the error of the POKS tests, which may be set differently for different tests
$N_{A \wedge B}$: the number of times that students get A right and B right*
$N_{A \wedge \neg B}$: the number of times that students get A right and B wrong*
$N_{\neg A \wedge B}$: the number of times that students get A wrong and B right*
$N_{\neg A \wedge \neg B}$: the number of times that students get A wrong and B wrong*
$\mathrm{CDFBinomial}(x, n, p)$: the cumulative distribution function of a binomial distribution with n trials and success probability p

* Because the independence of observations assumption of the statistical tests was strained when considering repetitions of the same item-type for the same student, these values were normalized by dividing each by the total so that they summed to 1. The statistical tests then assumed a number of degrees of freedom equal to the number of subjects in each pairwise comparison. This correction is overly conservative, but provides an unbiased correction for the sometimes great between-subjects variability in the N of repetitions.

The idea of POKS is that if $A \Rightarrow B$ perfectly, we would expect $P(B \mid A) = 1$ and $P(\neg A \mid \neg B) = 1$, and that the contingency table shows a lack of independence. In reality, due to noise and imperfect $A \Rightarrow B$, we would expect the above two equalities not to hold exactly. Thus we can set up tests such that if $P(B \mid A)$ and $P(\neg A \mid \neg B)$ are above some threshold $p_c$, we can have some confidence in $A \Rightarrow B$.

Therefore, for there to exist a relationship between A and B, three tests must succeed. The first two tests check that $P(B \mid A)$ and $P(\neg A \mid \neg B)$ are above some threshold $p_c$, given the allowed test error $\alpha_c$. The third test verifies whether the conditional probabilities $P(B \mid A)$ and $P(\neg A \mid \neg B)$ are different from $P(B)$ and $P(\neg A)$.

Test 1 returns true if $\mathrm{CDFBinomial}(N_{A \wedge \neg B},\ N_{A \wedge B} + N_{A \wedge \neg B},\ 1 - p_c) < \alpha_c$.
Test 2 returns true if $\mathrm{CDFBinomial}(N_{A \wedge \neg B},\ N_{\neg A \wedge \neg B} + N_{A \wedge \neg B},\ 1 - p_c) < \alpha_c$.
Test 3 returns true if the 2×2 contingency table of $N_{A \wedge B}$, $N_{A \wedge \neg B}$, $N_{\neg A \wedge B}$ and $N_{\neg A \wedge \neg B}$ passes a $\chi^2$ test with error rate $\alpha_c$.

3.2 Clustering based on conditional log odds

As we can see by examining the tests, they rely on the contingency table that is tabulated for each pairwise item-type comparison. These contingency tables create a covariance structure that "places" each item-type in the POKS graph relative to the other item-types. By reflecting on this we can see that if we want to cluster the items based on the similarity of the required proficiencies, which would imply they require the same skills, we need a distance metric for item-types that a) captures that two items co-vary and b) can cope with the fact that two items may not be equally difficult despite having very similar covariance structures.

Requirement a means we need a distance metric that captures the structure of the contingency tables for item-type X1 as compared to the contingency tables for item-type X2. Requirement b means that this metric probably should not capture the structure of the tables relative to the outcome of performance X1 or X2. Rather, we should describe a distance metric that is computed conditionally for those cases where X1 or X2 is a success or failure. Requirement b is important for the purpose here because the tutor introduces item-types in a fixed order. This fixed order means differences in average performance between item-types may be caused by learning. However, this difference in performance between item-types that represent the same skill should not greatly alter the contingencies given the response is a success or failure.
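The three POKS tests defined before Section 3.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the analysis: the function names, the default values `p_c=0.5` and `alpha_c=0.05`, and the use of raw integer counts are all assumptions. In particular, it skips the normalization and degrees-of-freedom correction described in the footnote and applies an ordinary chi-square test of independence with 1 degree of freedom.

```python
from math import comb, erfc, sqrt


def binomial_cdf(x, n, p):
    """Return P(X <= x) for X ~ Binomial(n, p); intended for modest counts."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def chi2_pvalue_2x2(n_ab, n_a_nb, n_na_b, n_na_nb):
    """p-value of the Pearson chi-square test of independence on a 2x2 table."""
    row_a, row_na = n_ab + n_a_nb, n_na_b + n_na_nb
    col_b, col_nb = n_ab + n_na_b, n_a_nb + n_na_nb
    if min(row_a, row_na, col_b, col_nb) == 0:
        return 1.0  # a degenerate table carries no evidence of dependence
    total = row_a + row_na
    chi2 = total * (n_ab * n_na_nb - n_a_nb * n_na_b) ** 2 / (
        row_a * row_na * col_b * col_nb
    )
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


def poks_implies(n_ab, n_a_nb, n_na_b, n_na_nb, p_c=0.5, alpha_c=0.05):
    """Return True if all three tests support A => B (thresholds are hypothetical)."""
    # Test 1: P(B | A) is above p_c, judged by the A-right, B-wrong violations.
    test1 = binomial_cdf(n_a_nb, n_ab + n_a_nb, 1 - p_c) < alpha_c
    # Test 2: P(not A | not B) is above p_c, using the same violation count.
    test2 = binomial_cdf(n_a_nb, n_na_nb + n_a_nb, 1 - p_c) < alpha_c
    # Test 3: the table departs from independence.
    test3 = chi2_pvalue_2x2(n_ab, n_a_nb, n_na_b, n_na_nb) < alpha_c
    return test1 and test2 and test3


print(poks_implies(40, 2, 20, 38))   # few A-right/B-wrong cases -> True
print(poks_implies(25, 25, 25, 25))  # independent outcomes -> False
```

Note that both binomial tests count the same violating cell (A right, B wrong) and differ only in the number of trials, which is why an easy item-type with few joint failures struggles to pass Test 2.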
To compare covariance structures, it helps to consider the data for two item-types (X1 and X2) as being organized into two vectors of contingency tables describing each item-type's relationship with all other possible item-types (Yn). If we consider that each contingency table is organized with X frequency results for the A item-type and Y frequency results for the B item-type, then for each Xn by Yn contingency table we individually computed 2 values: one for when Xn is a success and one for when Xn is a failure. In each of these 2 cases, the log odds of B vs. ~B frequencies is used to capture the strength of the odds B:~B on a continuous scale. Because these log odds do not capture the effect of the frequency of A or ~A and only capture the relative frequency of B vs. ~B, they are not reactive to learning of A that does not affect the patterns of B vs. ~B, nor are they reactive to differences in the n of observations of the B:~B results. Using this procedure we computed these 2 log odds (one for A and one for ~A) for each contingency table in each vector of contingency tables (X1 or X2).

At this point we can describe vectors of log odds values for each column item-type X1 and X2 (getting 2 values, conditional on A and ~A, for each Xn by Yn pair) and compute their Pearson correlation to determine the nearness of the two item-types in the knowledge space. To do this clustering we used a simple agglomerative hierarchical clustering to cluster item-types into a new grain size, which implies that clustered items share the same performance requirements (skill). This new method shares similarities with correlation clustering methods that have proven useful for graph partitioning [7] and is described further in the next section.

3.3 Integrating duration covariance information

Previous work to understand the knowledge space has focused exclusively on how performance success or failure can be used to determine ordered structures.
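The outcome-based similarity of Section 3.2, which uses only success and failure counts, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually run: each contingency table is represented as a tuple `(n_ab, n_a_nb, n_na_b, n_na_nb)` with A the item-type being described and B the other item-type, the function names are hypothetical, and the additive `smoothing` of 0.5 is an assumption introduced here to avoid taking the log of zero.

```python
from math import log, sqrt


def conditional_log_odds(table, smoothing=0.5):
    """Return the log odds of B vs. ~B given A, and given ~A, for one table."""
    n_ab, n_a_nb, n_na_b, n_na_nb = table
    given_a = log((n_ab + smoothing) / (n_a_nb + smoothing))
    given_not_a = log((n_na_b + smoothing) / (n_na_nb + smoothing))
    return given_a, given_not_a


def log_odds_vector(tables):
    """Flatten one item-type's tables (one per other item-type) into a vector."""
    vector = []
    for table in tables:
        vector.extend(conditional_log_odds(table))
    return vector


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)


# Hypothetical counts: X2 has half as many observations as X1 but the same odds pattern.
x1_tables = [(40, 10, 20, 30), (30, 20, 10, 40)]
x2_tables = [(20, 5, 10, 15), (15, 10, 5, 20)]
# X3 has the opposite relationship to the same two comparison item-types.
x3_tables = [(10, 40, 30, 20), (20, 30, 40, 10)]

v1, v2, v3 = (log_odds_vector(t) for t in (x1_tables, x2_tables, x3_tables))
print(pearson(v1, v2))  # close to 1: candidates for the same cluster
print(pearson(v1, v3))  # close to -1: dissimilar covariance structure
```

Because each value is a ratio within one row of the table, halving every count (as in `x2_tables`) leaves the vector almost unchanged, which is the insensitivity to the number of observations that the text describes.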
However, besides possessing success data, we also had data on the duration of each item-type performance. This data allowed us to compute pairwise duration correlations (r values) of the item-types that correspond with the POKS tests for each pairwise item-type relationship. While it might be possible to use these correlations in some joint function with the strength of the result of the POKS tests for each pair, at this point we just used these values as an additional filter on which POKS implications we accepted as significant. For this paper we chose to exclude any $A \Rightarrow B$ pairs where r < 0.

More importantly, the duration correlation vectors created for each item-type were themselves correlated to produce values that represented the degree of similarity in the duration relationships between item-types. This statistic for each pair of item-types was multiplied by the correlation from the outcome-based (log odds vectors) correlation above, and the item-type pair with the highest correlation product is clustered in each step of the simple agglomerative clustering. Clustering continues until the correlation of pairwise correlation vectors is above the clustering coefficient.

4 Results

Figure 1 shows the POKS graph obtained from this analysis and corresponds to the groupings in Table 1, which provides additional statistics to help interpret the results. The ovals in Figure 1 represent collections of item-types (or individual item-types), which were a function of the clustering procedure (also grouped in Table 1). The item-type label encodes the following information: probability correct_Unit name_section number_Skill ID number. The table also provides the average duration and total number of database observations (Rps, repetitions in 1000s) for each item-type. Colors indicate the majority unit membership for the grouped item-types, where LCM is the least common multiple unit, GCF is the greatest common factor unit, and FracRep is a visual and written fraction representation unit.
Edge labels provide the average $p_c$ value and the duration correlation r.

4.1 Irrelevant knowledge components

The analysis failed to find any covariance relationship for 17 of the item-types in the 3 units of the tutor. These knowledge components provide an example of how this method can be used to suggest proficiencies that do not covary with the other item-types in the tutor. These so-called irrelevant skills tend to be knowledge components with higher probability correct, because in cases with higher probability correct there is less chance to get the examples of not A and not B that are needed to pass the second binomial test. This means that these items are found to be irrelevant because they are so easy that it is less likely to detect how they influence other item-types, even where such relationships might exist given a similar problem with higher difficulty. While further analysis would be necessary to determine if these skills were truly irrelevant, the method has provided us with an initial hypothesis about which skills might be removed from the tutor so that the time saved can be spent on item-types with stronger relationships with the other tutor content.

4.2 Redundant knowledge components

The clustering of item-types indicates that the pattern of success contingency tables and the pattern of duration correlations were similar for these item-types, such that if item-types X and Y are in a cluster it indicates they have similar relationships to the other item-types. By extension we can suppose that this similar place in the covariance structure suggests that performance for these item-types is constrained by the same skill. The fact that this clustering occurs suggests that the human coders used statistically irrelevant features to code the item-types. For example, consider the green ovals in Figure 1.
The right oval includes a variety of item-types that might be described as understanding the denominator, while the left oval includes item-types that deal with the numerator.

Table 1. Key for graph.

Much of this clustering may be necessary because the human coders were instructed to code in as fine a grain as practical. This instruction led to different skills being coded depending on whether the stimulus was a vertical bar, horizontal bar, circle, square, or number line. In contrast, the clustering method lumped these skills together, indicating they may actually be the same proficiency. By splitting these groups into separate skills the human coder delinked these proficiencies relative to the tutor's automatic scheduling mechanisms. So, for example, if a student does very well on these clustered item-types as they are introduced, it will not result in less practice for the other items that our analysis suggests are in the cluster. Therefore, by proposing these clusters we can address learning of the concept more efficiently because we can model transfer between item-types that are controlled by the same underlying proficiencies. Modeling transfer between item-types allows us to know when a particular concept, skill or procedure has been mastered despite the fact that we may not have given a student examples of all the item-types in the cluster.

(Also note that sometimes the human coders did repeat the same skill IDs for isomorphic item-types in different sections of the same unit. As we can see in Table 1, our clustering method tended to confirm these human skill labels by clustering these item-types. E.g., skill ID 33 (and others) appears twice in the same cluster, indicating that the model agrees with the human coders' decision to label these item-types with the same skill despite the fact that they are in different sections of the same unit.)

Figure 1. Graph structure described in the results section.

4.3 Ordering of knowledge components
The data comes from a tutor where the units follow a fixed order, and we can use our analysis to question the appropriateness of that order. As discussed in the introduction, we assume that introducing a prerequisite before its post-requisite will result in better learning, because each new idea will have been more adequately prepared by the scaffolding from prerequisite practice. This analysis of the optimal order is more difficult (than analysis of clustering or irrelevant skills) because the tutor repeats item-types, and learning caused by this repetition might explain why a downstream item-type performs better than an earlier item-type. However, while ideally we would include both orders of performance of any pair of item-types in our sample, it still seems safe to infer that very strong prerequisite relationships are not determined mostly by learning effects.

Take for instance the position of the two green clusters (fraction concepts) relative to the GCF (greatest common factor unit)_2_31 item-type. Skill ID 31 involves a word problem in which students must produce the other factor for each of 2 products when the first factor has just been supplied by the student, e.g. "You have groups of 4 apples and 6 pears, what is the greatest number of equal sized groups of fruit you can make? (This is an ID 30 skill.) How many apples in each group? (ID 31) How many pears in each group? (also ID 31)". This dependence of skill ID 31 in section 2 of the GCF unit on the fraction concept clusters seems plausible, since this contextualized problem involves the denominator concept of understanding that wholes can be divided and also the numerator concept that these portions must be composed of a certain count of parts.
While this reasoning might normally seem strained, the support from the graph implies that the FracRep item-types should be practiced before this contextualized GCF section if we want to respect the recommendations of the theory of part-whole training to address the prerequisite skill first.

5 Conclusions

Future work will focus on integrating this knowledge space analysis with tracking of individual skills such as is currently used in the Bridge to Algebra Tutor. By integrating the knowledge space analysis it appears that we can get a rich perspective on what student actions might deserve to be coded as independent knowledge components. As we discussed in the results, this perspective should improve the performance of the model that tracks repetition of single skills in the tutor, because that model can be modified to remove irrelevant skills, made less redundant by clustering skills, and made to better conform to the theory that prerequisites should be trained before later skills.

This integration may proceed as shown in work by Cen on the Learning Factors Analysis (LFA) method, which allows improvement in Cognitive Tutor models by searching a space of hypothetical skills for the combination that best fits previously collected data [8]. LFA starts with an initial cognitive model represented as a binary matrix that maps a collection of skills to each item-type (or item) and uses a set of customized item response models to evaluate the model fit produced by any given mapping of skills to item-types for a particular dataset. These binary matrices are based on the tentative judgments of human experts about the effect of the features of the item-types, and LFA can systematically incorporate those features into existing cognitive models by generating and searching for alternative skill labels as allowed for in the matrix. This method has been used by various researchers to evaluate cognitive models in geometry, physics and reading [9, 10].
However, the method still requires a domain expert to propose alternative labeling of skills along which the algorithm searches. The methods proposed in this paper show promising potential to combine the strengths of POKS, item-type clustering and LFA to answer various EDM research questions by allowing us to use POKS and item-type clustering to generate a starting or alternative binary skill matrix for LFA model search.

Acknowledgements

This research was supported by the U.S. Department of Education (IES-NCSER) #R305B070487 and was also made possible with the assistance and funding of Carnegie Learning Inc., the Pittsburgh Science of Learning Center, DataShop team (NSF-SBE) #0354420 and Ronald Zdrojkowski.
