Skill Set Profile Clustering: The Empty K-Means Algorithm with Automatic Specification of Starting Cluster Centers

InProceedings

While students’ skill set profiles can be estimated with formal cognitive diagnosis models [8], their computational complexity makes simpler proxy skill estimates attractive [1, 4, 6]. These estimates can be clustered to generate groups of similar students. Often hierarchical agglomerative clustering or k-means clustering is utilized, requiring, for K skills, the specification of 2^K clusters. The number of skill set profiles/clusters can quickly become computationally intractable. Moreover, not all profiles may be present in the population. We present a flexible version of k-means that allows for empty clusters. We also specify a method to determine efficient starting centers based on the Q-matrix. Combining the two substantially improves the clustering results and allows for analysis of data sets previously thought impossible.

1. Set the $2^K$ starting cluster centers $m_g$ appropriately in the K-dimensional hypercube (Sec. 4).
2. Create the cluster assignment vector A by assigning each $B_i$ to the closest $m_g$.
3. For all g: if no $B_i$ is assigned to $m_g$, i.e. $\{i : A_i = g\} = \emptyset$, remove $m_g$ from the set of centers; otherwise, update $m_g$ to the mean of the $B_i$ assigned to it.
4. Alternate between 2) and 3) until the cluster assignment vector A does not change.

This algorithm continues to minimize the within-cluster (WC) criterion with each step; the empty clusters make no contribution to the criterion value. We discuss the choice of appropriate starting centers in Section 4.

Our k-means variation allows for empty clusters, or equivalently, fewer clusters than originally requested. This flexibility removes the constraint that there be one cluster per skill set profile. Early work in this area has relied heavily on small examples with K = 2, 3, 4 skills. With the advent of online tutoring systems and end-of-year assessment exams, the number of skills has grown considerably; it is not uncommon to be interested in K = 10, 15, 20, etc. For K = 10, say, we would have $2^{10} = 1024$ different skill set profiles. In practice, it would be extremely uncommon to see a sample with all 1024 different subgroups. Moreover, the large number of profiles computationally prohibits clustering of samples where $N < 2^K$. Our k-means variation allows for the identification of the clusters/profiles that we do have; any computational constraints (e.g. memory, storage) are limited and are a characteristic of the operating system/platform, not of the algorithm.

4 Choosing Starting Centers

It is well known that k-means can be dependent on the set of starting centers [9]. Given our goal of identifying the true skill set profiles in the population, a natural set of starting centers might be the hypercube corners $\alpha_i = \{\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK}\}$ where $\alpha_{ik} \in \{0, 1\}$. If students map closely to their profile corners, k-means should locate the groups affiliated with the corners very quickly.
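As an illustration, the empty k-means loop (steps 2–4 above) can be sketched in a few lines of NumPy. This is our own minimal sketch, not the paper's implementation: the function name and array layout are illustrative. The centers M passed in could be, for example, the hypercube corners just described.

```python
import numpy as np

def empty_kmeans(B, M, max_iter=100):
    """k-means that discards clusters left empty after assignment.

    B: (N, K) capability matrix, rows in [0, 1]^K.
    M: (G, K) starting centers, G <= 2^K.
    """
    M = M.astype(float).copy()
    assign_prev = None
    for _ in range(max_iter):
        # Step 2: assign each B_i to the closest remaining center m_g.
        d = ((B[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Step 3: drop centers with no assigned points; update the rest
        # to the mean of their assigned rows.
        keep = [g for g in range(len(M)) if (assign == g).any()]
        M = np.vstack([B[assign == g].mean(axis=0) for g in keep])
        # Relabel assignments to index only the surviving clusters.
        relabel = {g: new for new, g in enumerate(keep)}
        assign = np.array([relabel[g] for g in assign])
        # Step 4: stop once the assignment vector no longer changes.
        if assign_prev is not None and np.array_equal(assign, assign_prev):
            break
        assign_prev = assign
    return M, assign
```

Because empty centers are simply removed, the algorithm may return fewer clusters than were requested, which is exactly the flexibility described above.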
However, even if all profiles are present, the students may not be near their profile corner due to attenuation of our skill estimates in the presence of multiple skill items. Below are two possible Q matrices for J = 24 items. In Q1, items 1-8 require only skill 1, items 9-16 only skill 2, and items 17-24 only skill 3 (all single skill items). In Q2, the first 12 items are single skill; the remaining items require multiple skills. If a student’s true skill set profile is {0, 1, 0}, (s)he should miss items 1-8 and 17-24 in Q1 but receive a $B_{i2}$ of 1. In Q2, (s)he should miss items 1-4 and 9-24, which correspondingly drops $B_{i2}$ from 13/13 = 1 to 4/13. Similarly, a student with profile {1, 0, 1} will have $B_{i1} = B_{i3} = 1$ for Q1 but see a drop in capability from 13/13 to 7/13 using Q2. (Analogous drops are seen in sum scores.) These attenuated estimates are not reflective of the true profiles.

4.1 Generating Response Data

To illustrate, we generate response data for N = 250 students for both Q-matrices from the deterministic inputs, noisy “and” gate (DINA) model, a common conjunctive cognitive diagnosis model in educational research [8]. The DINA item response form is

$P(y_{ij} = 1 \mid \eta_{ij}) = (1 - s_j)^{\eta_{ij}} \, g_j^{1 - \eta_{ij}}$, where $\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}$.

The slip parameter $s_j$ is $P(y_{ij} = 0 \mid \eta_{ij} = 1)$; the guess parameter $g_j$ is $P(y_{ij} = 1 \mid \eta_{ij} = 0)$. Similar to the capability matrix (and the sum score), if student i is missing any skills required for item j, $P(y_{ij} = 1)$ decreases due to the conjunctive assumption. Prior to simulating the $y_{ij}$, we fix the skills to be of equal medium difficulty with an inter-skill correlation of either 0 or 0.25 and generate true skill set profiles $\alpha_i$ for each student. (In our work thus far, only a perfect inter-skill correlation has a non-negligible effect on the results.) These choices spread students among the $2^K$ true skill set profiles. We randomly draw our slip and guess parameters ($s_j \sim \mathrm{Unif}(0, 0.30)$; $g_j \sim \mathrm{Unif}(0, 0.15)$).
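The generation step just described can be sketched as follows. The vectorized computation of $\eta_{ij}$ and the function name are our own illustrative choices, but the response form and the Unif(0, 0.30), Unif(0, 0.15) slip/guess draws follow the setup above.

```python
import numpy as np

def simulate_dina(alpha, Q, rng):
    """Simulate DINA responses.

    alpha: (N, K) 0/1 matrix of true skill set profiles.
    Q:     (J, K) 0/1 matrix of skill loadings.
    """
    N, K = alpha.shape
    J = Q.shape[0]
    s = rng.uniform(0.0, 0.30, size=J)   # slip parameters s_j
    g = rng.uniform(0.0, 0.15, size=J)   # guess parameters g_j
    # eta_ij = 1 iff student i has every skill required by item j.
    eta = (alpha @ Q.T == Q.sum(axis=1)).astype(int)   # (N, J)
    # P(y_ij = 1) = (1 - s_j)^eta_ij * g_j^(1 - eta_ij)
    p = np.where(eta == 1, 1.0 - s, g)
    return (rng.uniform(size=(N, J)) < p).astype(int)
```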
Given the true skill set profiles and slip/guess parameters, we then generate the student response matrix Y and estimate the corresponding capabilities. Figure 1a contains the capabilities estimated from the Q1 matrix, numbered by their true profile (slightly jittered for visualization purposes). The absence of multiple skill items allows the mapping of the students to (near) their profile corners. Figure 1b contains the capabilities estimated via the Q2 matrix, also jittered and numbered by true profile. The presence of multiple skill items has pulled the non-{1, 1, 1} profiles toward the profile {0, 0, 0}. Using the hypercube corners as the starting centers for empty k-means in the second data set will make it more difficult to find the true groups. In fact, if there are no students within a corner’s octant (using 0.5 as the cutoff), that profile will not be found. When multiple skill items are included, the hypercube corners are no longer representative of the true profiles; we would expect their locations to be attenuated as well. Given the Q matrix, we map the true skill set profiles to their corresponding rescaled locations in the hypercube.

Figure 1: a) $B_i$ for Q1; b) $B_i$ for Q2. Starting centers indicated with X’s.

4.2 Rescaling the Starting Centers

Let $\alpha_p$ be the possible true skill set profiles, $p = 1, 2, \ldots, 2^K$ (e.g. {0, 0}, {1, 0}, {0, 1}, {1, 1} for K = 2). Let $A_{pj} = \prod_{k=1}^{K} \alpha_{pk}^{q_{jk}}$. Then $A_{pj}$ indicates whether or not a student with true skill set profile p has all the skills necessary to answer item j: if yes, $A_{pj} = 1$, and 0 otherwise. Our starting centers $C^*_p$ are then, for $k = 1, 2, \ldots, K$,

$C^*_{pk} = \dfrac{\sum_{j=1}^{J} A_{pj} q_{jk}}{\sum_{j=1}^{J} q_{jk}}$.

The numerator counts the number of items requiring skill k that skill set profile p could answer; the denominator is the number of items requiring skill k. (Note that $\sum_{j=1}^{J} q_{jk} = J_k$. If we were using sum scores, we would not scale $C^*_{pk}$ by the denominator.)
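This rescaling is a direct computation from the Q matrix. The sketch below (function name and profile-enumeration order are our own; it assumes every skill is required by at least one item, so no denominator is zero) computes $C^*_{pk}$ for all $2^K$ profiles:

```python
import itertools
import numpy as np

def rescaled_centers(Q):
    """Rescaled starting centers C*_pk from a (J, K) 0/1 Q matrix.

    Returns a (2^K, K) array, one row per skill set profile,
    enumerated in binary order {0,...,0} through {1,...,1}.
    """
    J, K = Q.shape
    Jk = Q.sum(axis=0)        # J_k: number of items requiring skill k
    need = Q.sum(axis=1)      # skills required by each item
    centers = []
    for profile in itertools.product([0, 1], repeat=K):
        a = np.array(profile)
        # A_pj = 1 iff profile p has every skill item j requires.
        A = (Q @ a == need).astype(int)          # length J
        # C*_pk = (sum_j A_pj q_jk) / J_k
        centers.append((A @ Q) / Jk)
    return np.vstack(centers)
```

For a Q matrix of all single skill items, each row reduces to the corresponding hypercube corner, matching the observation in the text.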
If the Q matrix contains only single skill items, the starting centers return to the hypercube corners. In our example, the starting centers for Q2 would be (as indicated by X’s in Figure 1b): (0, 0, 0); (4/13, 0, 0); (0, 4/13, 0); (0, 0, 4/13); (7/13, 7/13, 0); (7/13, 0, 7/13); (0, 7/13, 7/13); (1, 1, 1). These values are representative of the true profile locations given the Q matrix if all students answered items according to their true profiles. They are derived with respect to the conjunctive assumption made by the capability matrix (and the sum score). In practice, we would expect students to slip up or make some lucky guesses; however, setting the starting centers to these rescaled profile locations will allow the empty k-means (or even traditional k-means) to easily find the groups. With respect to missing profiles, we still use the full set of $C^*_p$ as our starting centers and allow the algorithm to discard the unnecessary ones. Note that $A_{pj}$ is similar in form to $\eta_{ij}$ in the DINA model. Although they serve a similar function, our approach is not unique to clustering DINA-generated data; the capability score (and the sum score) are reasonable estimates for any conjunctive CDM. As we will see in Section 5, we can similarly rescale the centers for use with other CDMs.

4.3 Performance

After calculating the corresponding B matrix, we cluster the students using hierarchical clustering (complete linkage) and traditional k-means, both asking for $2^3 = 8$ clusters. We then re-cluster with traditional k-means and the empty k-means variation using the rescaled starting centers. (Note that the symmetry of the rescaled starting centers is a direct result of the balanced Q matrix; an unbalanced Q matrix will give asymmetric starting centers.) To gauge performance, we calculate percent correct as the correct classification rate based on the best one-to-one mapping of clusters to true skill set profiles.
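The percent-correct measure can be made concrete with a small brute-force sketch (the function name is illustrative, and it assumes the number of clusters does not exceed the number of true profiles): it searches every one-to-one mapping of cluster labels to true profiles and keeps the best.

```python
import itertools

def percent_correct(true_labels, cluster_labels):
    """Correct classification rate under the best one-to-one mapping
    of cluster labels to true profile labels (brute force; adequate
    for the small cluster counts used here)."""
    true_ids = sorted(set(true_labels))
    clus_ids = sorted(set(cluster_labels))
    best = 0
    # Try each injective assignment of cluster ids to true ids.
    for perm in itertools.permutations(true_ids, len(clus_ids)):
        mapping = dict(zip(clus_ids, perm))
        hits = sum(mapping[c] == t
                   for c, t in zip(cluster_labels, true_labels))
        best = max(best, hits)
    return 100.0 * best / len(true_labels)
```

For larger cluster counts, the same optimum can be found in polynomial time by posing the mapping as a linear assignment problem, but the exhaustive search is simplest for $2^3 = 8$ clusters.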
We also quantify the clusters’ agreement with the true profiles using the Adjusted Rand Index (ARI), a common measure of agreement between two partitions [7]. Under random partitioning, the expected ARI value is zero; larger values indicate better agreement, with a maximum of one.

Table 1: Comparing Clustering Methods with the True Skill Set Profiles via % Correct, ARIs.

All methods performed well; the rescaled starting centers resulted in the highest percents correct and ARIs. Our k-means variation (correctly) found 8 clusters. To assess performance when not all possible skill set profiles are present, we then removed the three smallest profiles {(0, 0, 1); (0, 1, 1); (1, 0, 1)} (the most favorable situation for the other methods) and re-clustered.

Table 2: Comparing Clustering Methods with a Subset of the True Skill Set Profiles via % Correct, ARIs.

Again, all methods performed fairly well. Random starting centers for k-means showed a decrease in performance when clustering a subset of the profiles. Traditional k-means returned an error when using the rescaled starting centers since the initial cluster assignment returned empty clusters (as expected). Our k-means variation, however, found five clusters and had almost perfect agreement with the true skill set profiles. Even knowing the true number of clusters (5) is not a guarantee of superior performance: the five-cluster complete linkage solution was 93.5% correct with an ARI of 0.937, and traditional k-means (5 random centers) was 80.5% correct with an ARI of 0.679. Even when using only the five rescaled starting centers corresponding to the present profiles, traditional k-means performance (97.6%, ARI = 0.946) was merely comparable to our k-means variation, which used the rescaled centers but required only an upper bound on the number of clusters.

5 Simulations

We explore the performance of our approach using two conjunctive CDMs while varying N, J, and K.
For each simulation, the Q-matrix is randomly generated with a parameter dictating the percentage of single skill questions. We initially cluster all generated students and then remove a random number of profiles and re-cluster (the notation “—” corresponds to errors in standard k-means). We first simulate from the DINA model (Section 4.1).

Table 3: Performance with DINA-generated Responses: % Correct (ARIs).

In all cases, the k-means variation with attenuated starting centers outperforms the other methods (via ARIs). We also noted in our simulations (not all presented here) that increasing the percentage of multiple skill items decreases the other methods’ performance while our k-means variation remains fairly steady. Moreover, in “classroom” size data sets, this variation identified the profiles present while other methods unnecessarily split the clusters. We also present results using responses generated from the noisy input, deterministic output “and” gate (NIDA) model, another common conjunctive CDM. The item response form is

$P(y_{ij} = 1 \mid \alpha_i) = \prod_{k=1}^{K} \left[ (1 - s_k)^{\alpha_{ik}} \, g_k^{1 - \alpha_{ik}} \right]^{q_{jk}}$,

where $s_k$, $g_k$ are slip and guess parameters indexed on skill (rather than item); see [8] for further details. Responses are similarly generated; the results are comparable.

Table 4: Performance with NIDA-generated Responses: % Correct (ARIs).

6 Conclusions

The modified k-means algorithm presented here, called “empty k-means”, allows a more flexible approach to clustering in applications such as skill set profile clustering, where the true number of clusters is not known but may be bounded. It allows the user to specify a maximum number of possible clusters, removing the need for a subjective decision on the number of clusters. We define our attenuated starting cluster centers by the Q-matrix (giving the hypercube corners in the case of all single skill items).
As seen in the simulated results, in cases where all natural clusters were present, such starting values gave superior clustering results compared with both k-means with random starts and hierarchical clustering. In cases where some natural clusters were not present, the empty k-means algorithm with the defined starting values again had superior performance, while traditional k-means would commonly report an error due to empty clusters. Other applications might fit this framework as well; for example, compositional data on the simplex would have natural cluster centers on the corners of the hyper-triangle. Empty k-means could also be used to investigate both the validity of theorized cluster centers and the believed number of clusters. Further exploration of this approach is ongoing.
