Co-Clustering by Bipartite Spectral Graph Partitioning for Out-of-Tutor Prediction

InProceedings

Learning a more distributed representation of the input feature space is a powerful way to boost the performance of a given predictor. Often this is accomplished by partitioning the data into homogeneous groups by clustering, so that a separate model can be trained on each cluster. Intuitively, each such predictor is a better representative of the members of its cluster than a predictor trained on the entire dataset. Previous work has used this basic premise to construct a simple yet strong bagging strategy. However, such models have one significant drawback: instances (such as students) are clustered while features (tutor usage features/items) are left alone. One-way clustering measures, via some objective function, the degree of homogeneity between data instances. It is often observed that features, too, influence the final prediction in homogeneous groups. This indicates a duality in the relationship between clusters of instances and clusters of features. Co-clustering simultaneously measures the degree of homogeneity in both data instances and features, thus achieving clustering and dimensionality reduction at the same time. Students and features can be modelled as a bipartite graph, and a simultaneous clustering can be posed as a bipartite graph partitioning problem. In this paper we integrate an effective bagging strategy with co-clustering and present results for the prediction of out-of-tutor performance of students. We report that such a strategy is useful and intuitive, even improving upon the performance achieved by previous work.

"1. INTRODUCTION. A significantly large student population would usually have a wide variation in learning rates and knowledge levels. While there are numerous reasons for this diversity, three major reasons are related to: the type of instruction or help they respond best to, the way they are oriented towards learning and their levels of intellectual development [1],[2]. Needless to say, such differences would be reflected in the way students interact with educational software, making educational data quite difficult to mine well. Specifically there are many educational data mining problems where the end goal is to predict the performance of a student on a given in-tutor or out-of-tutor task. In-tutor tasks include predicting the probability that a student will answer an item correctly after attempting a sequence of similar questions whereas out-of-tutor tasks include being to predict student performance in post-tests based on the data from their tutor usage. The idea that students are quite different makes it apparent that perhaps it is not such a good idea to fit a global prediction model over the entire dataset for making predictions. In spite of the differences between students, educators commonly observe that students actually lie in very rough groups and have similar pedagogical needs. Taking a cue from this intuition, the task of prediction can be improved by clustering students into somewhat homogeneous groups and then training a separate predictor for each group. Such a predictor would obviously be a much better representative of students in that cluster as compared to a predictor which is fit on the entire dataset. For example, it makes sense to have a different model for students roughly classified as fast learners and a different model for slow learners than the same for both. This rather simple strategy of grouping students together and then modeling them separately can lead to improved performance in prediction and perhaps even better interpret-ability. While the above approach is compelling, there are two major issues with it. Firstly, while it is useful to model students as belonging to different groups, it is also known that such groupings are quite fuzzy and approximate. Students might actually possess different characteristics in varying degrees and what really sets them apart are certain dominant characteristics. For example students classified as fast learners might actually be slow learners in certain skills. A fast learner might also belong to the group of students that are good at recalling information etc. Thus, such complex characteristics can not be possibly modelled by simply clustering students to a certain limit and then training models for each cluster. This “spread” of features in a student across groups also needs to be captured to make a distributed predictive model such as the above more meaningful. Such an issue can be resolved by varying the granularity of the clustering and training separate models each time so that such features can be accounted for. A simple yet quite effective strategy to do so was proposed by the authors and was seen to work quite well both in educational contexts (in-tutor predictions [3], out-of-tutor predictions [4],[5]) and more generally [6]. The second problem with the above approach is that clustering is implicitly suggested to be one-way i.e only clustering students. But this need not necessarily be the case and only clustering students would consider only half of the story. 
As an example, consider a matrix in which the rows represent students and the columns represent their responses to certain items. Clearly, clustering students depends on their item distributions, implicitly suggesting that for certain students certain items are more important than others. Similarly, if items were to be clustered, the clustering would depend on which groups of students get them correct (or incorrect) most frequently. This indicates a duality between the two clusterings, which, on simultaneous co-clustering, could be very useful in answering many research questions. Co-clustering such a student-versus-item matrix would pair clusters of student proficiency with clusters of item performance, which could be seen as a sort of subject-treatment interaction. The idea extends to the more general case of students and features rather than just items. In this work we use this idea of co-clustering students and their tutor interaction features and interleave it with the bagging strategy previously used with clustering [3], [4], [5], [6]. The combined approach is then used to predict the post-test scores of students.

This paper is organized as follows. In Section 2 we discuss the idea of co-clustering in more detail and show that it can be posed as a bipartite graph partitioning problem. In Section 3 we describe a general framework in which we interleave co-clustering with the idea of generating an ensemble. In Section 4 we describe experimental results which demonstrate the validity of this approach. In Section 5 we discuss the results and describe some avenues for further work.

2. CO-CLUSTERING

Clustering is a fundamental tool from unsupervised learning for data analysis that groups together relatively homogeneous objects. The central idea is that every object can be specified by a feature vector (a point in the feature space), and the degree of homogeneity between objects can then be measured by some objective function over these feature vectors. For example, in k-means clustering the points are grouped so as to minimize a distortion function, which is the sum of distances of all points from their assigned cluster centroids [7]. Clustering algorithms are one-way, i.e. one dimension of the data (say the rows of the data matrix) is clustered based on similarities measured along the second dimension (say the columns). As pointed out in the previous section, it is frequently desirable to cluster along both dimensions simultaneously, exploiting the apparent duality between them. Such simultaneous clustering can offer interesting insights about the nature of the interaction between the clusters along the two dimensions [8]. This utility is fast making co-clustering a fundamental tool for data analysis, as indicated by its widespread use in text and document mining [9], [10]; bioinformatics and gene expression analysis [11], [12]; collaborative filtering [13]; and many other practical applications. While there are now a number of approaches to co-clustering, such as those based on spectral graph theory [10] and information theory [14], [15], each with its own advantages, we consider the approach proposed by Dhillon [10], which formulates co-clustering as a bipartite graph partitioning problem. We now briefly describe this approach, starting with the relevant notation and definitions.

2.1 Notation and Definitions
A graph is represented as G = (V, E), where V is the set of vertices and E the set of edge weights E_ij, with E_ij the weight of the edge between vertices {i, j}.

Definition 1. The n × n weighted adjacency matrix of an undirected graph is the matrix (m_ij), i, j = 1, ..., n. If m_ij = 0, the vertices v_i and v_j are not connected by an edge; if m_ij ≠ 0, the vertices {i, j} are connected and m_ij is the corresponding edge weight. Since the graph is undirected, m_ij = m_ji necessarily.

Definition 2. Given the weighted adjacency matrix of a graph and a partition of the vertex set V into two disjoint subsets V_1 and V_2, the cut between these two subsets is defined as

cut(V_1, V_2) = \sum_{i \in V_1,\, j \in V_2} m_{ij}.

An undirected bipartite graph is a triple G = (S, F, E), where S and F are two sets of vertices and E is the set of edges. Since the graph is bipartite, every edge in E has one endpoint in S and the other in F. In our case S is the set of students and F is the set of features; the set of features can equally be viewed as a set of item responses. If F is a set of items, then an edge between s_i and f_j exists if that item was answered correctly by the student and not otherwise. More generally, if F is just a set of features, the edge {s_i, f_j} simply carries the value of that feature, scaled between 0 and 1, for that student.

Given this definition of a bipartite graph, we now define its adjacency matrix. Consider an m × n data matrix A with students on the rows and the items or features on the columns. The adjacency matrix of the bipartite graph is then

M = \begin{bmatrix} 0 & A \\ A^{T} & 0 \end{bmatrix}.

The zeros in the top-left and bottom-right sub-matrices signify the absence of connections among the elements of S and of F respectively (since edges in a bipartite graph can only run between S and F). With A in the top-right corner and A^T in the bottom-left, the first m rows of M represent the set of students and the next n rows represent the set of features or items.

Suppose the bipartite graph whose adjacency matrix is defined above is partitioned into k clusters V_1, ..., V_k. Such a partitioning induces a corresponding set of student clusters S_1, ..., S_k and feature clusters F_1, ..., F_k. Intuitively, the best such clustering for all pairs is obtained when the total weight of the edges crossing between clusters is as small as possible. As defined by [10], this corresponds to

cut(V_1, ..., V_k) = \sum_{i < j} cut(V_i, V_j),

where V_1, ..., V_k is a k-partitioning of the graph. The above definition leads us to the bipartite graph partitioning problem.

Definition 3. The bipartite graph partitioning problem: given a graph as defined earlier, find two subsets V_1 and V_2 of V, of almost equal size, that achieve

\min_{V_1, V_2} cut(V_1, V_2).

The bipartite graph partitioning problem as defined above is NP-complete. However, a good relaxation is given by spectral graph bi-partitioning, achieved via the graph Laplacian. The Laplacian L of a graph is a symmetric positive semi-definite matrix whose un-normalized form is L = D − M, where D is the degree matrix and M is the adjacency matrix defined earlier. Note that D is a diagonal matrix, while M is a symmetric matrix with all zeros on the diagonal.
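To make these definitions concrete, the following short numpy sketch builds the bipartite adjacency matrix M, the degree matrix D and the un-normalized Laplacian L = D − M from a students-by-features matrix A with values scaled to [0, 1]. The function name and the dense-matrix representation are illustrative choices for this sketch, not part of the original formulation.

```python
import numpy as np

def bipartite_laplacian(A):
    """Build M, D and L = D - M for the bipartite student/feature graph.

    A is an (m, n) matrix with students on the rows and features (or item
    responses) on the columns, scaled to [0, 1].
    """
    m, n = A.shape
    # Adjacency: the first m rows/columns are student vertices, the last n are feature vertices.
    M = np.block([[np.zeros((m, m)), A],
                  [A.T,              np.zeros((n, n))]])
    D = np.diag(M.sum(axis=1))   # diagonal degree matrix
    L = D - M                    # un-normalized graph Laplacian (symmetric, PSD)
    return M, D, L
```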
Thus the Laplacian encodes both D and M, and it has many useful properties, such as being positive semi-definite, that make it well suited to tasks such as clustering [24]. A property of the graph Laplacian that makes it particularly suitable for clustering relates to its spectrum: the spectrum of the Laplacian unfolds the data manifold to give a lower-dimensional embedding, which can yield “better” clustering results.

Returning to the bipartite graph partitioning problem, as demonstrated by Dhillon [10] and Mohar [24], the second eigenvector of the generalized eigenvalue problem Lz = λDz gives a real relaxation to the problem of finding the minimum normalized cut Q(V_1, V_2). The normalized cut favours balanced partitions: if two different partitions have the same cut value, the normalized cut is smaller for the partition that is more balanced. It therefore favours partitions that are both balanced and have a small cut value, which makes it more suitable for tasks such as clustering [16]. This relates to the ideas above on optimal bi-partitionings as follows: we want balanced clusterings with minimum cut in order to solve the bipartite graph partitioning problem, and these are also the optimal clusterings for us. Thus the Laplacian of the bipartite graph can provide such a clustering.

2.2 Spectral Co-Clustering

Given the definitions and notions of the previous section, we now state an algorithm [10] for finding the optimal co-clusters {S_1 ∪ F_1}, ..., {S_k ∪ F_k} mentioned above. We first write down the graph Laplacian of the bipartite graph, since such an optimal clustering can be found using the Laplacian. Using L = D − M and the definitions of D and M above, the Laplacian may be written as

L = \begin{bmatrix} D_1 & -A \\ -A^{T} & D_2 \end{bmatrix}

and

D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix},

where D_1 and D_2 are the degree matrices corresponding to A and A^T respectively. If the generalized eigenvalue problem Lz = λDz is written for this bipartite-graph Laplacian and rearranged, it has been shown [10] that the resulting equations define a singular value decomposition of the normalized matrix

A_n = D_1^{-1/2} A D_2^{-1/2}.

Thus, instead of finding the eigenvector corresponding to the second smallest eigenvalue, one can find the second left and right singular vectors of A_n. The second left singular vector gives a bi-partitioning of the students (the rows of A), while the second right singular vector gives a bi-partitioning of the features (the columns). These can then be used to find the optimal bi-partition as defined above.

Algorithm 1.
1. Given the data (co-occurrence) matrix A, with values scaled to between 0 and 1, form the normalized matrix A_n.
2. Compute the second left and right singular vectors of A_n and concatenate them to form a vector z.
3. Run k-means on z to obtain a simultaneous clustering of the students and the features.

This algorithm extends to the multipartition case: instead of only the second singular vectors, the leading ⌈log2 k⌉ singular vectors (after the first) are used; the rest of the procedure remains the same. Note that the algorithm gives a simultaneous clustering of the rows and the columns, but is restricted in the sense that the numbers of row and column clusters have to be the same. We modify this by running k-means twice.
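A minimal sketch of Algorithm 1, together with the two-pass k-means modification described next, might look as follows. It uses numpy's SVD and scikit-learn's KMeans; the function name, the ⌈log2(max(k, l))⌉ choice of singular vectors and the D^{-1/2} scaling of the embedding (which follows Dhillon [10]) are implementation choices of this sketch rather than a prescription from the paper. scikit-learn also provides a SpectralCoclustering estimator based on the same algorithm, which could be used in place of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cocluster(A, k, l, random_state=0):
    """Co-cluster an (m, n) students-by-features matrix A (values in [0, 1])
    into k row (student) clusters and l column (feature) clusters.

    Follows the bipartite spectral partitioning of Dhillon [10]; names and
    defaults here are illustrative.
    """
    m, n = A.shape
    d1 = np.maximum(A.sum(axis=1), 1e-12)   # degrees of student vertices
    d2 = np.maximum(A.sum(axis=0), 1e-12)   # degrees of feature vertices
    D1_isqrt = 1.0 / np.sqrt(d1)
    D2_isqrt = 1.0 / np.sqrt(d2)
    # Normalized matrix A_n = D1^{-1/2} A D2^{-1/2}
    An = D1_isqrt[:, None] * A * D2_isqrt[None, :]
    # Skip the first (trivial) singular vector pair; keep the next
    # ceil(log2(max(k, l))) vectors for the multipartition case.
    nvec = max(1, int(np.ceil(np.log2(max(k, l, 2)))))
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    Z = np.vstack([D1_isqrt[:, None] * U[:, 1:1 + nvec],
                   D2_isqrt[:, None] * Vt[1:1 + nvec, :].T])
    # Run k-means twice on the (m + n)-row embedding Z: once for the k row
    # clusters, once for the l column clusters (the modification described next).
    row_labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(Z)[:m]
    col_labels = KMeans(n_clusters=l, n_init=10,
                        random_state=random_state).fit_predict(Z)[m:]
    return row_labels, col_labels
```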
If the number of row clusters is k and the number of column clusters is l, we run k-means on the vector z twice, once to find k clusters and once to find l clusters. The first m elements of the (m + n)-length cluster assignment vector from the first run then give the row cluster indices, and the last n elements of the assignment vector from the second run give the column cluster indices.

3. BAGGING STRATEGY

The supervised learning problem in machine learning can be roughly stated as follows: given a training set of ordered pairs of feature vectors and their associated labels (which may be discrete or continuous), the task of a learning algorithm is to learn a functional map from the feature space to the label space. A learning algorithm is said to be more powerful if the mappings it learns generalize well, making correct predictions on test points it was not trained on. Since the functional map under consideration might be highly non-linear, learning algorithms that output only a single mapping (frequently referred to as the hypothesis) can suffer from statistical, computational and representational issues that prevent them from learning good mappings. One way of addressing this problem is to transform the feature space into a more suitable and “richer” representation, such that learning on the new representation gives much better functional maps than the original representation. This is the motivation behind deep learning methods, which have caused a new wave of excitement in the machine learning community since 2006 [17]. Another way of addressing the problem, at least partly, is ensemble learning [18], [19], [20]. The basic idea behind ensemble methods is to run a “base learning algorithm” multiple times, each time with some change in the representation of the input (e.g. considering only a subset of features in each run), so that a number of diverse predictions (or maps) are obtained. This diversity in prediction is then exploited to obtain better predictions. Thus ensemble methods approach the problem both by learning multiple functional maps and by learning a more distributed, and hence “richer”, representation of the input space at the same time. In the next section we describe a method that uses clustering for bootstrapping.

3.1 Clustering for Bootstrapping

In earlier work we introduced the idea of using clustering for bootstrapping [3], [4], [5], [6]. The idea is quite unlike other bagging methods, which bootstrap on random subsets; it therefore has the potential advantage that the subsets used to bootstrap are more interpretable. Before generalizing this methodology using co-clustering, we briefly describe it using clustering. The training set is first clustered into k disjoint clusters. A linear regression model is trained on each cluster using only the training points assigned to that cluster. Since each such linear regression represents only one cluster, we call it a cluster model. Thus, for a given k, there are k cluster models. Since the clusters are mutually exclusive, the training set is represented by all the cluster models taken together; this collection is called a prediction model (PM_k). For an incoming test point on which a prediction is to be made, we first identify the cluster to which the point belongs (a sketch of this setup follows below).
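For concreteness, a minimal sketch of a single prediction model PM_k built this way (k-means cluster models, one linear regression per cluster) is given below. Routing test points to the cluster with the nearest centroid is an assumption of this sketch, as are the class and attribute names; the original work may have used a different assignment rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class ClusterPredictionModel:
    """Prediction model PM_k: one linear regression per k-means cluster (sketch)."""

    def __init__(self, k, random_state=0):
        self.k = k
        self.kmeans = KMeans(n_clusters=k, n_init=10, random_state=random_state)

    def fit(self, X, y):
        labels = self.kmeans.fit_predict(X)
        # One "cluster model" per cluster, trained only on its own points.
        self.models_ = [LinearRegression().fit(X[labels == c], y[labels == c])
                        for c in range(self.k)]
        return self

    def predict(self, X):
        labels = self.kmeans.predict(X)   # nearest-centroid cluster assignment (assumed rule)
        preds = np.empty(len(X))
        for c in range(self.k):
            mask = labels == c
            if mask.any():
                preds[mask] = self.models_[c].predict(X[mask])
        return preds
```

Averaging the predictions of PM_1, ..., PM_K (or of the first half of them) then gives the bagged prediction described next.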
After the cluster has been identified, the corresponding cluster model is used to make the prediction for that point. Note that we have not specified the number of clusters: we can change the granularity of the clustering from 1 to some high value, say K. Each choice yields a different prediction model (a special case is PM_1, in which a single linear regression model is trained on the entire dataset). We thus obtain a set of K prediction models, each of which makes a separate prediction on the test set. Since the granularity of the clustering varies, these predictions differ, and this diversity can be exploited by averaging all (or half) of the predictions to obtain a single, much stronger prediction.

Figure 1: Finding a Prediction Model, PM_kl, with k row clusters and l column clusters.

3.2 Co-Clustering for Bootstrapping

Note that the clustering above is only one-way: bootstrapping is done by changing only the data instances available to each cluster model (by changing the number of cluster models itself), while the number of features used is always the same. A cluster is essentially a set of rows of the data matrix with all of the columns. A co-cluster, on the other hand, is a “block” of the data matrix, with a subset of rows and a subset of columns assigned to it. A co-clustering can thus be thought of as a simultaneous clustering and dimensionality reduction of the data. Note that clustering is simply the special case of co-clustering in which the columns are not clustered at all (there is only one column cluster).

Clearly, the bagging methodology above can be modified to use co-clustering. For a given number of row clusters k and column clusters l, we obtain k co-clusters, each of which has only some features assigned to it (the definition is symmetric, i.e. we could equally think of these as l co-clusters). For each co-cluster we train a separate linear regression model using only the data instances and features assigned to it, giving k co-cluster models. As in the clustering case, the combination of the k co-cluster models is considered a prediction model, which makes a single prediction on the test set. We can then vary k from 1 to some value K and l from 1 to some value L, giving a total of K × L prediction models. We then average a subset of the predictions made by these models to obtain a much stronger prediction.

Figure 2: Ordering the Co-Cluster Prediction Models, PM_kl.

There are some interesting aspects to this co-clustering methodology. For k = 4 and l = 4, the grid in Figure 2 illustrates all the prediction models (PM_kl) that can be obtained by co-clustering. The prediction model PM_11, represented by (1, 1), is simply the case of one data cluster and one feature cluster, i.e. the original data matrix itself; its prediction model is a single linear regression trained on the entire dataset with all features. The first column of the grid is the case where the number of feature clusters is one while the number of row clusters varies; note that this is exactly the clustering methodology described in Section 3.1. The first row of the grid is equally interesting; we return to it after the sketch below.
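A sketch of how the co-cluster models behind PM_kl might be trained, building on the spectral co-clustering sketch above: each co-cluster keeps only the students and the features assigned to the corresponding clusters, and a linear regression is fit on that block. Pairing the c-th row cluster with the c-th column cluster, and falling back to all features when a cluster receives no features, are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_cocluster_models(X, y, row_labels, col_labels, k):
    """One linear regression per co-cluster, trained only on its own rows
    and its own feature subset (an illustrative sketch)."""
    n = X.shape[1]
    models = {}
    for c in range(k):
        rows = np.where(row_labels == c)[0]
        feats = np.where(col_labels == c)[0]
        if rows.size == 0:
            continue                     # no students fell into this co-cluster
        if feats.size == 0:
            feats = np.arange(n)         # assumed fallback: keep all features
        models[c] = (feats, LinearRegression().fit(X[np.ix_(rows, feats)], y[rows]))
    return models
```

A test point is routed to a row cluster as before (e.g. by nearest centroid) and the prediction is made by that co-cluster's regression, restricted to its feature subset; varying k and l then yields the K × L prediction models whose ordered predictions are averaged as described in Section 3.3.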
In the first row of the grid the number of row clusters is always one, i.e. the entire dataset is considered in every co-cluster, while the number of column clusters is successively changed. This is a sort of step-wise regression: a linear regression is trained on the entire dataset, but the number of features used to train it changes (usually shrinking as l increases). All other cases are a cross between these two extremes. It therefore seems plausible that a bagging strategy using co-clustering, if averaged properly, has more predictive power, since it generates diversity by considering a different subset of data instances and features each time and consequently produces a much larger set of predictions.

3.3 Blending Predictions

As mentioned before, the method for combining the predictions returned by the various prediction models is a naive averaging strategy. When the prediction models were generated by clustering (PM_k), we either averaged the first K/2 predictions (where K was the maximum number of clusters) [6] or learned the best number of prediction models to average by internal cross-validation [6]. The averaging is not immediately straightforward when co-clustering is used to generate the prediction models, because the prediction models are obtained by changing two parameters. It is also observed that prediction models with high k or l return poor accuracies, so it would not be useful to average predictions from all the PM_k1 models first, then the PM_k2 models, and so on (i.e. traversing the grid row-wise or column-wise). Since high values of k and l are counter-productive, we order the prediction models so that k and l increase uniformly; this ordering is illustrated by the curve in Figure 2. The first half of this reordered set of predictions is then averaged.

4. EXPERIMENTAL VALIDATION

In this section we report experimental results for using co-clustering for bagging and compare them with the benchmark (PM_11) and with clustering alone.

4.1 Dataset Description and Context

We primarily experiment with two datasets in this study. The data were collected to study whether dynamic assessment, which has long been advocated as an effective method of assessment, is actually better than traditional static assessment [21], [22]. Dynamic assessment is an interactive approach to student assessment based primarily on how much help a student requires during a practice test, whereas traditional static testing only takes into account the percentage of questions that the student answers correctly. Feng et al. [23] showed that features recording only how much assistance a student received while interacting with a tutor were better predictors of student performance on post-tests held later in the year than how many questions students answered correctly. This was confirmed in subsequent studies [4], [5]. Thus, if co-clustering is able to improve predictions, this study would further support the idea that dynamic testing is indeed better than static testing and that we can improve upon PM_11. Note that PM_11 corresponds to the results reported in [23], which were better than static assessment: it is the condition in which all the dynamic features are considered and all of the training set is used to train a single predictor.
The datasets come from the 2004-05 and 2005-06 school years, the first two full years in which ASSISTments.org was used in schools in Massachusetts. ASSISTments is an e-learning tutoring system developed at Worcester Polytechnic Institute which assesses students as it assists them. These datasets contain features that measure the interaction of students with the tutor, together with their actual final grades, obtained at the end of the year on the Massachusetts state test (MCAS). There are a total of six features in these datasets:
1) DA Original Count: the number of questions that students answered with assistance in the dynamic condition.
2) DA Original Percent Correct: the percentage of the questions in feature 1 that students answered correctly.
3) DA Scaffold Percent Correct: the percentage of tutorial help questions that students answered correctly.
4) DA Average Time: the average time a student spends on a question.
5) DA Average Attempt: the average number of attempts students made per question.
6) DA Average Hints: the average number of hints that students used.
The task is to use these interaction features to predict the MCAS scores that students obtain at the end of the school year. The static-condition feature is the percentage of questions answered correctly in static testing; this feature is never used for making predictions in the dynamic condition. The 2004-05 set (ASSISTments 2004-05) covers 628 students, while the 2005-06 set (ASSISTments 2005-06) covers 761 students.

For experimentation we perform five-fold cross-validation on each dataset and report results for the base condition (PM_11) and for the various blended results obtained by averaging as discussed in Section 3.3. For comparison we also include results with k-means clustering. In both cases we consider the ensembled results, with the top K predictions averaged as described in [4], [5] and in Section 3.1. Following [4] and [5], we report results in terms of the mean absolute difference (MAD).

Figure 3: Performance on the 2004-05 Set.
Figure 4: Performance on the 2005-06 Set.

Finally, for pre-processing: as mentioned in Section 2, to obtain a bipartite partitioning A must contain values that are either binary or scaled between 0 and 1. Thus, within each fold every feature column is scaled to between 0 and 1, so that A_n can be treated as a co-occurrence matrix. This marks a slight difference from earlier papers, in which feature scaling mapped all data points to between −1 and 1 using the mapminmax command of MATLAB; this difference may cause a small variation in the results.

4.2 Experimental Results

We first report results on the ASSISTments 2004-05 dataset. The five-fold cross-validated results using co-clustering are reported in Figure 3. The number of row clusters (k) and the number of column clusters (l) were each restricted to 4, resulting in 16 prediction models. The x-axis of the graph shows the first eight prediction models obtained by co-clustering, while the y-axis gives the mean absolute error. We observe that the accuracy of co-clustering alone is quite poor (the blue line) compared to the baseline (PM_11, which is the result at x = 1 in this graph; note that the baseline is the dynamic condition of Feng et al. [23]). These predictions are those given by the first elements of the ordered set of co-cluster prediction models defined in Section 3.3.
However, successively averaging these prediction models gives better and better predictions (the red line). Similar results are seen on the ASSISTments 2005-06 dataset, shown in Figure 4; on this dataset the individual prediction models are far worse relative to the ensembled results than on the previous dataset. Again we obtain 16 prediction models after co-clustering and successively average the first eight (the first with the second, then the first with the second and third, and so on) after arranging them in the order suggested in Section 3.3. Again, the ensembled results do much better than the baseline (exact figures and significance are given in Tables 1 and 2).

In Table 1 we compare the mean absolute errors when the predictions of the first five prediction models are bagged. We report results when the prediction models are obtained both by co-clustering and by k-means clustering on the ASSISTments 2004-05 dataset. Figures in bold indicate statistical significance over the baseline prediction on a paired t-test. Table 2 compares the predictions obtained by co-clustering and k-means for bagging on the ASSISTments 2005-06 dataset. The results are significantly better than the baseline and also indicate that the dynamic assessment condition yields a much better prediction of student test scores than the static condition. It has already been noted by [23] and [4] that the static-condition results are significantly worse than even the baseline, so we do not report results for the static condition.

Table 1: Comparison of predictions based on k-means and co-clustering for the ASSISTments 2004-05 dataset. Figures in bold indicate significance over the baseline on a paired t-test. Numbers are mean absolute errors. Prediction Model 1 corresponds to the baseline.

5. DISCUSSION AND FUTURE WORK

The datasets used to validate this co-clustering-based bagging technique were not very large and did not have a large number of columns.

Table 2: Comparison of predictions based on k-means and co-clustering for the ASSISTments 2005-06 dataset. Figures in bold indicate significance over the baseline on a paired t-test. Numbers are mean absolute errors. Prediction Model 1 corresponds to the baseline.

Thus, these results were initially surprising: one would imagine that in a dataset with a small number of features, feature selection might not help much. Our experiments show otherwise. The results we obtain, while modest improvements, show that this technique, though simple, gives access to a novel source of variance in the data. It can potentially also have some nice properties in terms of returning simpler and more interpretable groups. For example, it was pointed out earlier that one row of the grid of prediction models is essentially a linear regression in which features are successively eliminated, while one column of the grid consists of exactly the prediction models obtained by clustering alone, as reported in previous work. It would be interesting to see how the co-clusters (which are blocks in the data matrix) of a student-item dataset would pair clusters of student proficiency with clusters of item performance, which could be seen as a sort of subject-treatment interaction.
In the literature it has been noted that the real strength of co-clustering lies with binary-valued data, co-occurrence tables and, generally, scenarios involving collaborative filtering. Hence, datasets that are essentially a student-by-item matrix would be ideal candidates for this technique. In the KDD Cup 2010, Toscher and Jahrer modelled student response data as a collaborative filtering task and used matrix factorization techniques for it. Given the connections between co-clustering and matrix factorization, it is worth investigating how useful co-clustering could be in such a setting. In [3], the authors clustered students based on tutor interaction features and then trained separate Knowledge Tracing models for students based on the cluster they were in; this was done because it was not possible to cluster the item sequences directly, so an indirect approach had to be taken. The co-clustering technique presented here offers an alternative by which such matrices might be clustered more readily, without the need to cluster the tutor interaction features.

In summary, in this paper we propose a bagging technique that uses co-clustering and demonstrate that its performance is better than that obtained by bagging using clustering. We also suggest that it is most suitable for datasets that resemble co-occurrence tables, and believe this is a good direction for future work, since student-item datasets are usually of this form.

Acknowledgements. The authors ar
