Data Mining Algorithms to Classify Students

InProceedings

In this paper we compare different data mining methods and techniques for classifying students based on their Moodle usage data and the final marks obtained in their respective courses. We have developed a specific mining tool to make the configuration and execution of data mining techniques easier for instructors. We have used real data from seven Moodle courses with Cordoba University students. We have also applied discretization and rebalancing preprocessing techniques to the original numerical data in order to verify whether better classifier models are obtained. Finally, we claim that a classifier model appropriate for educational use has to be both accurate and comprehensible for instructors in order to be of use for decision making.

Fig. 1. Moodle Data Mining Tool executing the C4.5 algorithm.

In order to use the tool, instructors first have to create training and test data files starting from the Moodle database. They can select one or several courses and one Moodle table (mdl_log, mdl_chat, mdl_forum, mdl_quiz, etc.) or create a summary table (see Table 1). The data files are then preprocessed and created automatically. Next, they only have to select one of the available mining algorithms and the location of the output directory. For example, Figure 1 shows the execution of the C4.5 algorithm over a summary file and the decision tree obtained. The result files (.tra and .test files with partial results and a .txt file with the obtained model) appear in a new window (see Figure 1, bottom right-hand corner). Finally, instructors can use this model for decision making concerning the suitability of the Moodle activities in each specific course, and also to classify new students depending on their course usage data.

4 Experimental Results

We have carried out some experiments in order to evaluate the performance and usefulness of different classification algorithms for predicting students' final marks from their usage data in an e-learning system. Our objective is to classify students with equal final marks into different groups depending on the activities carried out in a web-based course. We have used the data of 438 Cordoba University students in 7 Moodle courses (security and hygiene at work, projects, engineering firm, programming for engineering, computer science basics, applied computer science, and scientific programming). Moodle (http://moodle.org) is one of the most frequently used free Learning Content Management Systems (LCMS). Moodle keeps detailed logs in a database of all the activities that students perform, so information is available about the use of Moodle activities and resources (assignments, forums and quizzes).

We have preprocessed the data in order to transform it into a format suitable for our Moodle data mining tool. First, we have created a new summary table (see Table 1) which integrates the most important information for our objective (the Moodle activities and the final mark obtained in the course). Using our Moodle mining tool, a teacher could select these or other attributes for different courses during the data preprocessing phase. Table 1 summarises, row by row, all the activities done by each student in the course (input variables) and the final mark obtained in the course (class).

Table 1. Attributes used by each student.

Secondly, we have discretized all the numerical values of the summary table into a new summary table. Discretization divides the numerical data into categorical classes that are easier for the teacher to understand: continuous attributes are transformed into discrete attributes that can be treated as categorical ones. Discretization is also a requirement for some algorithms. We have applied the manual method (in which the cut-off points have to be specified) to the mark attribute, using four intervals and labels (FAIL if the value is <5; PASS if the value is >=5 and <7; GOOD if the value is >=7 and <9; EXCELLENT if the value is >=9). In addition, we have applied the equal-width method [13] to all the other attributes, with three intervals and labels (LOW, MEDIUM and HIGH).
Then, we have exported both versions of the summary table (with numerical and with categorical values) to text files in KEEL format [1]. Next, we have partitioned both files (numerical and categorical) into pairs of training and test files. Each algorithm is evaluated using stratified 10-fold cross-validation: the dataset is randomly divided into 10 disjoint subsets of equal size in a stratified way (maintaining the original class distribution). In each repetition, one of the 10 subsets is used as the test set and the other 9 subsets are combined to form the training set.

In this work we also take into consideration the problem of learning from imbalanced data. Data is said to be imbalanced when some classes differ significantly from others with respect to the number of instances available. The problem arises because learning algorithms tend to overlook less frequent classes (minority classes), paying attention only to the most frequent ones (majority classes); as a result, the classifier obtained will not be able to correctly classify data instances corresponding to poorly represented classes. Our data presents a clear imbalance, since its distribution is: EXCELLENT 3.89%, GOOD 14.15%, PASS 22.15%, FAIL 59.81%. One of the most frequent methods used to learn from imbalanced data consists of resampling the data, either by over-sampling the minority classes or by under-sampling the majority ones, until every class is equally represented [3].

When we deal with balanced data, the quality of the induced classifier is usually measured in terms of classification accuracy, defined as the fraction of correctly classified examples. But accuracy is known to be unsuitable for measuring classification performance with imbalanced data. An evaluation measure well suited to imbalanced data is the geometric mean of the accuracies per class (g-mean), defined as

g\text{-mean} = \left( \prod_{i=1}^{n} \frac{hits_i}{instances_i} \right)^{1/n}

where n is the number of classes, hits_i is the number of instances of class i correctly classified and instances_i is the number of instances of class i. In our work we have used random over-sampling, a technique that copies randomly chosen instances of the minority classes until all classes have the same number of instances, and we use the geometric mean to measure the quality of the induced classifiers (a small stand-alone sketch of both steps is given below).

Finally, we have used three sets of 10-fold data files: the original numerical data, the categorical data and the numerical rebalanced data. We have carried out one execution for each deterministic algorithm and 5 executions for each nondeterministic algorithm. Table 2 shows the global percentage of the accuracy rate and the geometric means (averages of the 5 executions for the nondeterministic algorithms). We have used the same default parameters for algorithms of the same type (for example, 1000 iterations in evolutionary algorithms and 4 labels in fuzzy algorithms). We have used these 25 specific classification algorithms because they are implemented in the KEEL software, although other classification techniques exist, such as Bayesian networks, logistic regression, etc.

The global percentage of correctly classified instances (global PCC) shows the accuracy of the classifiers (see Table 2). More than half of the algorithms obtain their highest values using the original numerical data, and the rest obtain them using the categorical data. This can be due to the nature and implementation of each algorithm, which may be more appropriate for numerical or for categorical data.
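To make the over-sampling and g-mean steps concrete, the following stand-alone Python sketch implements both. It is an illustrative sketch only, not the KEEL implementation used in the experiments, and the toy labels merely mimic the imbalance of the mark attribute.

```python
import random
from collections import Counter

def random_oversample(instances, labels, seed=0):
    """Copy randomly chosen instances of the minority classes until every
    class has as many instances as the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(instances), list(labels)
    for cls, count in counts.items():
        indices = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - count):
            i = rng.choice(indices)
            out_x.append(instances[i])
            out_y.append(labels[i])
    return out_x, out_y

def g_mean(true_labels, predicted_labels):
    """Geometric mean of the per-class accuracies:
    g-mean = (prod_i hits_i / instances_i) ** (1 / n)."""
    classes = sorted(set(true_labels))
    product = 1.0
    for cls in classes:
        hits = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if t == cls and p == cls)
        instances = sum(1 for t in true_labels if t == cls)
        product *= hits / instances
    return product ** (1.0 / len(classes))

# Toy example: FAIL is the majority class, as in our data.
y_true = ["FAIL"] * 6 + ["PASS"] * 2 + ["GOOD"] + ["EXCELLENT"]
y_pred = ["FAIL"] * 6 + ["PASS", "FAIL", "GOOD", "GOOD"]

print(g_mean(y_true, y_pred))   # 0.0: no EXCELLENT instance is classified correctly

# Over-sampling toy instances (here just indices) until all classes are equal in size.
x_bal, y_bal = random_oversample(list(range(len(y_true))), y_true)
print(Counter(y_bal))           # every class now has 6 instances
```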
As we have seen above, it is easier to obtain a high accuracy rate when data is imbalanced, but when all the classes have the same number of instances it becomes more difficult to achieve a good classification rate. The best algorithms (with more than 65% global PCC) on the original (numerical) data are CART, GAP, GGP and NNEP. The best algorithms (with over 65% global PCC) using categorical data are the two decision tree algorithms, CART and C4.5. The best algorithms (with over 60% global PCC) with balanced data are Corcoran, XCS, AprioriC and MaxLogicBoost. It is also important to note that no algorithm exceeds a 70% global percentage of correctly classified instances. One possible reason is that we have used incomplete data: we have used the data of all the students examined, even though some students who did not do all the course activities did sit the final exam. In particular, about 30% of our students have not used the forum or have not done some of the quizzes. We have not eliminated these students from the dataset because this reflects a real problem concerning students' level of usage of e-learning systems. So, we have used all the data, even though we know this can affect the accuracy of the classification algorithms.

Table 2. Classification results (global percentage of correctly classified / geometric mean).

The geometric mean tells us about the effect of rebalancing on the performance of the classifiers obtained, since it offers a better view of the classification performance in each of the classes. We can see in Table 2 that the behaviour depends to a great extent on the learning algorithm used. Some algorithms are not affected by rebalancing (Kernel, KNN, AprioriC, Corcoran, AdaBoost and LogitBoost); the two decision tree methods (CART and C4.5) give worse results with rebalanced data; but most of the algorithms (all the rest, 17 out of 25) obtain better results with the rebalanced data. Thus, rebalancing the data is generally beneficial for most of the algorithms.

We can also see that many algorithms obtain a value of 0 for the geometric mean. This happens because some algorithms do not correctly classify any of the students in a specific group. Interestingly, it only happens for the group of EXCELLENT students (EXCELLENT students are incorrectly classified as GOOD and PASS students). In education this is not very dramatic after all, since the most important thing is to be able to distinguish clearly between FAIL students and passing students (PASS, GOOD and EXCELLENT).

On the other hand, in our educational problem it is also very important for the classification model obtained to be user friendly, so that teachers can make decisions about individual students and about the on-line course in order to improve the students' learning. In general, models obtained using categorical data are more comprehensible than those obtained using numerical data, because categorical values are easier for a teacher to interpret than precise magnitudes and ranges. Nonetheless, some models are more interpretable than others:

- Decision trees are considered easily understood models because a reasoning process can be given for each conclusion. However, if the tree obtained is very large (many nodes and leaves), it becomes less comprehensible. A decision tree can be directly transformed into a set of IF-THEN rules, one of the most popular forms of knowledge representation due to their simplicity and comprehensibility (a small sketch follows this list). So, the C4.5 and CART algorithms are simple for instructors to understand and interpret.

- Rule induction algorithms are normally also considered to produce comprehensible models, because they discover a set of IF-THEN classification rules that are a high-level knowledge representation and can be used directly for decision making. Some algorithms, such as GGP, have a higher expressive power, allowing the user to determine the specific format of the rules (number of conditions, operators, etc.).

- Fuzzy rule algorithms obtain IF-THEN rules that use linguistic terms, which makes them more comprehensible/interpretable by humans. This type of rule is therefore very intuitive and easily understood by problem-domain experts such as teachers.

- Statistical methods and neural networks are deemed less suitable for data mining purposes because of their lack of comprehensibility: knowledge models obtained under these paradigms are usually considered black-box mechanisms, able to attain very good accuracy rates but very difficult for people to understand. However, some algorithms of this type do obtain models that people can understand easily. For example, the ADLinear, PolQuadraticLMS, Kernel and NNEP algorithms obtain functions that express possible strong interactions among the variables.
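As an illustration of the first point in the list above, the sketch below trains a small CART-style decision tree with scikit-learn (an assumption; the experiments in this paper use the KEEL implementations of C4.5 and CART) and prints it as nested IF-THEN conditions. The attribute names and the tiny dataset are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy categorical summary data with LOW/MEDIUM/HIGH encoded as 0/1/2.
# Attribute names are hypothetical, not the actual Table 1 attributes.
X = [
    [0, 0, 0],
    [1, 0, 1],
    [2, 1, 1],
    [2, 2, 2],
    [1, 1, 0],
    [0, 2, 1],
]
y = ["FAIL", "FAIL", "PASS", "EXCELLENT", "PASS", "FAIL"]
feature_names = ["n_quiz_passed", "n_assignment", "n_posts"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the tree as nested conditions, which map directly to IF-THEN rules.
print(export_text(tree, feature_names=feature_names))
```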
Finally, in our educational problem the ultimate objective of using a classification model is to show the instructor interesting information about student classification (prediction of marks) depending on the usage of the Moodle course. The instructor can then use this discovered knowledge for decision making and for classifying new students. Some of the rules discovered show that the number of quizzes passed in Moodle was the main determiner of the final marks, but there are others that could help the teacher decide whether to promote the use of certain activities in order to obtain higher marks or, on the contrary, to eliminate activities that are related to low marks. The teacher could also use the classification model to classify new students and detect in time whether they are likely to have learning problems (students classified as FAIL) or not (students classified as GOOD or EXCELLENT).

5 Conclusions

In this paper we have compared the performance and usefulness of different data mining techniques for classifying students using a Moodle mining tool. We have shown that some algorithms improve their classification performance when preprocessing tasks such as discretization and data rebalancing are applied, while others do not. We have also argued that a good classifier model has to be both accurate and comprehensible for instructors. In future experiments, we want to measure the comprehensibility of each classification model and to use data with more information about the students (e.g. profile and curriculum) and of higher quality (complete data about students who have done all the course activities). In this way we could measure how the quantity and quality of the data affect the performance of the algorithms. Finally, we also want to test the use of the tool with teachers in real pedagogical situations in order to assess its acceptability.

Acknowledgments. The authors gratefully acknowledge the financial support provided by the Spanish Department of Research under the TIN2005-08386-C05-02 project. FEDER also provided additional funding.
