The monitoring and support of university freshmen is considered very important at many educational institutions. In this paper we describe the results of an educational data mining case study aimed at predicting whether Electrical Engineering (EE) students will drop out after the first semester of their studies, or even before they enter the study program, as well as at identifying success factors specific to the EE program. Our experimental results show that rather simple and intuitive classifiers (decision trees) give useful results, with accuracies between 75 and 80%. In addition, we demonstrate the usefulness of cost-sensitive learning and of a thorough analysis of misclassifications, and show a few ways to further improve the predictions without having to collect additional data about the students.
"1. The OneRule classifier reached the accuracy of 68% taking the VWO Science mean as a predictor. None of the other classification algorithms was able to learn a model which would outperform it (statistically) significantly. Attribute ranking (with respect to the class attribute) according to the information gain criterion showed that the VWO Science mean, VWO main and VWO Math mean were by far the best attributes in information gain (information gains 0.16, 0.13, 0.12 respectively), with the next “closest†attribute VWO Year lagging behind (0.05). Furthermore, these three attributes are highly correlated and therefore it is logical to expect it would be hard to learn a more complex and yet generalizable classifier with a relatively small dataset. Learning a classifier with feature selection also does not improve the results a lot. Learning a J48 tree using only the three mentioned attributes gives an average accuracy of 71%. Table 1. Classification accuracy on pre-university dataset. The same classification techniques were applied to the dataset with the university grades (Table 2). The OneRule algorithm results in the classifier which checks the grade for Linear Algebra (LinAlgAB), and decides positive if this grade is bigger than 5.5 (that is exactly the minimum for passing a course). Again we can see that more sophisticated classification techniques do not improve accuracy very much. However, it is worth noticing that the CART classifier is statistically significantly better than the base line with a classification accuracy that is 4.8% higher on average. Table 2. Classification accuracy on university grades dataset. The CART classifier learnt a compact tree with five leaves and uses LinAlgAB as root of the tree, and CalcA, Calc1 and Project nAttempts as further discriminators. It is worth noticing that the grades of the Networks course are not used at all, while some of its attributes have higher information gains. Correlation analysis however does show that correlation between Linear Algebra and Networks attributes is rather strong, but weak between Linear Algebra and Calculus attributes. 3.2 Classification with complete data. Classification accuracies for the dataset containing both pre-university and university related data are shown in Table 3 (column indexes correspond to those in Tables 1 and 2). Table 3. Accuracy and rates of total dataset. It can be seen that these accuracies are comparable with those achieved on the dataset with university related data only. Apparently, the pre-university data does not add much independent information that can improve classification accuracy. However, we can see that the trees learnt with J48 are now statistically significantly better than the base line model. The other tree-based classifiers also achieve reasonable accuracy, while the Bayes Net and JRip algorithms slightly fall behind. To get a better insight on the performance of classifiers, the scoring of the algorithms is shown in more detail now. A remarkable fact is that the base line model has a higher false negative rate than all other models. This is an interesting finding, because according to the student counselor it is better to give an erroneous positive advice to a student who should actually be classified as negative, than to give a erroneous negative advice to a student who should be classified as positive. Cost-sensitive learning can be used to balance classification accuracies or boost the accuracy for a particular type of prediction. 
3.3 Boosting accuracy with cost-sensitive learning

In order to "advise" a classification algorithm to prefer one type of misclassification over another, a cost matrix (which maps directly onto the confusion matrix) is commonly used as an input to a meta classifier:

                   classified as negative   classified as positive
actual negative    C(−,−)                   C(−,+)
actual positive    C(+,−)                   C(+,+)

By choosing the weights C(i, j) in a certain way we can achieve a more balanced classification in case of severe class imbalance (using the diagonal entries), or a more cost-effective classification (using the off-diagonal entries). Since cost matrices are equivalent under scaling, and we only want to increase the cost of false negatives over false positives, it suffices to build a matrix with a single free coefficient and structure [[0 1] [C 0]], with C > 1.

Since our experiments favored tree-based learners, we used J48, J48graft and CART as base classifiers in Weka's CostSensitiveClassifier. To prevent the trees from growing too big, we used the CfsSubsetEval feature subset selection algorithm, which tries to select the most predictive attributes with low intercorrelation. The J48 and J48graft classifiers were forced to have at least 10 instances per node in order to prevent overfitting and unnecessarily complex models. Combining CART, J48 and J48graft with the two ways of using the cost matrix in the cost-sensitive approach (data weighing and model cost), six experiments were conducted, using the F measure (with β = 1.5) to define the precision-recall tradeoff. For each combination, the settings giving the highest F measure are presented in Table 4. The tree learnt with the "plain" J48 is presented in the first data column.

The results indicate that it is necessary to sacrifice some of the achieved accuracy in order to shape the misclassifications. Only model 5 achieves both a high accuracy and a high F measure; all other models lose accuracy when F is increased. During the experiments it became clear that there is not much room for improvement: if recall increased beyond 85%, the overall accuracy became unacceptable. The only exception is model 7 (note that this tree is much larger than the other models and also seems too detailed to be meaningful for decision making). In some cases, small trade-offs could be made by changing C. Compare for instance model 5 with model 6: a three percentage point drop in accuracy gives a three percentage point rise in recall. The created decision trees are remarkably similar: in every tree the LinAlgAB attribute is dominant, with CalcA as the first node in most cases. When NetwB is chosen as the first node, the recall is lower, although the difference is too small to draw decisive conclusions.

Table 4. Accuracy results with cost-sensitive learning.
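A minimal sketch of this setup with the Weka Java API is given below, assuming a recent Weka version, a hypothetical students.arff file with the class as the last attribute, class index 1 as the positive ("successful") class, and an illustrative cost C = 3. Weka's Evaluation only reports F1 directly, so the Fβ value with β = 1.5 is computed from precision and recall as Fβ = (1 + β²)·P·R / (β²·P + R).

```java
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class CostSensitiveDropout {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("students.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // CfsSubsetEval keeps predictive attributes with low intercorrelation.
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new CfsSubsetEval());
        select.setSearch(new BestFirst());
        select.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, select);

        // Cost matrix [[0 1] [C 0]]: rows are the actual class, columns the
        // predicted class, so false negatives cost C = 3 (illustrative value).
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0);
        costs.setCell(1, 0, 3.0);

        J48 base = new J48();
        base.setMinNumObj(10);              // at least 10 instances per node

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(base);
        csc.setCostMatrix(costs);
        csc.setMinimizeExpectedCost(false); // false = data weighing, true = model cost

        Evaluation eval = new Evaluation(reduced, costs);
        eval.crossValidateModel(csc, reduced, 10, new Random(1));

        // Weka reports F1 only, so compute F-beta (beta = 1.5) by hand.
        double p = eval.precision(1), r = eval.recall(1), beta = 1.5;
        double fBeta = (1 + beta * beta) * p * r / (beta * beta * p + r);
        System.out.printf("accuracy %.1f%%  recall %.2f  F%.1f %.2f%n",
            eval.pctCorrect(), r, beta, fBeta);
    }
}
```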
4 Further evaluation of the obtained results

As the final step, we examined one of the models (model 7 from Table 4) in more detail to see whether we could gain a better understanding of the classifier's errors. The student counselor compared all wrongly classified instances of model 7 with the advice he had given himself, to check for interesting patterns. One of the first things assessed was whether the learned model is incorrect or the classification criterion was chosen incorrectly. To examine this, two methods were used. Firstly, the false negative and false positive sets were checked manually by the student counselor. His conclusion was that about 25% of the false negatives should be true negatives instead. This finding might indicate a wrong classification measure. Concerning the false positive set the conclusion is less obvious: about 45% of this set was classified as positive by the student counselor as well as by the tree, but did not meet the classification criterion. A substantial subset of these students had chosen not to continue their bachelor program in Electrical Engineering although all indications for a successful continuation were present. Qualifying these students as false positives does not seem appropriate. So from this evaluation based on domain expertise we can conclude that some of the mistakes might be due to the classification measure, while others cast doubt on the learned model itself.

The second way to check the viability of the model is to compare the results obtained with this classifier against the three-class classification problem, i.e. first identifying manually a third, so-called risk group and then checking whether the wrongly classified students fall into the risk class (which would indicate that the learned model is actually more accurate and that it mainly has difficulties with students who are hard to classify into the success or failure categories per se). However, we observe that only 25% of the misclassified instances are in this category. It should be noted that this is still twice as high as the proportion of risk students in the total dataset. Therefore, this also indicates that the learned model should be improved. Furthermore, 25% of the instances in the false positive class would be classified as good using the three-class classification, indicating a real difference between the two classifiers. So from this test we can also conclude that both the model and the classification criterion should be revised.

After this analysis of the errors, the misclassified sets were looked up in the database to search manually for meaningful patterns. A very clear pattern popped up immediately: almost all misclassified students did not have a database entry for LinAlgAB (and were therefore mapped to zero). Checking individual students showed that there are several possible reasons for a zero value in the LinAlgAB record: a) a student might be from a cohort in which the LinAlgAB exam was held in January or later; b) a student might not have shown up for the exam; and c) a student might have taken another route to obtain the LinAlgAB grade: in some years it was possible to bypass the regular exam by taking the subexams LinAlg1, LinAlg2, LinAlg3, LinAlg4 and LinAlg5. A student succeeding along this path may well be an excellent student, but still gets a zero mark for the LinAlgAB attribute. Due to this effect, 216 of the 516 students have a zero entry in their LinAlgAB record (of which 155 instances were classified as unsuccessful and 61 as successful). Moreover, the same effect plays a role for the other courses too. Given the dominant position of the LinAlgAB attribute in the decision trees generated in Section 3.3, attempts at completing the dataset should be considered worthwhile.
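One simple remedy, sketched below under the same assumptions as before (hypothetical students.arff file, attribute named LinAlgAB), would be to re-code the zero sentinel as Weka's missing-value marker before learning, so that the tree learners apply their own missing-value handling instead of treating 0 as a real failing grade. This is only an illustration of the idea, not the procedure used in the study.

```java
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroToMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("students.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // LinAlgAB was mapped to 0 when no regular exam result existed;
        // treat those entries as missing instead of as a failing grade.
        Attribute linAlg = data.attribute("LinAlgAB");
        int changed = 0;
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (!inst.isMissing(linAlg) && inst.value(linAlg) == 0.0) {
                inst.setMissing(linAlg);
                changed++;
            }
        }
        System.out.println(changed + " zero entries re-coded as missing");
        // J48 and CART can then rely on their built-in missing-value
        // strategies when splitting on this attribute.
    }
}
```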
5 Conclusions and Future work

Student drop out prediction is an important and challenging task. In this paper we presented a data mining case study demonstrating the effectiveness of several classification techniques and of the cost-sensitive learning approach on a dataset from the Electrical Engineering department of Eindhoven University of Technology. Our experimental results show that rather simple classifiers give useful results, with accuracies between 75 and 80%, which are hard to beat with more sophisticated models. We demonstrated that cost-sensitive learning does help to bias the classification errors towards preferring false positives over false negatives.

Surprisingly (according to the student counselor), the strongest predictor of success is the grade for the Linear Algebra course, which has in general not been seen as the decisive course. Other strong predictors are the grades for Calculus and Networks and the mean grade for the VWO Science courses. The most relevant information is collected at the university itself: the pre-university data can be summarized into a few attributes.

The in-depth model evaluation pointed to three major improvements that can be pursued. Firstly, a key improvement for this dataset would be to find a solution for the changing course organization over the years it covers. Aggregating the available information about student performance in a course in a way that can be used for all students in the dataset might prevent the type of misclassification that is now strongly prevalent. A second, related improvement would be a better way to encode grades in general. Mapping all unknown or unavailable information to zero proved not to be effective; in particular, Linear Algebra grades should be available. A more advanced solution for dealing with missing values can also be considered in this respect. The quality of the classification criterion is the third improvement that might be considered. The simple binary classification used in this study has some disadvantages: a negative classification can only be given after three years, and there is no guarantee that a student who does not obtain his propedeuse within three years will be unsuccessful in the long run. Also, students who do not receive a propedeutical diploma should not necessarily be "disqualified": they may have had different motives to discontinue their studies. This touches on a more fundamental topic: it is not easy to find an objective way of classifying students. In this paper we experimented with the so-called 0/1 loss and cost-sensitive classification; AUC optimization is one of the directions for further work.

As a final remark we would like to point out that this study shows that learning a model on less rich datasets (i.e. having only pre-university and/or first-semester data) can also be useful, provided the data preparation steps are carried out carefully.