Analyzing paths in a student database
"1. INTRODUCTION. In the recent literature, several data mining models have been proposed to understand and improve the educational performance and assessment of a student learning process. For example, in [1] the authors illustrate a classiï¬cation model to investigate the proï¬le of students which most likely leave university without ending their study; in [2], the author uses association rule mining for assessing student results. Data mining techniques have been also applied in computer-based educational systems (see, e.g., [3]). In our research we present a data mining approach to analyze a database containing information about students, in particular, their personal but anonymous data and their exams. We consider the path of a student, that is, the way the student implement her or his exams over the degree-learning time: a student can take an exam immediately after a course, the ideal choice, or later.1 The aim of this work is to understand how this order affects the performance of the students in terms of graduation time and ï¬nal grade. See [4] for a recent research in the ï¬eld of choices in learning. We consider the ideal path, i.e., the path of a student which has taken each examination just after the end of the corresponding course, without delay. Therefore, the ideal path matches the curriculum settled by the academic degree. The path of a generic student is then compared with the ideal path by using two different approaches: the ï¬rst one uses the Bubblesort distance while the second is based on the computation of the area between the two paths. Students who have taken the exams in the same order, that is, students with the same path, can have different ï¬nal grade and graduation time. The idea is to understand if there exists a relation between these distances and the success of students. If the students having small distances achieve good performance, then we may conclude that the academic degree is well structured but if there exist many good students with large distances, then can mean that the organization should be modiï¬ed. Once the distances have been computed, they can be inserted in the database, as new attributes of each student, and a clustering analysis can be performed, for example by using the K-means implementation of WEKA (see, e.g., [5]). {1} We refer to an organization which allows students to take an exam in different sessions after the end of the course, as in Italy. A drawback is that students end up graduating with a signiï¬cant delay. Some constrains between exams can be ï¬xed in order to force students to take exams in a speciï¬c order, however, usually students have many degrees of freedom. By using this methodology on our database, with both approaches and K = 2, we obtained two clusters characterized by small and large distances. The ï¬rst one corresponds to the group of students who graduated relatively quickly and with high grades; the second cluster corresponds to students who obtained worst results. Our analysis shows that the more students follow the order given by the ideal path the more they get good performance in terms of graduation time and ï¬nal grade. In conclusion, no student with a large distance achieves good results; we can conclude that the academic degree under consideration was well scheduled. 2. THE METHODOLOGY. We consider a database containing the data of N students, each student characterized by a sequence of n exams identiï¬ers. We consider a particular path I = (e1 , e2 , · · · , en ), the ideal path, corresponding to a student which has taken every examination just after the end of the corresponding course, without delay. Without loss of generality, we can assume that ei = i, i = 1, · · · , n, that is, I = (1, 2, · · · , n). The path of a generic student J can be seen as a sequence J = (eJ1 , eJ2 , · · · , eJn ) of n exams, where eJi , i = 1, · · · , n, is the identiï¬er of the exam taken by the student at time i. Therefore, J can be seen as a permutation of the integers 1 through n. In order to understand how the order of the exams affects the ï¬nal result of students, we compare a path J with I by using two different approaches. The ï¬rst approach uses the Bubblesort distance, which is deï¬ned as the number of exchanges performed by the Bubble sort algorithm to sort an array containing the numbers from 1 to n. The number of exchanges, bounded above by n(n − 1)/2, can be computed easily since it is exactly the number of inversions in the permutation. Our second approach concerns the graphical representation of the paths; we represent them in the integer lattice, the x-axis denotes the number of exam and the y-axis the exam identiï¬er, according to the order of the ideal student. The ideal path is deï¬ned by the sequence of points Ï„I = ((0, e0 ), (1, e1 ), (2, e2 ), · · · , (n, en ), (n+1, en+1 )), where e0 = 0 denotes the starting point of the path and en+1 = n + 1 denotes the ï¬nal examination taken last by all students. Therefore, Ï„I can be represented as the bisecting line of the ï¬rst quadrant. The path of a generic student J, is then represented by a broken line corresponding to the sequence of points tJ = ((0, eJ0 ), (1, eJ1 ), (2, eJ2 ), · · · , (n, eJn ), (n + 1, eJn+1 )). By convention, we have eJ0 = 0 and eJn+1 = n+1 for every student J, that is, the resulting trajectory begins at (0, 0) and ï¬nishes in the point (n + 1, n + 1) . We then compare a path J with I by computing the area AJ ,I between Ï„I and tJ . In Figure 1 we illustrate the path (2, 1, 3, 5, 4) and its distances from the ideal path (1, 2, 3, 4, 5). Figure 1: The path (2, 1, 3, 5, 4) : σ(J ) = 2 and AJ ,I = 3. 2.1 Cluster analysis. The database we analyze contains data of students in Computer Science at the University of Florence beginning their studies during the years 2001-2004 and graduated up to now. The academic degree in Computer Science at the University of Florence is structured in three years (Laurea triennale) and students can choose among different curricula. Every year is organized in two semesters; there are several courses in each semester and at the end of a semester students can take their examinations. Exams can be taken in different sessions during the year and students can try to pass their exams in any of these sessions, after the end of the course. In the years under consideration, no constrains between exams were ï¬xed, so students could take their exams almost in any order. For each student, the database contains the identiï¬er of the student, the grade obtained at the high school level, the year of enrollment at the university, the date and the mark of ï¬nal examination and the sequence of exams; each exam is described by an identiï¬er, a date and a grade. We considered students belonging to two slightly different curricula: databases and distributed systems. In particular, we analyzed the paths of N = 100 students characterized by a sequence of n = 25 exams. For each curriculum we computed the ideal path through an important pre-processing phase, which allowed us to identify the semester in which courses were originally hold by the teacher. In fact, the original database did not contain the information about the semester, which is fundamental for our purposes. In the ideal path, courses relative to the same semester were sorted by taking into account the preference of students. Therefore, we obtained two different ideal paths of length 25. Then, for each student J of both curricula we computed the distances σ(J ) and AJ ,I from the ideal path and inserted these values into two ï¬elds Bubblesort and Area of the database. We tried to sort our data according to both ï¬elds and we found some pairs of paths having values of Bubblesort and Area in reverse order. Therefore, these two distances are not completely equivalent; however, this difference seems not to be important for the clustering analysis, as we will see later. To understand how the order of the exams affects the path of the students, we have performed several tests by using the K-means implementation of WEKA. We ï¬rst analyzed the paths of students separately for the two curricula. In particular, in both cases, we obtained signiï¬cant result with K = 2 and by selecting as clustering attributes the graduation time, Time, expressed in days, and the ï¬nal grade, Grade, an integer between 66 and 110. In fact, with these parameters we can see that students are well divided into two groups: the group of students who graduated relatively quickly and with high grades and the group of students who obtained worst results, respectively. Luckily, we observed that students in the ï¬rst group are characterized by small values of Bubblesort and Area while students in the second group have larger values. Our analysis shows also that the path of a student seems not to be affected by the results achieved at the high school level. We performed similar tests by adding the distance values as attributes of clustering, and we obtained two more distinct clusters, which divide students more precisely in terms of Time, Grade and Bubblesort (or Area) distance. We ï¬nally analyzed together the students belonging to the two curricula obtaining 2 clusters with the following characteristics. This result conï¬rms that, regardless of the curriculum, the more students follow the order given by the ideal path, the more they obtain good performance in terms of graduation time and ï¬nal grade. We point out that we obtained similar results by using either distance. In conclusion, the methodology proposed in this research can be used to valuate the organization of an academic degree in terms of the scheduling of the courses. ACKNOWLEDGEMENTS. The authors would like to thank Dino Pedreschi for interesting discussions about the model presented in this paper."
Acerca de este recurso...
Visitas 166
Categorías:
0 comentarios
¿Quieres comentar? Regístrate o inicia sesión