Argument Graph Classification via Genetic Programming and C4.5

InProceedings

In well-defined domains there exist well-accepted criteria for detecting good and bad student solutions. Many ITSs implement these criteria to characterize solutions and to give immediate feedback. While this has been shown to promote learning, it is not always possible in ill-defined domains, which typically lack well-accepted criteria. In this paper we report on the induction of classification rules for student solutions in an ill-defined domain. We compare the viability of classification via simple statistical measures with that of classification trees induced by C4.5 and Genetic Programming.

"1. LARGO analyzes the student diagrams for structural, contextual and content-related “charac- teristics” which we use as the basic features of our current analysis. Each characteristic is defined by a particular graphical pattern that, if it matches some portion of a student’s diagram, identifies a possible structural weakness or opportunity for reflection. Figure 1. Sample LARGO Graph. The characteristics were developed with the help of an experienced legal instructor. For example UNLINKED_TEST, a context char- acteristic is active when the student has formed a test box in the graph but has not linked that box to the transcript. Such linking is a necessary part of good note-taking as it enables the students to reconnect their diagrams to the relevant parts of the arguments. The structural characteristic FACTS_ISOLATED_FROM_HYPOS is active when the student has produced a fact node but not linked it to the relevant hypothetical nodes. These diagram characteristics are associated with phases of graph production (1=orientation, 2=transcript markup, 3=diagram creation, 4=analysis, and 5=reflection). Characteristics of phases 1-3 can be thought of as indicating basic “weaknesses” of a diagram (e.g. UNLINKED_TEST), while characteristics of phases 4 and 5 stand for opportunities for reflection contained in a diagram. The system provides feedback on diagrams in the form of self-explanation prompts triggered by the characteristics. In the earlier phases these prompts inform the student about how to fix up the diagrams. In the later phases, the prompts encourage reflection on the diagram and argument representation. These hints are provided upon request. LARGO also contains a facility for collaborative feedback. For each case in the system we have identified two target test statements in the transcript. These are test statements that our do- main expert considered to be particularly crucial for the analysis process. Students who link a test node to one of these statements are given the opportunity to rate other students’ statements of the same test and to reconsider their own. Students who go through the process and whose tests are rated poorly by their peers are given the opportunity to change their own test in response. This characteristic is TEST_REVISION_SUGGESTED. It is active for students whose test has been rated poorly but have not changed the test statement. See [14] for a more detailed analysis of the help system and an argument example. We have completed three studies of the LARGO system. In the Fall of 2006 we conducted a study with paid volunteers taken from the first year class at the University of Pittsburgh’s School of Law (Novice-2006). Students were randomly assigned to analyze a pair of cases using LARGO or a text-based notepad tool with no feedback. We compared test scores between the groups and analyzed student interactions with the system. We found no overriding difference between the conditions, and close examination of the questions showed that some were too easy causing a ceiling effect. However on other question types lower aptitude students, as measured by their Law School Admission Test (LSAT) score (a frequently used predictor for success at law schools) in the LARGO condition, showed higher learning gains on some question types than their low-LSAT text peers. Also, the use of the help features was strongly correlated with learning [15]. In the Fall of 2007 we performed a follow-up study as part of the first year legal process course (Novice-2007). 

We have completed three studies of the LARGO system. In the Fall of 2006 we conducted a study with paid volunteers taken from the first-year class at the University of Pittsburgh's School of Law (Novice-2006). Students were randomly assigned to analyze a pair of cases using LARGO or a text-based notepad tool with no feedback. We compared test scores between the groups and analyzed student interactions with the system. We found no overriding difference between the conditions, and close examination of the questions showed that some were too easy, causing a ceiling effect. However, on other question types, lower-aptitude students in the LARGO condition, as measured by their Law School Admission Test (LSAT) score (a frequently used predictor of success at law school), showed higher learning gains than their low-LSAT text peers. Also, the use of the help features was strongly correlated with learning [15].

In the Fall of 2007 we performed a follow-up study as part of the first-year legal process course (Novice-2007). The study was mandatory for all 85 class members. As before, students were assigned randomly to text or graph conditions. However, the study included one additional case, and students answered case-specific essay questions after each session. We also replaced some questions from the pre- and post-tests that had produced a ceiling effect with more challenging alternatives. We again found no significant differences between conditions. A post-hoc analysis revealed that the students in the first study made far more use of the advice functions than the students in the second study, which may explain the difference between the study outcomes.

We are presently conducting a follow-up study with LARGO among third-year law students (Expert-2008). All participants in this study used LARGO and performed the same set of tasks as those in Novice-2007. The purpose of this study is to examine novice-expert differences in the context of LARGO. At the time of this paper a total of 17 third-year students have completed the study, and their data, along with data from the Novice-2007 study, are employed below.

C4.5 is a decision tree induction algorithm [16]. When presented with data it induces an n-ary decision tree that acts as a functional classifier. Each interior node of the tree represents a logical test that branches to one child or another based upon the outcome of the test. Leaf nodes represent the predictions or decisions made. Decision trees are traversed from root to leaf. Each decision path p from the root to a leaf node defines a class of cases based upon the relevant features and a classification tag to be assigned to those cases. Each tree therefore represents a hypothesis or test for carving up the space of diagrams according to the factors involved in each. One such decision tree, and its representation in pseudocode, are shown in Figure 2. As we shall discuss below, this tree predicts student scores as at or below the mean (0) or above the mean (1), where the inner nodes represent tests for the presence or absence of graph characteristics and the leaves, classifications. Each unique path from root to leaf in a decision tree defines a distinct class of objects; for our purposes these denote unique classes of student solutions. Intriguingly, TEST_REVISION_SUGGESTED is taken as a sign of high performance. As we noted above, this is a later-phase characteristic and will not be active unless the students have successfully marked up a target region of the diagram with a test node and then made use of the collaborative feedback, with their test summary having been given a poor rating by their peers.

Figure 2. Sample Decision Tree with Pseudocode.
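
Read from the description above and from the C4.5 results reported below, the Figure 2 tree amounts to the following; this Python rendering is our reconstruction of the figure's logic, not a verbatim copy:

```python
def classify_mean_bin(g):
    """Pruned C4.5 tree of Figure 2, as we read it from the text: returns
    1 (above the mean) or 0 (at or below it), given a dict mapping
    characteristic names to booleans, e.g. the output of encode()."""
    if not g["UNLINKED_TEST"]:
        return 1   # every test node is linked to the transcript
    if g["TEST_REVISION_SUGGESTED"]:
        return 1   # unlinked tests, but the student reached peer review
    return 0       # unlinked tests and no revision prompt
```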

Genetic Programming (GP) is a type of Evolutionary Computation (EC) [12, 2]. In EC algorithms a population of candidate problem solutions is evolved over time via selection, mutation, multiple-parent crossover and other operations inspired by biological reproduction. The field of EC arose initially out of work in artificial life [13] and has since been applied in a number of domains including design and decision making. EC is a stepwise algorithm that starts with a population of randomly generated or externally defined candidate solutions. The fitness of each individual is assessed either by comparison to a gold standard or by competitive "tournament" selection. Based upon this fitness, individuals are then permitted to pass their genetic code to the next generation by means of cloning, combination of genetic material with other fit members, or random mutation. The algorithm as a whole continues until an absolute fitness threshold is reached or a maximum number of generations has passed.

In genetic programming the individual members of the population are interpreted as function code, with their performance compared against a target function or task. For the purposes of our experiments the target function was the mapping φ : f → b_m of graph features f to the mean-score bin b_m discussed above. In this case the raw fitness is defined by the ratio of correct classifications to total classifications in the set.

One disadvantage of GP is the tendency of systems using unconstrained representations (like ours) to select for code that is not just successful but genetically robust. Such code is characterized by introns, redundant code elements that protect the core function from destructive crossover and slow the discovery of new solutions. This necessitates the use of parsimony pressure to control code growth. In this project we applied a scaled penalty based upon size: higher-performing trees were assessed a larger penalty than lower performers.

In this experiment we made use of two reproduction operators: mutation and crossover. Under mutation an individual is copied directly into the next generation with a sub-tree being replaced by a new, randomly generated sub-tree. Under crossover, two parents exchange randomly selected sub-trees, with the children being passed to the next generation. Members of the population at time t are selected for reproduction based upon their fitness. Some forms of GP select individuals proportionally according to their absolute fitness. However, this often results in extreme genetic drift toward initially fit individuals and reduces the selection pressure as σ_{f,t} goes down. We therefore employed sigma scaling to assign each individual i a reproductive fitness value of

Exp(i, t) = 1 + (f(i) − f̄_t) / (2σ_{f,t})

We then select individuals using Stochastic Universal Sampling [12], which ensures that each individual reproduces at least ⌊Exp(i, t)⌋ but no more than ⌈Exp(i, t)⌉ times. Taken together these measures prevent genetic drift by ensuring that selection pressure remains high even as the absolute fitness increases.
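
A compact sketch of this selection scheme follows, assuming a simple list-of-floats fitness representation; the handling of the zero-variance case and the clipping of negative expectations are our own simplifying conventions.

```python
import random

def sigma_scaled_expectations(fitnesses):
    """Exp(i, t) = 1 + (f(i) - mean) / (2 * sigma). When sigma is 0 every
    individual receives expectation 1; negative values are clipped to 0
    in this sketch."""
    n = len(fitnesses)
    mean = sum(fitnesses) / n
    sigma = (sum((f - mean) ** 2 for f in fitnesses) / n) ** 0.5
    if sigma == 0:
        return [1.0] * n
    return [max(0.0, 1 + (f - mean) / (2 * sigma)) for f in fitnesses]

def stochastic_universal_sampling(population, expectations):
    """One spin of n evenly spaced pointers over the expectation 'wheel',
    so each individual is selected between floor(Exp) and ceil(Exp) times
    once expectations are normalized to sum to the population size."""
    n = len(population)
    total = sum(expectations)
    scaled = [e * n / total for e in expectations]  # now sums to n
    start = random.random()                         # pointers spaced 1.0 apart
    selected, cumulative, i = [], 0.0, 0
    for k in range(n):
        p = start + k
        while i < n - 1 and cumulative + scaled[i] <= p:
            cumulative += scaled[i]
            i += 1
        selected.append(population[i])
    return selected
```

The selected parents would then be paired for crossover or copied with sub-tree mutation, as described above.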

As a machine learning algorithm, GP has a number of advantages. It is well suited to the evolution of arbitrary structures ranging from neural networks to object-oriented programs, which makes it attractive for our present purposes. However, GP also has a number of disadvantages. As a non-deterministic algorithm it makes fewer guarantees about its performance than a more bounded, biased and specialized algorithm such as C4.5. And while the statistical behavior of the system and the use of proper tuning work to prevent random drift, drift cannot be completely eliminated. It is also computationally costly; each of our runs required 12 hours of operation on a modern PC.

3 Results and Analysis

As stated in the introduction, our goal in this study is to examine the prospects for automatically inducing higher-order pedagogical knowledge from subject graphs. By analyzing subject graphs using machine learning methods we seek to identify potential target rules for the classification of successful and unsuccessful learners and to explore the interaction of the graph characteristics.

For purposes of this analysis we made use of the final graphs and post-test scores taken from the Novice-2007 and Expert-2008 studies. In both studies the subjects followed the same procedure and took the same tests. The graphs we analyzed were produced for two competing arguments in the Burnham v. Superior Court (495 U.S. 604) case, each of which was represented as a single unified set of graph characteristics as interpreted by LARGO. The post-test score was a single value representing the student's overall score. We elected to use the final graph students generated in the course of the study, as it was created as the culmination of the students' training and thus was most likely to be correlated with their final performance.

We were forced to remove some of the Novice-2007 subjects from our analysis as they took too little time on the post-test or too little time to read the cases (both indicating a lack of serious effort) or, in the case of four, because they candidly informed us that they were not trying to answer the questions. This left us with 34 students from the Novice-2007 study and 17 from the Expert-2008 study, giving us a total of 51 graph/test pairs. We binned the graph/test pairs by mean post-test score (0.63) into two groups: those above the mean, and those at or below it. Of the 51 students, 22 were at or below the mean while 29 were above it. Two of the Expert-2008 subjects fell below the mean score. Our χ², C4.5 and GP analyses below are based upon this grouping.

Statistical Comparisons: As we report in [10], simple statistical analyses of the graph features, such as the number of nodes or relations, do not correlate highly with the students' learning outcomes. This was true both for the full set of 51 subjects and for the study subgroups. While some of the measures do correlate with group membership (i.e., expert students produce more interconnected graphs than non-experts), they do not correlate with students' ultimate performance. Our analysis showed no overall correlation between the phase groups and student performance. Again, while there was some difference between the study groups, those differences were not significant.

However, once we binned the full set of graph results by mean score, a distinction emerged. In particular, two of the characteristics were significantly correlated with bin membership. UNLINKED_HYPO was significantly correlated with having a below-average score (χ²(1, N = 51) = 16.16, p < 0.001), as was UNLINKED_TEST (χ²(1, N = 51) = 18.27, p < 0.001). This highlights the importance of students' linking of tests and hypotheticals in their diagrams to the argument transcript. In addition, TEST_REVISION_SUGGESTED was marginally significantly correlated with high performance (χ²(1, N = 51) = 4.07, p < 0.05). As the reader will recall, this is a 'late phase' characteristic and requires some successful problem-solving steps to occur before it is active.

This strong correlation of the linking features with student performance fits our domain model. The connection of diagram elements to the argument transcript enables students to retain the context for each note and helps them to develop a "respect for the text" that is a goal of legal instruction.
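
Each of these tests is a simple 2×2 contingency analysis: characteristic present or absent against above the mean or not. A minimal sketch of the computation with scipy, using placeholder counts rather than the study's actual tables:

```python
# Sketch of one 2x2 chi-square test of the kind reported above.
# The counts below are placeholders, not the study's data.
from scipy.stats import chi2_contingency

# rows: characteristic absent / present; columns: above mean / at-or-below
table = [[24, 5],
         [5, 17]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}, N = {sum(map(sum, table))}) = {chi2:.2f}, p = {p:.4f}")
```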

The significance of TEST_REVISION_SUGGESTED prompted us to examine the student help behavior more closely. As you will recall, this characteristic is activated when a student has marked up the target region and completed collaborative feedback, but has not changed his or her test. From this we determined that very few of the students modified their test statements in response to this characteristic; thus students who reached this point are not sufficiently differentiated. This led us to conclude that additional effort must be made to motivate students' help usage, particularly in the later "reflective" phases of work.

C4.5: We split the datapoints into a 90/10 train-test split with 45 training cases (19 at or below the mean and 26 above) and 6 test cases (3 at or below and 3 above). We did not perform an iterative cross-validation, as our goal was to induce information with known algorithms, not to validate the algorithms on existing data. C4.5 produces the pruned tree shown in Figure 2. This tree successfully classifies 82.2% of the training cases and 100% of the test cases. Interestingly, the only graph features employed within it are UNLINKED_TEST and TEST_REVISION_SUGGESTED.

In many respects the tree supports the χ² analysis in highlighting the importance of transcript linking, particularly for test nodes. Students who link their test nodes to the transcript do well, while those who do not are split between students who receive a TEST_REVISION_SUGGESTED, and score above the mean, and those who do not. Thus students who perform well in other respects, by highlighting the key transcript region, summarizing it, and partially completing collaborative filtering, may be classified as above the mean even though they have not linked all of their test nodes.

GP: For the Genetic Programming experiments we employed the same train-test split as with C4.5. On the Mean classification task the evolutionary algorithm showed early success. As of generation 93 the system produced the Mean classification tree shown in Figure 3. This tree correctly classified 87% of training cases and 100% of test cases. Subsequent generations showed some improvement, with the system achieving 89% correct classification of training instances as of generation 659. However, the resulting trees were quite large, suggesting a problem of overfitting the data. Introns were already present at generation 93, as shown by the useless appearances of NO_ELEMENTS and ISOLATED_HYPO_DISCUSS, and the frequency of such code only increased as the process went on. Note also that both the UNLINKED_TEST and TEST_REVISION_SUGGESTED rules are present in this tree, but not UNLINKED_HYPO.

Figure 3. GP Mean Decision Tree.

In analyzing the tree shown in Figure 3 we will focus on the three classes of poor-performing students that it defines. The root node of the tree is DISTINGUISH_WITHOUT_TEXT. This characteristic is active when a student has noted that one node is distinguished from another (irrespective of node type) without giving a justification. As we noted above, distinctions, according to our model, are always motivated by some principled or factual justification with which the student should annotate the arc. As with the C4.5 tree, students who exhibit this characteristic alone are rated low unless they also exhibit TEST_REVISION_SUGGESTED.

Students who exhibit both DISTINGUISH_WITHOUT_TEXT and the FACTS_RELATIONS characteristic are classified as below the mean. This latter characteristic indicates that the students added arcs relating fact nodes to one another. Again, this is a violation of our model. Taken together these characteristics indicate a misunderstanding of the role of facts in the domain model, both in terms of how distinctions are drawn using facts and how nodes may be interrelated.

The third class is the set of students who do not exhibit DISTINGUISH_WITHOUT_TEXT but do exhibit UNLINKED_TEST and TEST_FACTS_RELATION_SPECIFIC. This latter characteristic is active when the student has constructed a specific relation (e.g., "Modified To") to the facts of the case. This is an example of the system providing more novel pedagogical information. While our domain model endorses the use of general relationships between the test and fact nodes, it is clear that some good students, who leave tests unlinked, also choose to use specific relations for test and fact nodes. This may signal a valid alternative to our model that bears further exploration.
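
As with the C4.5 tree, our reading of the Figure 3 tree, with the intron tests such as NO_ELEMENTS stripped out, can be rendered as a short Python sketch; the exact branch ordering is an assumption reconstructed from the three classes described above, not a verbatim copy of the figure:

```python
def gp_mean_classifier(g):
    """Our reconstruction of the GP tree of Figure 3, introns removed:
    returns 1 (above the mean) or 0 (at or below it), given a dict of
    boolean characteristics. Branch order is assumed, not verbatim."""
    if g["DISTINGUISH_WITHOUT_TEXT"]:
        if g["FACTS_RELATIONS"]:
            return 0   # class 2: fact-to-fact arcs violate the model
        if g["TEST_REVISION_SUGGESTED"]:
            return 1   # rescued, as in the C4.5 tree
        return 0       # class 1: unjustified distinctions alone
    if g["UNLINKED_TEST"] and g["TEST_FACTS_RELATION_SPECIFIC"]:
        return 0       # class 3: unlinked tests with specific fact relations
    return 1
```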

In this paper we assessed the potential of inspectable machine learning methods to induce useful domain information from student work. Our goal was to demonstrate the potential of these methods to yield useful insights into the quality of student solutions, the tutoring system's behavior, and the domain itself. The results we describe above have led us to conclude that these methods do hold potential for domain exploration, but not without some measure of guidance.

Our statistical analysis highlighted three salient graph characteristics, one of which demonstrated how the use of the system by well-performing students was at odds with our desires. However, apart from that it validated our domain model. The use of C4.5 further confirmed the above results and helped to validate some of our domain assumptions, but otherwise did not yield much in the way of new information. GP, by contrast, yielded a set of classification trees, one of which we presented here. Examination of this tree yielded useful information both about student misconceptions and about student divergence from our domain model. This information has led us to consider alterations to the advice system and a reconsideration of some aspects of our domain model.

These results lead us to conclude that the use of GP to induce "inspectable" classifiers is a fruitful method of data extraction for both behavioral and pedagogical information. We believe that this process is especially useful in ill-defined domains where the relationship among individually detectable solution characteristics is not clear and the means for assessing them are open for debate. In cases such as these, the use of inspectable post-hoc classification has been shown to reveal useful insights.

We plan to expand upon this work by moving from the present high-level graph classification to the induction of both lower-level graph characteristics and student classifiers that track performance over the course of the study, again with the goal of identifying useful pedagogical and performance information. At the same time we plan to combine these automatic insights with expert human grading. This summer we will engage the services of law school instructors to grade the student graphs. We will then use this data to compare our assessment of good and poor students with theirs and make use of their data to train further classifiers. This will enable us to check the value of our present grading mechanism and to provide expert-level analysis to augment our existing classifications.
