Effort-based Tutoring: An Empirical Approach to Intelligent Tutoring


We describe pedagogical and student modeling based on past student interactions with a tutoring system. We model student effort with an integrated view of student behaviors (e.g., timing and help requests in addition to success at solving problems). We argue that methods based on this integrated and empirical view of student effort at individual items accurately represent the real way that students use tutoring systems. This integrated view helps to discern factors that affect student behavior beyond cognition (e.g., help misuse due to meta-cognitive or affective flaws). We specify the parameters of the pedagogical model in detail.

"1.1 Wayang Outpost: A Mathematics Tutoring System. Wayang Outpost is a software tutor that helps students learn to solve standardized- test type of questions, in particular for a math test called the Scholastic Aptitude Test, and other state-based exams taken at the end of high school in the USA. This multimedia tutoring system teaches students how to solve geometry, statistics and algebra problems of the type that commonly appear on standardized tests. To answer problems in the Wayang interface, students choose a solution from a list of multiple choice options, providing immediate feedback on students’ entries and offering hints that students can accept or reject. Students are encouraged to ask the tutor for hints that are displayed in a progression from general suggestions to bottom-out solution. In addition to this domain-based help, the tutor currently provides a variety of affective and meta-cognitive feedback, delivered by learning companions designed to act like peers who care about a student's progress and offer support and advice [1][8]. Both decisions about content sequencing and characters response are based on a model of student effort, used to assess the degree of cognitive effort a student invests to develop a problem solution, described in the next sections. 2 Modeling and Acting Upon Student Effort. We start by estimating the expected behavior that a student should have on a problem based on three indicators of effort: 1) number of attempts to solve a problem; 2) number of hints requested for a problem; 3) time required to solve a problem. These are three orthogonal axes that help understand student effort. The only pre-processing done for this data set was to use data corresponding to “valid” student users (instead of test users), and discarding outliers just for the “time” variables. Figure 1 shows examples of problem solving behavior for nearly 600 students in one problem. This one problem may seem evidently too easy at first glance, as the majority of students made zero or few incorrect attempts, saw no hints, and solved the problem in less that 5 seconds. However, this is not the case. It is common to find problem-student interaction instances where students spend little time and effort. It is also common that students under-use the help in the system. We find it essential to take into account that this is the real way that students use the tutoring system, and we need to take into account what are likely student behaviors when considering how to adjust instruction and the presentation of the material to students. Note that the distributions are not normal, but more similar to Chi-Square distributions. Figure 1. Distribution of attempts, hints and seconds in one problem. Expected and delta values. The combination of mistakes, hints and time as shown in Figure 1 will allow to estimate higher-level scenarios of mastery or disengagement, see Table 1. For each of the hundreds of problems or practice items in an intelligent tutor, we compute the median (or the sample mean after discarding the top 10 percentile, which was a good approximation in our data and much easier to compute using SQL) and standard deviation for the whole population of students. This median or mean is considered the expected value, i.e. the expected number of incorrect attempts for a problem pi (E(Ii)) where i=1…N, and N=total practice items in the tutoring system. Expected hints seen is E(Hi) and time required to solve the problem is E(Ti). 
We also define two delta values for each of E(Ii), E(Hi) and E(Ti), a total of six delta values per problem pi (see Figure 1). Each delta represents a fraction of the standard deviation, regulated by two parameters, θLOW and θHIGH, in the interval [0,1]. For example, if θLOW=1/4 and θHIGH=1/2, then δIL = θLOW·SD(Ii) = SD(Ii)/4 (a fourth of the standard deviation of Ii) and δIH = θHIGH·SD(Ii) = SD(Ii)/2 (half of the standard deviation of Ii). θLOW and θHIGH are the same for all problems in the system. These values help define what is "expected behavior" for a practice item within the tutoring system. Note that the notation for δ values has been simplified.

2.1 Pedagogical Decisions based on Student Effort

The main benefit of an effort model based on different orthogonal axes of behavior (hints, time and correctness) is that it can help researchers discern between behaviors related to student engagement (affective) and behaviors related to help misuse (meta-cognitive or affective), in addition to behaviors related to cognitive mastery. Table 1 shows the estimations of the most likely scenarios made by the pedagogical model in Wayang Outpost, and the pedagogical decisions made in terms of content difficulty, plus other pedagogical moves related to affective and meta-cognitive feedback. Note that disengagement (e.g. rows 3 and 5) produces a reduction in problem difficulty, based on the assumption that if a student is not working hard enough on the current problem, they probably won't work hard on a similar or harder problem. However, the key intervention is that the learning companions deemphasize the importance of immediate success.

Table 1. Empirical-based estimates of effort at the recently completed problem lead to adjusted problem difficulty and other affective and meta-cognitive feedback.

The retrieval of an increased-difficulty item is based on a function Harder(H[1..m], γ) that returns a problem of higher difficulty. H is a sorted list of m practice items the student has not yet seen, all harder than the one the student has just worked on; H[1] is the item of lowest difficulty, H[m] is the item of highest difficulty, and γ is a natural number greater than zero. The problem returned by Harder is specified in Eq. 1. For example, Harder with γ=3 will return the problem at the 33rd percentile of items in list H[1..m].

$\mathrm{Harder}(H[1..m], \gamma) = H\!\left[\lceil m/\gamma \rceil\right]$   (1)

Similarly, a problem of lesser difficulty is selected with the function Easier(E[1..n], γ), where E is a sorted list of n problem items, all easier than the problem just seen by the student; E[1] is the item of lowest estimated difficulty and E[n] is the item of highest difficulty. Eq. 2 shows Easier as a function of n and γ. Easier with γ=3 will return the item at the 66th percentile of items in list E[1..n].

$\mathrm{Easier}(E[1..n], \gamma) = E\!\left[\lceil n\,(1 - 1/\gamma) \rceil\right]$   (2)

Both Easier and Harder work upon the assumption that there are easier or harder items to choose from. The next section addresses what happens when m=0 or n=0.
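A minimal sketch of the two selection functions follows, under the ceiling-based reading of Eq. 1 and 2 above; the exact indexing used by the tutor is not given in the text, so that detail is an assumption.

```python
import math

def harder(H, gamma):
    """Pick a harder problem from H, sorted from lowest to highest difficulty.

    With gamma=3 this returns the item roughly at the 33rd percentile of H,
    i.e. only moderately harder than the problem just solved (cf. Eq. 1).
    """
    m = len(H)
    if m == 0:
        return None                                  # no harder item available
    return H[max(math.ceil(m / gamma), 1) - 1]       # 1-based index in the text

def easier(E, gamma):
    """Pick an easier problem from E, sorted from lowest to highest difficulty.

    With gamma=3 this returns the item roughly at the 66th percentile of E,
    i.e. only moderately easier than the problem just solved (cf. Eq. 2).
    """
    n = len(E)
    if n == 0:
        return None                                  # no easier item available
    return E[max(math.ceil(n * (1 - 1 / gamma)), 1) - 1]
```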
2.2 Progression through Knowledge Units

In Wayang Outpost, the curriculum is organized in a linear set of topics or knowledge units (KUs), a classification of problems into sets of items that involve similar skills (e.g. polygon perimeter measurement problems). Pedagogical decisions about content sequencing are made at two levels: within a topic and between topics, skills or knowledge units. This section addresses between-topic decisions.

The criterion for "chunking" problems into knowledge units is based on the idea that similar problems should be seen close to each other, to maximize the transfer of what a student has learned while the concepts are still in working memory to be applied to the next cognitive transfer task. Cognitive effort is thereby reduced, and the likelihood of applying a recently learned skill to the next task is enhanced. Each knowledge unit may be defined at a variety of levels and is composed of a variety of problems involving a set of related skills. For instance, within the "Statistics" topic, a student may be presented with problems about finding the median of a set of numbers, or deciding whether the mean or the median is larger from a picture of a stem-and-leaf plot. While skills overlap, not all problems within a topic involve the same skills, and their difficulties may vary to a large degree. Topics are arranged according to pre-requisites (problems presented in KU2 will not include skills introduced in KU3). When a topic begins, students are presented with an explanation of the kinds of problems that will follow, generally introduced aloud by animated pedagogical characters. Sometimes this involves an example problem, accompanied by a worked-out solution via multimedia features.

Figure 2. Spiral curriculum in which Knowledge Units are ordered according to pre-requisites.

Table 2. Conditions for topic switching in Wayang Outpost.

A student progresses through these knowledge units depending on a variety of criteria beyond cognitive mastery, specified in Table 2. For instance, condition 2.2 shows how a topic switch may be forced based on limitations of content: the system failed FKU times to find a problem of the difficulty it believes the student should get for the topic. If the pedagogical model suggests the student should receive a harder problem, but there are no harder problems remaining, then a counter of failures for the current topic is increased; as long as failures < FKU, an easier problem is provided instead (see the sketch below). Another possibility is condition 2.3, where the teacher has allocated a specific amount of time for the student to study or review a certain topic.
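To make the interplay between within-topic selection and condition 2.2 concrete, here is a hypothetical sketch. The function, its arguments and its return convention are illustrative; only the failure-counter logic comes from the description above, and the harder/easier helpers are the ones sketched in Section 2.1.

```python
def select_within_topic(unseen_difficulties, last_difficulty, wants_harder,
                        failures, gamma, F_KU):
    """Illustrative sketch of condition 2.2 (content limitation).

    unseen_difficulties: difficulty estimates of problems not yet seen in the topic.
    Returns (chosen_difficulty, updated_failures, switch_topic). Names and
    structure are hypothetical, not the tutor's actual code.
    """
    harder_list = sorted(d for d in unseen_difficulties if d > last_difficulty)
    easier_list = sorted(d for d in unseen_difficulties if d < last_difficulty)
    if wants_harder:
        if harder_list:
            return harder(harder_list, gamma), failures, False
        failures += 1                          # no harder item left in this topic
        if failures >= F_KU:
            return None, failures, True        # condition 2.2: force a topic switch
    if easier_list:
        return easier(easier_list, gamma), failures, False
    return None, failures, True                # nothing suitable left: switch topic
```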
2.3 Problem Difficulty Estimates

The pedagogical model must be able to estimate problem difficulty in order to assign problems to students in specific scenarios. We identify two facets of problem difficulty in intelligent tutors. From the perspective of a knowledge engineer, problems have an objective difficulty (e.g., based on the number of skills and steps involved in each problem). However, students may perceive each problem differently, according to a student perceived difficulty (SPD). While objective problem difficulty should be similar to SPD, they are not necessarily the same. Proper estimation of problem difficulty is essential for this pedagogical model, and it is not possible with simple Item Response Theory because tutoring involves more dimensions (help, engagement) than testing (accuracy). We capture SPD from three independent sources of evidence of students' effort to solve a problem: 1) correctness, in terms of the number of attempts required to solve the problem (random variable Ci); 2) amount of time spent on the problem (random variable Ti); 3) amount of help required or requested to solve the problem correctly (random variable Hi). We define the problem difficulty di for a practice activity component i in Eq. 3 as the mean of these three factors: attempts to solve, time and help needed.

$d_i = \frac{d_{ci} + d_{ti} + d_{hi}}{3}$   (3)

where dci is the difficulty factor in terms of correctness, dti is the difficulty factor in terms of time, and dhi is the difficulty factor in terms of help needed. Alternatively, the three factors might be given different weights, to emphasize them differently. di, dci, dti and dhi are normalized values in the interval [0,1] and express SPD. Eq. 4, 5 and 6 show how each of the three difficulty factors is computed.

$d_{ci} = \frac{E(I_i)}{\max_{j=1 \ldots N} E(I_j)}$   (4)

$d_{ti} = \frac{E(T_i)}{\max_{j=1 \ldots N} E(T_j)}$   (5)

$d_{hi} = \frac{E(H_i)}{\max_{j=1 \ldots N} E(H_j)}$   (6)

dci (Eq. 4) is the expected value of Ii (the number of incorrect attempts while trying to solve problem pi) across all students who have seen that problem, divided by the maximum E(Ij) registered for any problem pj in the system (N is the total number of problems or practice activities in the system). Similarly, dti (Eq. 5) is the expected value of Ti (time spent on problem pi), also normalized; this expected time is the mean value after removing outliers, or the median. dhi (Eq. 6) is the expected value of Hi (the number of hints for problem pi) divided by the maximum E(Hj) registered.
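Continuing the earlier sketch, Eq. 3-6 translate directly into a few lines over the per-problem statistics; the column names (E_I, E_T, E_H) are again hypothetical and assume the output of the expected_effort() sketch above.

```python
import pandas as pd

def perceived_difficulty(stats: pd.DataFrame) -> pd.DataFrame:
    """Student perceived difficulty per problem (Eq. 3-6).

    Each factor is a problem's expected effort divided by the maximum expected
    effort registered for any problem, so d_c, d_t, d_h and d all lie in [0, 1].
    `stats` is assumed to carry the E_I, E_T, E_H columns of the earlier sketch.
    """
    out = stats.copy()
    out["d_c"] = out["E_I"] / out["E_I"].max()   # correctness factor (Eq. 4)
    out["d_t"] = out["E_T"] / out["E_T"].max()   # time factor (Eq. 5)
    out["d_h"] = out["E_H"] / out["E_H"].max()   # help factor (Eq. 6)
    out["d"] = (out["d_c"] + out["d_t"] + out["d_h"]) / 3.0   # SPD (Eq. 3)
    return out
```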
2.4 Accuracy of Item Difficulty Estimations

We computed SPD estimates using a data set of 591 high school students who used the Wayang Outpost tutoring software over past years, from 2003 until 2005. The tutors employed a variety of problem selectors during those years, with some percentage of students using a random problem selector. Validating that the student perceived difficulty estimates were reasonable seemed essential. The first reason is that the difficulties play a crucial role in the adaptive behavior of the tutor, and inappropriate difficulties would make the system behave in undesired ways (e.g. providing a harder problem when the student clearly needs an easier one). The second reason is that it is quite likely that the student perceived difficulty estimates are biased, because student behavior is contingent on the problem selector in place at the moment the data on problem performance was collected. Unless the raw data comes from a random selection of problems, student behavior, and thus the data collected, will be biased in some direction. This will make problems look easier or harder than they truly are.

We devised a variety of methods to assess the correctness of our estimation of student perceived difficulty, and implemented three of them. All of these are based on the following axiom: "Pairs of similar problems should have similar problem difficulty estimates". In other words, if two problems are very similar, the perceived difference in their difficulty should approach zero. We subsequently drew a subset of 60 mathematics problems (p1 to p60) from our tutoring system. These sixty problems are special because they may be divided into 30 pairs of problems, where each pi, with i=1…30, is extremely similar to p30+i. In this domain of geometry problems, similar problems showed similar graphics with slightly different angles or measurements, for example the same problem with a rotated figure (and different operands). Similar problems involve applying the same skills the same number of times. We call these highly similar pairs, and we now describe four criteria used to verify that these pairs are similar in their difficulty estimates.

2.4.1 Criterion 1: Correlations

We tested whether such pairs had similar difficulty estimates with a simple Pearson correlation, the most familiar measure of dependence between two quantities, obtained by dividing the covariance of the two variables by the product of their standard deviations. A Pearson correlation determined that the difficulties of paired problems were significantly correlated (N=30, R=.823, p<.001), so this test is passed.

2.4.2 Criterion 2: Mean Squared Error

Another criterion used was that the difference in difficulty between highly similar problems should be smaller than the difference in difficulty between either of these problems and any other problem in the system that is not as similar; other problems will involve different skills, or a different total number of applications of the same skills. While it may be coincidental that a problem foreign to the pair has a very similar difficulty to either problem in the pair, this should not be the general case. The distance between the difficulty of a problem pi and that of its highly similar pair p30+i should be smaller than the mean distance between one of the problems in the pair and the remaining problems in the set; in more familiar terms, this is a mean squared error comparison. Eq. 7 rephrases the above in terms of squared differences, where N is the total number of pairs (30).

$\left(d_i - d_{30+i}\right)^2 \;<\; \frac{1}{2N-2} \sum_{j \neq i,\; j \neq 30+i} \left(d_i - d_j\right)^2, \quad i = 1 \ldots N$   (7)

If we can show that this inequality holds in general for problems, we have some evidence that our system is doing a reasonable job of estimating difficulties. We computed the 30 squared differences and their corresponding mean squared differences as specified in Eq. 7. The result was that the inequality holds in 29 of the thirty cases, a 97% success rate. A paired-samples t-test for the two inequality terms revealed that the two sides of Eq. 7 are significantly different, t(29)=7.35, p<.001. The second test is then passed.

2.4.3 Criterion 3: Human Expertise

While pairs of highly similar problems should have similar student perceived difficulty levels, they do not necessarily have exactly the same difficulty, i.e. the difference in their difficulty levels will not be exactly zero, but rather some small number. While it would be hard to determine the true value of this difference for each problem pair, an expert human eye (e.g. a teacher or tutor) can often make good predictions about which problem in a pair should be the harder one. This kind of expert knowledge can help us establish that one problem of a pair should be harder for a student to solve than the other; such restrictions may have to do with operand size, the involvement of decimals or negative numbers, or a small extra step. We managed to establish such restrictions for 21 of the 30 pairs of problems we considered; the other 9 were just too similar to each other. These restrictions (true positives or true negatives) were correctly reflected by the difficulty estimates in 14 of the 21 cases (67%), and a Chi-Square test revealed this is significantly better than chance (Pearson Chi-Square=5.25, p=.022). Thus, the third test is passed.

2.4.4 Criterion 4: Convergence

Ideally, the difference between highly similar pairs of problems would converge to zero as more data arrives in the logs, even if different problem selectors are in place at different moments. This test is still ongoing.
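The first two criteria can be checked mechanically. The sketch below assumes the 60 difficulty estimates are stored so that the first 30 entries are paired, in order, with the last 30 (0-based here, 1-based in the text); it is illustrative, not the analysis code used in the study.

```python
import numpy as np
from scipy.stats import pearsonr

def check_pair_criteria(d, n_pairs=30):
    """Check Criteria 1 and 2 on a vector of difficulty estimates.

    d holds 2*n_pairs estimates ordered so that problem i is paired with
    problem n_pairs + i. Returns the pair correlation and p-value (Criterion 1)
    and the fraction of pairs for which the Eq. 7 inequality holds (Criterion 2).
    """
    d = np.asarray(d, dtype=float)
    first, second = d[:n_pairs], d[n_pairs:2 * n_pairs]
    r, p = pearsonr(first, second)                       # Criterion 1

    holds = 0
    for i in range(n_pairs):
        within = (d[i] - d[n_pairs + i]) ** 2            # distance inside the pair
        others = np.delete(d, [i, n_pairs + i])          # the remaining 58 problems
        between = np.mean((d[i] - others) ** 2)          # mean squared distance
        holds += int(within < between)                   # Eq. 7
    return r, p, holds / n_pairs
```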
2.5 Evaluation of Effectiveness of the Effort-Based Pedagogical Model

While we may be satisfied that item difficulties are reasonably estimated, we also need to show that the adaptive mechanism underlying the pedagogical model makes a difference to student learning. A study was carried out in the 2003-2004 academic year with 60 students to evaluate the effectiveness of the adaptive sequencing of problems, compared to a random selection of problems within a topic (no learning companions or affective feedback were used).

Both the experimental and the control conditions implemented topic switching based on one parameter only, NKU, so that the "topic switch" criterion was a fixed maximum number of problems per topic. This was established so that all students were exposed to the same number of problems in each topic; MKU, FKU and TKU were ignored. The main difference between conditions was the problem selection mechanism within a topic. The experimental condition adjusted problem difficulty as described in the previous sections, with the following parameters: γ=2, θLOW=0, θHIGH=0; this made the changes in problem difficulty quite marked. Control condition students received random problems within each topic. Students were randomly assigned to either the effort-based adaptive problem selection condition or the random problem selection condition. Students used the Wayang tutoring system for 4 class periods, completing a 10-item math test before starting and a similar posttest on the last day. The tests consisted of items drawn from the SAT (Scholastic Aptitude Test) and released by the College Board. The two tests were counterbalanced: half of the students received pretest A and half pretest B, and the tests were swapped at posttest time. We measured the total number of correct items on the test, and accuracy (correct / test items attempted), as measures of performance (see Table 3). We obtained full pretest and posttest data for 56 students: 23 in the experimental adaptive condition and 33 in the control condition. Table 3 shows the mean and standard deviation of pretest and posttest scores. Mean achievement in the posttest increased and standard deviations decreased for both groups. However, the mean improvement was higher for the experimental adaptive problem selection group (Figure 3).

Figure 3. Pre- to posttest improvement with the effort-based pedagogical model compared to a random problem selector within the topic.

Table 3. Pretest and posttest scores in the math test.

This difference is significant (ANCOVA for posttest score with pretest score as a covariate, group effect F(55,1)=8.4, p=.006). The group receiving adaptive effort-based pedagogical decisions about problem difficulty improved more than the group receiving the random problem selection control condition. We conclude that adaptive problem selection is better than random selection.

3 Summary

This paper presented a novel approach to the development of smart learning environments, based on empirical measures of student effort at individual items. It described a pedagogical model that uses empirical estimates of problem difficulty, specifying parameters that regulate behavior within knowledge units (γ, θLOW and θHIGH) and between knowledge units (MKU, FKU, TKU, NKU). Knowledge units may be defined at different levels of abstraction, thus addressing restrictions of content. This allows for replication in other ILEs, even in ill-defined domains or in small ILEs that are trying to encode smart decisions about practice items or activity selection. We have described criteria for evaluating that estimates of problem difficulty are not too biased toward the problem selector in place at the time of data collection. Last, we have shown that this effort-based pedagogical model leads to improved learning compared to uninformed random decisions within a topic or knowledge unit.
