Acquiring Item Difficulty Estimates: a Collaborative Effort of Data and Judgment

InProceedings

The evolution from static to dynamic electronic learning environments has stimulated research on adaptive item sequencing. A prerequisite for adaptive item sequencing, in which the difficulty of the item is constantly matched to the knowledge level of the learner, is to have items with a known difficulty level. The difficulty level can be estimated by means of item response theory (IRT), as is often done prior to computerized adaptive testing. However, the requirement of this calibration method is not easily met in many practical learning situations, for instance because of the cost of prior calibration and the continuous generation of new learning items. The aim of this paper is to search for alternative estimation methods and to review the accuracy of these methods as compared to IRT-based calibration. Using real data, six estimation methods are compared with IRT-based calibration: proportion correct, learner feedback, expert rating, paired comparison (learner), paired comparison (expert) and the Elo rating system. Results indicate that proportion correct has the strongest relation with IRT-based difficulty estimates, followed by learner feedback, the Elo rating system, expert rating and finally paired comparison.

"1. INTRODUCTION. Most e-learning environments are static, in the sense that they provide for each learner the same information in the same structure using the same interface. One of the recent tendencies is that they become dynamic or adaptive. An adaptive learning environment creates a personalized learning opportunity by incorporating one or more adaptation techniques to meet the learners’ needs and preferences (Brusilovsky 1999). One of those adaptation techniques is adaptive curriculum/item sequencing, in which the sequencing of the learning material is adapted to learner-, item-, and/or context characteristics (Wauters, Desmet & Van den Noortgate 2010). Hence, adaptive item sequencing can be established by matching the difficulty of the item to the proficiency level of the learner. Recently, the interest in adaptive item sequencing has grown, as it is found that excessively difficult items can frustrate learners, while excessively easy items can cause learners to lack any sense of challenge (e.g. Pérez-Marín, Alfonseca & Rodriguez 2006, Leung & Li 2007). Learners prefer learning environments where the item selection procedure is adapted to their proficiency, a feature which is already present to a certain extent in computerized adaptive tests (CATS; Wainer 2000). A prerequisite for adaptive item sequencing is to have items with a known difficulty level. Therefore, an initial development of an item bank with items of which the difficulty level is known is needed. This item bank should be large enough to include at any time an item with a difficulty level within the optimal range that has not yet been presented to the learner. In CAT, the item response theory (IRT; Van der Linden & Hambleton 1997) is often used to generate such a calibrated item bank. IRT is a psychometric approach that emphasizes the fact that the probability of a discrete outcome, such as the correctness of a response to an item, is function of qualities of the item and qualities of the person. Various IRT models exist, differing in degree of complexity, with the simplest IRT model stating that a person’s response to an item depends on the person’s proficiency level and the item’s difficulty level. More complex IRT models include additional parameters, such as an item discrimination parameter and a guessing parameter. Obtaining a calibrated item bank with reliable item difficulty estimates by means of IRT requires administering the items to a large sample of persons in a non-adaptive manner. The sample size recommended in the literature varies between 50 and 1000 persons (e.g. Kim 2006, Linacre 1994, Tsutakawa & Johnson 1990). Because IRT has been a prevalent CAT approach for decades, it seems logical to apply IRT for adaptive item sequencing in learning environments that consist of simple items. However, the difference in data gathering procedure of learning and testing environments has implications for IRT application in learning environments. In many learning environments, the learners are free to select the item they want to make. This combined with the possibly vast amount of items provided within the learning environment leads to the finding that many exercises are only made by few learners (Wauters et al. 2010). Even though IRT can deal with structural incomplete datasets (Eggen 1993), the structure and huge amount of missing values found in the tracking and logging data of learning environments can easily lead to non-converging estimations of the IRT model parameters. 
2.1.2 Proportion Correct

A simple approach to estimating the difficulty level of items is to calculate the proportion of correct answers by dividing the number of learners who have answered the item correctly by the number of learners who have answered the item. To obtain the item difficulty parameter, the proportion correct score is converted as follows:

$$\beta_i = \ln\left(\frac{N_i - n_i}{n_i}\right) \qquad (2)$$

where βi denotes the item difficulty level of item i, ni represents the number of learners who have answered item i correctly, and Ni represents the number of learners who have answered item i. The advantage of this approach is that the item difficulty can be calculated online, because the simple formula requires few computational resources. Furthermore, the item difficulty can be updated after each administration. The lower the proportion of students who have answered the item correctly, the more difficult the item is. Johns, Mahadevan and Woolf (2006) compared the item difficulty level obtained by IRT estimation with the percentage of students who answered the item incorrectly, and found a high correlation (r = 0.68).
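A minimal sketch of this conversion, assuming the logit form of equation (2); the function name and the example counts are illustrative:

```python
import math

def proportion_correct_difficulty(n_correct, n_answered):
    """Convert a proportion-correct score into a difficulty estimate,
    beta_i = ln((N_i - n_i) / n_i), following equation (2)."""
    if n_correct == 0 or n_correct == n_answered:
        raise ValueError("difficulty is undefined when no or all answers are correct")
    return math.log((n_answered - n_correct) / n_correct)

# Toy example: 30 of 80 learners answered the item correctly.
print(proportion_correct_difficulty(30, 80))   # > 0, i.e. a relatively difficult item
```

Because the estimate depends only on two running counts, it can be recomputed after every response, which is what makes the method attractive for online use.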
2.1.3 Learner Feedback

Some researchers have used learner feedback to provide adaptive sequencing of courseware in e-learning environments (e.g. Chen, Lee & Chen 2005, Chen, Liu & Chang 2006, Chen & Duh 2008). After a learner has studied a particular piece of course material, he is asked to answer two simple questions: "Do you understand the content of the recommended course material?" and "What do you think about the difficulty of the course material?". After a learner has given feedback on a 5-point Likert scale, the scores are aggregated with those of other learners who previously answered this question by taking the average of the scores. The new difficulty level of the course material is based on a weighted linear combination of the course difficulty as defined by course experts and the course difficulty determined from the collaborative feedback of the learners. The difficulty parameters slowly approach a steady value as the number of learners increases.

In this study the procedure of Chen et al. (2005) for adjusting the difficulties of the items is slightly altered. The course difficulty as defined by course experts is not taken into account. Instead, the difficulty estimates are based solely on the collaborative feedback of the learners. After an item is presented, the learner is asked the feedback question "How difficult did you find the presented item?". The learner answers on a 5-point Likert scale (Likert 1932), ranging from -2 ("very easy") over -1 ("easy"), 0 ("moderate") and 1 ("difficult") to 2 ("very difficult"). The item difficulty based on learner feedback is then given by the arithmetic mean of the scores.

2.1.4 Paired Comparison

Another method, already used in CAT, to estimate the difficulty level of new items is paired comparison (Ozaki & Toyoda 2006, 2009). In order to prevent content leaking, experts are asked to assess the difficulty of items through one-to-one or one-to-many comparison. In this method, items for which the difficulty parameter has to be estimated are compared with multiple items of which the item difficulty parameter is known. The underlying idea behind this item difficulty estimation approach is Thurstone's paired comparison model. While Thurstone (1994) modelled the preference judgment for object i over object j, Ozaki and Toyoda (2006, 2009) modelled the difficulty judgment of item i over item j. In this study a procedure similar to the one employed by Ozaki and Toyoda (2009) is adopted to estimate the difficulty level by means of paired comparison. After an item is presented, the learner has to judge where the presented item should be located in a series of 11 items ordered by difficulty level from easy to difficult. This means that the raters have to make a one-to-many comparison with 11 items of which the item difficulty parameter is known. The probability that item i is judged more difficult than item 1 by the N raters is expressed as:

$$P_{i>1} = \frac{\exp(\beta_i - b_1)}{1 + \exp(\beta_i - b_1)} \qquad (3)$$

where βi is the difficulty of item i as judged by the raters and b1 is the difficulty parameter of item 1 as estimated by the preliminary IRT analysis conducted by Selor. In this study 11 items are presented simultaneously and the raters have to select one out of 12 categories: i < 1, 1 < i < 2, ..., 10 < i < 11, 11 < i.
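The excerpt does not spell out how the placements are turned into a difficulty estimate, so the following Python sketch is only one plausible reading: each placement into category k is treated as the judgment that item i is more difficult than the k easiest anchor items and easier than the rest, and the difficulty is obtained by maximizing the resulting logistic likelihood of equation (3) by gradient ascent. Function names, the learning rate and the toy data are assumptions, and the model actually used by Ozaki and Toyoda may differ.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def difficulty_from_placements(anchor_betas, placements, n_iter=200, lr=0.1):
    """Estimate the difficulty of a new item from one-to-many comparisons.

    anchor_betas: known difficulties of the 11 reference items, easy to hard.
    placements:   per rater, the chosen category k in 0..11, read as
                  "more difficult than the first k anchors, easier than the rest".
    Illustrative sketch only."""
    anchor_betas = np.asarray(anchor_betas, dtype=float)
    beta = 0.0
    for _ in range(n_iter):
        grad = 0.0
        for k in placements:
            judged_harder = np.zeros(len(anchor_betas))
            judged_harder[:k] = 1.0                   # implied pairwise outcomes
            p = logistic(beta - anchor_betas)         # model probabilities, eq. (3)
            grad += np.sum(judged_harder - p)         # log-likelihood gradient
        beta += lr * grad / len(placements)           # averaged gradient ascent step
    return beta

# Toy example: 11 anchors spaced from easy to hard, three raters place the
# new item around the middle of the ordered series.
anchors = np.linspace(-2.5, 2.5, 11)
print(difficulty_from_placements(anchors, placements=[5, 6, 5]))
```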
