Mining Student Behavior Patterns in Reading Comprehension Tasks


Reading comprehension is critical in life-long learning as well as in the workplace. In this paper, we describe how multidimensional k-means clustering combined with Bloom's Taxonomy can be used to determine positive and negative cognitive skill sets with respect to reading comprehension tasks. This information could be used to inform environments that support students in improving their meta-cognitive skills.

"1. INTRODUCTION. Anderson and Pearson [4] in their seminal work on reading comprehension describe three different cases where reading comprehension is a problem. First, a person having difficulty reading is likely to have gaps in knowledge. Prior knowledge is necessary in the determination of what he/she can currently comprehend. Second, the reader can have an incomplete understanding of the relationships that exist among facts on a certain topic. Since the current knowledge base is used to create all of the relationships on a topic, new arbitrary information can be a source of confusion, slow learning and slow processing, which leads to unsatisfactory reasoning. Third, readers are unlikely to be able to make correct inferences about the material in order to arrive at a coherent overall representation of the topic. The creation of a coherent representation for a topic requires the drawing of precise, integrated inferences. Often poor readers do not perform these tasks either routinely or spontaneously [8]. Any reading comprehension tools or models need to be able to address these problems with deep comprehension. The reading strategy instruction method is one of the most often suggested methods for enhancing reading ability [18, 20]. This particular method deals with problems on the vocabulary and sentence levels [2], and on higher level issues such as text comprehension [2,14]. Other recommended approaches include determining the main message of the content (e.g. summarization), the use of textual enhancements (e.g. illustrations, mental images), question and answer drills (e.g. self-questioning) and practicing meta-cognition (e.g. through comprehension monitoring) [9]. However, the most successful reading strategies combine methods rather than one single technique [14]. There are several barriers to the adoption of multiple strategies within the classroom setting [14]. First, there is a large amount of training that is required for the teachers to become familiar with the strategies in order to employ them within the classroom setting. Second, there is a considerable time requirement for teachers to prepare the course materials. Third, getting the students to apply the strategies in daily life can be extremely complex. Therefore, the creation of environments that help relieve the teacher of some of these complexities would be of great benefit. There are several learning environments that aid students with their reading comprehension. Some of the more prominent are Project Listen, iSTART, Point&Query, and AutoTutor. Project Listen [13] creates an environment where children and ESL (English as a second language) students can read text out loud with the aim to improve this skill. The software listens to the reader and makes suggestions on how to improve their reading skills. One of the ways that the software increases reading comprehension is by asking the students questions about the text that they just read [6]. Presumably, the increase in reading comprehension and word comprehension do not translate into helping the students enhance the deeper comprehension skills discussed by Anderson and Pearson [4] since this is not the aim of this particular software. The remaining environments, however, do take aim at creating deeper understanding within the reading comprehension field. Point&Query augments current learning environments, such as hypertext and hypermedia, by providing learner controlled question and answer sessions that expose readers to deep causal questions [10]. 
Both AutoTutor and iSTART make use of animated agents and natural language dialogue to scaffold inquiry strategies, metacognition, and explanation construction [10]. AutoTutor generates why, what-if, and how style questions and then enters into a dialogue with the student to expose the deeper constructs of the topic. iSTART takes a coaching approach, teaching students how to construct and improve self-explanations combined with other metacomprehension strategies. Although these systems have demonstrated student learning gains and improvement in learning strategies, more can still be done.

Many of the aforementioned tools, created to aid in reading comprehension, are closed-ended systems that require a significant amount of time and energy to develop course content [15]. These closed-ended systems often rely on help requests to determine when a student is having a reading comprehension problem [6, 10, 13]. However, the vast majority of environments adopted by schools and post-secondary institutions, such as WebCT/Blackboard, Moodle, etc., are more open-ended in nature. These systems provide much more flexibility in content development, and greater ease in changing content, than closed-ended systems can. The problem with open-ended systems is that they provide no real support for student learning beyond delivering the content. However, open-ended systems do have good tracking facilities in place to capture student interaction with the system.

By making use of current data mining techniques and pedagogy aimed at improving student learning, it is possible to capture students' cognitive behavior from these open environments. Trace methodologies, such as capturing keystroke data, events, eye tracking data, etc., have demonstrated that data generated from a student's interaction with an environment can provide the necessary information to make cognitive and metacognitive interpretations [5]. This makes sense, since how a student consumes content has a direct effect on their comprehension of that content. If we know what task the student is currently working on, the difficulty of the task, and the current behavior of the student as they work on the task, we can make cognitive interpretations [5, 13, 17, 21].

Bloom's Taxonomy [3] of the Cognitive Domain provides a pedagogical framework for determining how cognitively difficult a question or task is. Using this framework, we can determine whether the student's current cognitive skills are appropriate for the task they are working on. Bloom's Taxonomy [7] and its subsequent revision [3] comprise three overlapping domains: cognitive, affective and psychomotor. The affective domain covers emotions, attitudes and values; the psychomotor domain covers physical skill mastery, coordination, and the like. The cognitive domain provides a method to classify educational objectives that relate to knowledge [21]. Within the cognitive domain are six hierarchical levels in order of increasing complexity: knowledge, comprehension, application, analysis, synthesis and evaluation (as revised by Anderson et al. [3]). The first three levels are considered foundational learning, based upon the ability to know and apply factual knowledge [21]. The last three levels are considered higher-level learning that is more abstract in nature [20].
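As a concrete illustration, the six levels and the foundational/higher-level split can be captured in a small enumeration. This is a minimal sketch; the names and the grouping follow the description above, not any code from the study:

```python
from enum import IntEnum

class BloomLevel(IntEnum):
    """The six levels of Bloom's cognitive domain, in increasing complexity."""
    KNOWLEDGE = 1
    COMPREHENSION = 2
    APPLICATION = 3
    ANALYSIS = 4
    SYNTHESIS = 5
    EVALUATION = 6

    @property
    def is_higher_level(self) -> bool:
        # Levels 1-3 are foundational; levels 4-6 are higher-level, abstract learning.
        return self >= BloomLevel.ANALYSIS

assert not BloomLevel.APPLICATION.is_higher_level
assert BloomLevel.SYNTHESIS.is_higher_level
```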
Bloom had originally assumed that one could not achieve the higher levels without first mastering the lower levels of the hierarchy [7]. However, it appears that it is possible to work at the higher levels on some topics without first mastering the lower levels [3]. Wankat and Oreovicz [19] provide examples of how to apply Bloom's taxonomy to an engineering domain. Knowledge, or recall, involves the descriptions, definitions, generalizations and other routine information about a topic. Comprehension involves understanding the technical representations of a topic, including the translation, interpretation and extrapolation of that topic. Application involves the use of topical abstractions in explicit situations, such as the use of rules, procedures and theories to perform some computation. Analysis involves breaking a problem into its principal parts in order to highlight the content's hierarchy and properties; connections and structure found within the content are defined and clarified. Synthesis involves putting together all the constituent parts of a problem into a coherent system or solution, which can be very difficult since the process is open-ended and there may be many possible solutions. Lastly, evaluation involves making conclusions about the value of materials used in a project or the methods used in that project; specific criteria must be satisfied or some standard of appraisal applied. Through the use of the different levels of Bloom's Taxonomy, and questions appropriately couched within the framework, it is possible to help learners overcome the various problems originally posed by Anderson and Pearson [4].

2. METHODOLOGY

An experiment was designed to look for patterns of student behavior in a reading comprehension task. Students interacted with a learning environment designed to emulate hypermedia courses offered in post-secondary institutions, where written content is presented along with questions about that content. The students could view the content and/or questions in any order or manner they chose, with no constraints applied to their interaction with the system. In keeping with trace methodology approaches, all interactions/events with the content and questions were recorded and time-stamped, including events such as mouse clicks, mouse wheel movements, which item was clicked or selected, and so on.

To aid in determining what part of the document was currently being read, a small scrollable text box that displayed seven lines of text was used, as shown in Figure 1. The size of the text box served two purposes. First, it takes no more than one minute to read the approximately 77 words contained within the text box; although not directly used in the analysis, this could be used to determine whether the individual was distracted from the task at hand. Second, it provided a means to determine how much time the students spent, and how quickly they read, over the portions of the document that contained the answers to the various questions. The questions could be selected in any order, and any text the students had entered into the answer text box was saved and displayed when the corresponding question was selected. No participant was observed to have, or reported, any difficulty operating the interface.

Figure 1: Screen capture of the interface.

The questions were developed using Bloom's Taxonomy action verbs [3, 7]. Bloom and Anderson created a list of verbs that direct the way a question should be answered. These verbs correspond to a level within Bloom's taxonomy; placing the action verb at the beginning of the question frames the way the question should be answered [7]. For example, Bloom's lowest level, knowledge, contains the action verb 'list'. Since the task of the knowledge level is to remember previously learned information, successfully listing something that the student has previously read demonstrates that the student has mastered that level of cognitive difficulty for that content.
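To make the idea concrete, such a verb list can be encoded as a lookup table that maps a question's leading action verb to the Bloom level it signals. The verbs below are common examples from published Bloom verb lists; the study's actual question set and verb list are not reproduced here:

```python
# Illustrative subset of Bloom's Taxonomy action verbs (levels 1-6);
# not the exact verb list used to author the study's questions.
BLOOM_ACTION_VERBS = {
    "list": 1, "define": 1, "recall": 1,             # knowledge
    "explain": 2, "summarize": 2, "interpret": 2,    # comprehension
    "apply": 3, "compute": 3, "demonstrate": 3,      # application
    "analyze": 4, "compare": 4, "differentiate": 4,  # analysis
    "design": 5, "construct": 5, "formulate": 5,     # synthesis
    "evaluate": 6, "justify": 6, "appraise": 6,      # evaluation
}

def bloom_level_of(question: str) -> int | None:
    """Return the Bloom level signalled by the question's leading action verb."""
    first_word = question.split()[0].lower().strip(",.:")
    return BLOOM_ACTION_VERBS.get(first_word)

print(bloom_level_of("List the complaints filed against Facebook."))  # -> 1
```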
Questions at various Bloom levels were presented to the participants in randomized order. All of the questions were present on the screen at all times and could be selected in any order by the student. Questions were scored in order to provide a metric for how well the students comprehended the content. To deal with the subjective nature of scoring question answers, a rubric was created for each question according to the principles laid out in [16]. The rubric was revised several times during the testing phase of development to take into account the various types of answers that were submitted. For the lower Bloom levels, the answers generally came from one direct location within a document, so the scoring was fairly simple. For the higher-level Bloom questions, information from multiple sources was expected, and it was also expected that the students would bring their own prior knowledge to bear on the answer. It was here that revisions were required, as the beta testing group interpreted the questions in unforeseen ways.

The experiment was broken into two conditions. The first condition provided the students with a single document to read, with questions about that content based upon the lower levels of Bloom's Taxonomy. All of the answers could be found within the document. The document chosen was a fairly technical document on Canadian privacy law as it applies to Facebook, so that the participants would not have much prior knowledge of the specific subject matter. The participants were given 30 minutes to finish reading the document and answering four questions.

The second condition provided the students with two more documents in addition to the first document. The purpose of this condition was to better test the higher levels of Bloom's Taxonomy, which require synthesis and evaluation and therefore more information and more documents. The second document was instruction on how to implement advanced privacy features not commonly used within Facebook, and the third document was a high-level overview of the privacy settings used within Facebook. Again, the answers to the high-level questions could be found within the documents provided; however, fully answering them required information from more than one document. For this second condition, 90 minutes were allotted, as the questions were more difficult and there were two new documents to read in order to generate complete answers. Two questions were aimed at the prior reading done in the first condition: one was repeated from the first condition, and a second was new but based solely on the information found in that first document. The remaining six questions were new and tested various levels of Bloom's Taxonomy.
It was possible for the students to answer the questions in increasing order of difficulty; however, they would have to purposefully select that order, since the order in which the questions were presented was random. Since the amount of time required to participate in both conditions might be a factor in participant involvement, the conditions were designed so that they could be run separately and with different participants, depending on the participants' wishes. In the actual running of the experiment, the majority of the participants moved from the first condition right into the second condition with no delay. The participants were adult students enrolled in a grade 12 Saskatchewan Institute of Applied Science and Technology (SIAST) Adult Education English course. There were 17 participants for the first condition and 11 for the second, with an average age of 26.

3. RESULTS

The 28 participants generated over 8500 events in total across both conditions. Events such as mouse clicks on a specific button or object and mouse wheel scrolling were captured. Each event was time-stamped and recorded with the user-id, event-id, current question-id, current document-id, and position within the current document. This gave us what task/question the student was currently working on, which document they were working in, where in the document they were, and what event they generated. For example, if the student moved the scroll wheel of the mouse to move down in the document, we could determine from the time-stamp and position data how quickly they were moving and what material they were reading. With this information we can begin to deduce student behavior as they work at completing the various questions.

To determine how much reading the students were doing, the time-stamp data was processed so that reading, scanning and scrolling navigation times could be calculated for each interaction/event. The time cutoffs used to distinguish reading from scanning from scrolling fit with other document navigation research [1]: any time between events greater than five seconds was classified as reading; any time greater than two seconds but less than five seconds was classified as scanning; and any time less than two seconds was classified as scrolling. The reading time also encompassed time that the participant spent thinking about the answer. Of the 8500 events captured across the 28 participants, only 13 had a gap greater than two minutes before the next event, and only 33 had a gap greater than one minute. Given the time it takes to read the content in the text box, such long gaps made up too small a percentage of the data to warrant a separate classification. The total amount of time each student spent in the experiment was used to compute the ratio of time that student spent reading, scanning and scrolling, and this ratio was then broken down into reading, scanning and scrolling ratios for each individual question. When combined with the level of difficulty of each question, as determined by Bloom's taxonomy, this made it possible to tie student reading behavior to the difficulty of the task.
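A sketch of this processing step is shown below, assuming a minimal event record with the fields described above. The field names and log layout are illustrative assumptions, not the study's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One time-stamped interaction; field names are illustrative."""
    user_id: int
    question_id: int
    document_id: int
    position: int      # current position within the document
    timestamp: float   # seconds since the start of the session

def classify_gap(seconds: float) -> str:
    """Classify the time between consecutive events using the paper's cutoffs."""
    if seconds > 5.0:
        return "reading"    # also covers time spent thinking about the answer
    if seconds > 2.0:
        return "scanning"
    return "scrolling"

def behavior_ratios(events: list[Event]) -> dict[str, float]:
    """Reading/scanning/scrolling time ratios for one student on one question."""
    totals = {"reading": 0.0, "scanning": 0.0, "scrolling": 0.0}
    ordered = sorted(events, key=lambda e: e.timestamp)
    for prev, curr in zip(ordered, ordered[1:]):
        gap = curr.timestamp - prev.timestamp
        totals[classify_gap(gap)] += gap
    total_time = sum(totals.values()) or 1.0  # avoid division by zero
    return {kind: t / total_time for kind, t in totals.items()}
```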
In order to see if there were students who behaved similarly at different levels of difficulty, we implemented the Forgy method for k-means clustering with d = 3 dimensions and k = 4 [11]. Hamerly et al. [11] demonstrated that the Forgy method was the preferred way to initialize the standard k-means (also known as Lloyd's) algorithm [12]. The dimensions we chose were the reading, scanning and scrolling axes. k = 4 was chosen because our sample size was small; with more than four clusters, some clusters contained too few data points to be statistically analyzed. Since the algorithm randomly chooses its initial centroid points, no researcher bias enters into the initial sets of clusters.

In order to find as many interesting clusters as we could, the Forgy k-means algorithm was iterated multiple times. We defined interesting clusters as those that elicited either positive or negative reading, scanning or scrolling behaviors, where a positive behavior is one that results in a good grade and a negative behavior is one that results in a poor grade. Clusters that presented with both positive and negative behaviors were deemed less interesting. Each time an interesting cluster was found, its centroid was recorded. Once multiple interesting centroids had been found, the most interesting one was hard-coded as a starting centroid: the hard-coding removes one of the random initializations from the Forgy initialization and inserts the most interesting centroid in its stead. For example, the experiment used k = 4 random clusters in the initialization; with the hard-coded cluster added, the algorithm initializes with three random centroids and one hard-coded centroid. The algorithm was then run again with one hard-coded centroid and three randomly chosen centroids to see how the other random clusters interacted, since how a cluster is initialized is known to affect how the other clusters form [11]. If a new interesting cluster was discovered that was more predictive of students' behavior than a previous closely related centroid, the old centroid was removed in favor of the new one. If no centroids more interesting than the hard-coded centroid were discovered, the second most interesting centroid was also hard-coded, the remaining two centroids were left random, and the process was repeated with two hard-coded centroids. A third hard-coded centroid was added in the same fashion, and the process continued until all four initialization centroids were hard-coded.
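The seeding procedure just described can be sketched as follows. This is not the authors' implementation; it is a minimal illustration using scikit-learn, where Forgy initialization amounts to picking random data points as starting centroids and any previously discovered "interesting" centroids are passed in explicitly:

```python
import numpy as np
from sklearn.cluster import KMeans

def forgy_kmeans(X: np.ndarray, k: int = 4, fixed_centroids=None, seed=None) -> KMeans:
    """k-means over (reading, scanning, scrolling) ratio rows.

    Forgy initialization chooses the remaining starting centroids uniformly
    at random from the data points; `fixed_centroids` holds any hard-coded
    centroids kept from earlier runs.
    """
    rng = np.random.default_rng(seed)
    fixed = (np.empty((0, X.shape[1])) if fixed_centroids is None
             else np.asarray(fixed_centroids, dtype=float))
    random_rows = X[rng.choice(len(X), size=k - len(fixed), replace=False)]
    init = np.vstack([fixed, random_rows])
    return KMeans(n_clusters=k, init=init, n_init=1).fit(X)

# Usage: hard-code the Heavy Reading centroid (80:10:10), leave three random.
# ratios = np.array([...])  # one (reading, scanning, scrolling) row per student/question
# km = forgy_kmeans(ratios, k=4, fixed_centroids=[[0.8, 0.1, 0.1]], seed=0)
```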
Over multiple iterations, six interesting clusters were discovered, two of which contained too few data points to be included in the statistical analysis. The following clusters proved to be statistically interesting with respect to the Bloom level:

- Light Reading cluster: 50% reading, 30% scanning, 20% scrolling (50:30:20)
- Light Medium Reading cluster: (60:30:10)
- Heavy Medium Reading cluster: (70:20:10)
- Heavy Reading cluster: (80:10:10)

The two remaining clusters, Medium Scrolling (10:10:60) and Medium Scanning (10:60:10), are clusters we expect to play a more important role in future experiments; however, due to our sample size, they could not be used in our statistical analysis.

An ANOVA was performed on each of the clusters to see whether a statistically significant relationship could be found between the different reading behaviors, as clustered by k-means, and the Bloom levels of the questions within the experiment. The tests were performed at the α = 0.05 level. Questions at Bloom levels 1, 2, 3, 5 and 6 were used in this experiment; no Bloom level 4 questions were included, in order to give learners time to answer more questions at levels 5 and 6 within the overall time constraints. Table 1 shows that, with the exception of level 5, all of the Bloom levels were statistically significant. The null hypothesis for these tests is that the cluster means do not vary according to the Bloom level being tested; in other words, the reading, scanning and scrolling means should be the same for all of the clusters found by k-means. Table 1 shows that the differences found between the clusters for each of the Bloom levels were not due to random chance: the p-values indicate that, in all but two cases, there is a very small chance of obtaining these results if no real difference between the groups exists. This indicates that the students' reading, scanning and scrolling behaviors, captured by the system and then clustered, are significantly different from one another as they relate to the level of Bloom's taxonomy. For example, those students classified as Light Readers based on their reading, scanning and scrolling ratios for Bloom level 1 were significantly different from those classified as Light Medium Readers for the same level. However, the ANOVA itself cannot determine exactly which cluster differs significantly from which; it can only tell us that there is a significant difference between some of the groups in the analysis. Further analysis, discussed below, is needed to see which of the clusters are significantly different from each other.
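Such a per-level test can be sketched with scipy; the record layout and field names here are assumptions for illustration:

```python
from scipy import stats

def anova_for_bloom_level(records: list[dict], level: int, measure: str = "reading"):
    """One-way ANOVA: does the mean `measure` ratio differ across k-means
    clusters for questions at the given Bloom level?

    Each record is assumed to hold 'bloom_level', 'cluster', and one ratio
    per behavior, e.g. {'bloom_level': 1, 'cluster': 'Light', 'reading': 0.5}.
    """
    groups: dict[str, list[float]] = {}
    for r in records:
        if r["bloom_level"] == level:
            groups.setdefault(r["cluster"], []).append(r[measure])
    f_stat, p_value = stats.f_oneway(*groups.values())
    return f_stat, p_value  # reject the null hypothesis when p_value < 0.05
```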
Although inclusion in a cluster does not completely predict scores, it is indicative of overall performance. Take, for example, question 2 in the first condition (low-level Bloom with a single document), which was designed to force the students to scan through the document, as they needed to count the number of times a certain event occurred in the document (such as a successful appeal on a complaint about Facebook to the Canadian Privacy Commission). This type of problem is often present, in many forms, in academia and the workplace, where it is necessary to arrive at a solution within time constraints. All of the students in the Light Reading (50:30:20) cluster, which was higher in scanning and scrolling times, achieved full or nearly full marks. Correspondingly, the students in the Heavy Reading (80:10:10) cluster scored no better than 50%, with over half of the students in that cluster scoring 0%. Since the source materials were present for the duration of the experiment and there were time constraints, the Heavy Reading strategy is not the best strategy in this situation. This result is somewhat surprising, since Heavy Reading is generally accepted as a good cognitive strategy in a reading comprehension task. In this case, the cognitive load required to answer this type of question, the time limitations of the experiment, and the availability of the source materials make the Light Reading strategy the better choice: reducing cognitive load by scanning and scrolling through the document, rather than committing the information to memory through Heavy Reading, allowed the participants to perform better on this type of task. It should be noted that for other tasks a Heavy Reading strategy is the best choice. Furthermore, in situations where the source materials are not available during the task, the Heavy Reading strategy is most likely the best choice regardless of the task given.

The Heavy Reading strategy proved to be the most successful strategy as the difficulty of the questions, as measured by Bloom's Taxonomy, increased; participants using it achieved better marks than those who chose a Light Medium Reading strategy. For example, question six of the second condition required the participants to put together various thoughts and ideas about Facebook privacy policy from multiple documents into a complete whole that did not exist in any of the documents (Bloom level 6). For this problem the students fell into multiple clusters. Each document had its own set of events tying the reading, scanning and scrolling ratios to that document, which provides a mapping of how each student used each document to answer a particular question. To get a good grade, students needed to fall into the Heavy Reading category on all the documents required to fully answer the question, and those who did scored well. Those who performed Heavy Reading on only one of the required documents did not score above 30%; those who performed Heavy Reading on two of the required documents scored no higher than 83%; and those who performed Heavy Reading on all of the documents scored no lower than 83%, and up to 100%. The students who used the Light Medium Reading strategy scored 0%, with one exception: a student who scored 30%, but whose answer contained no content from any of the documents, relying instead on extraneous information from previous experience.

The Light Reading strategy did not appear above Bloom level 3, while the Light Medium Reading strategy appeared throughout the Bloom levels. At Bloom levels 5 and 6, the participants who chose the Light Medium Reading strategy did not receive good grades. For them, the availability of the source documents did not help in answering the more challenging questions: fully answering these questions required recalling information from a variety of sources, and instead of using the source material, probably because they could not recall where it was or whether it was present, they used incorrect information from some source outside of the experiment. It should be noted that they did not access supplementary material from either books or the Internet during the experiment.

These patterns of behavioral clustering being predictive of marks do not hold in all cases. For example, in the Light Reading (50:30:20) cluster for question 3 in the second condition (higher-level Bloom with multiple documents), 50% of the students achieved 100% while the other 50% received 0%. Since we captured the current reading position within the document with each event, we can determine the amount of time spent reading, scanning and scrolling over the portions of the document that contain the answer. When analyzed, this information fully accounts for the differences in scores found within this cluster.
For example, those students who received 0% spent the majority of their time scrolling and scanning, compared to those who received full scores, who spent much more time reading over the portion of the document that contained the answer.

The Low Level and High Level analysis in Table 1 shows that when Bloom's levels are broken down into two categories, low level (levels 1, 2 and 3) and high level (levels 5 and 6), there are significant differences between the clusters for both categories. The Tukey-Kramer test discussed below (Table 5) shows that all of the clusters are significantly different from one another as well. Interestingly, when we combine all the levels together to see whether the clusters by themselves are statistically different, we get no significant results. In other words, higher-level and lower-level meta-cognitive reading strategies seem to elicit different behavior on the part of learners.

To find out which clusters differed significantly from each other, a Tukey-Kramer analysis is required. A Tukey-Kramer analysis allows a pairwise comparison of each of the clusters, and it allows the comparison of groups that do not contain the same number of students. The minimum significant difference (MSD) value was used to determine whether each pairwise comparison was significant while correcting for the multiple comparisons. The numbers in the top right portion of Tables 2 through 5 show the Tukey-Kramer MSDs; the numbers in the lower left show the observed absolute difference in means between each pair of groups. The numbers in the lower left marked with an asterisk are deemed significant because they are larger than their corresponding MSD in the top right of the table.

Table 2 shows that all of the clusters were significantly different from each other. The same was found for all of the other Bloom levels except levels 2 and 5. Table 3 shows that there are significant differences between most of the groups at Bloom level 2, with the exception of the Heavy Medium Reading cluster and the Heavy Reading cluster. Although the k-means algorithm placed these reading, scanning and scrolling ratios into two different clusters, the actual differences between the ratios were small, so grades tended to be higher in the Heavy Medium Reading cluster and at the same time lower in the Heavy Reading cluster. Situations like this one, where the ratios were close together, made us wonder whether a breakdown by individual Bloom levels was the best predictor, or whether the levels should be more coarse-grained and grouped into a high-level and a low-level Bloom category instead.

One of the major problems with this experiment was that we did not have a large enough sample size for the higher levels of Bloom as tested in the second condition. Table 4 shows that no significant differences were found between any of the clusters at Bloom level 5. A more in-depth analysis showed that most of the students chose a similar strategy to answer those questions, and therefore no significant differences were found between the clusters.

Table 1: One-way ANOVA for Bloom level.
Table 2: Tukey-Kramer analysis, Bloom level 1 (* denotes significant differences).
Table 3: Tukey-Kramer analysis, Bloom level 2.
Table 4: Tukey-Kramer analysis, Bloom level 5.
Table 5: Tukey-Kramer analysis, low-level Bloom.
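A pairwise comparison of this kind can be run with statsmodels, whose Tukey HSD implementation applies the Tukey-Kramer correction for unequal group sizes; the variable names below are illustrative:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def tukey_kramer(ratios: np.ndarray, cluster_labels: np.ndarray, alpha: float = 0.05):
    """Pairwise comparison of cluster means on a behavior ratio.

    `ratios` holds one observation per row (e.g. reading ratios) and
    `cluster_labels` names the k-means cluster each observation fell into.
    A pair is significant when its |mean difference| exceeds the MSD.
    """
    return pairwise_tukeyhsd(endog=ratios, groups=cluster_labels, alpha=alpha)

# Usage (hypothetical data):
# result = tukey_kramer(reading_ratios, labels)
# print(result.summary())  # the 'reject' column flags significant pairs
```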
With a larger sample size, we believe that even this would become statistically significant. The fact that a significant difference was found at Bloom level 6 may just be an artifact of the data; however, the significant differences found in the lower Bloom levels 1 through 3, given the slightly larger N, seem to imply that with a larger N we would see those same differences at the higher Bloom levels. For the low-level and high-level Bloom groupings, significant differences were found between all of the clusters; Table 5 shows the significant differences found for the low-level Bloom grouping.

Figure 2: Graph of reading ratio vs. Bloom level.

Next we analyzed how the clusters related to the Bloom levels. Figure 2 shows that Light Reading behavior was not found in any questions above Bloom level 3, which seems to indicate that Light Reading behavior is not conducive to the more cognitively difficult tasks. The Heavy Medium Reading cluster had only two instances in questions above Bloom level 3. The decreasing use of Heavy Medium Reading as the Bloom level of difficulty increases shows that some of the students adopted a heavier reading behavior than the Light Reading behavior seen at the lower Bloom levels: they gave up the Heavy Medium Reading strategy for the Heavy Reading strategy used more at the higher Bloom levels. The Heavy Reading cluster was found at every Bloom level. As the Bloom levels incre
