The Rise of the Super Experiment


ABSTRACT

Traditional experimental paradigms have focused on executing experiments in a lab setting and eventually moving successful findings to larger experiments in the field. However, data from field experiments can also be used to inform new lab experiments. Now, with the advent of large student populations using internet-based learning software, online experiments can serve as a third setting for experimental data collection. In this paper, we introduce the Super Experiment Framework (SEF), which describes how internet-scale experiments can inform and be informed by classroom and lab experiments. We apply the framework to a research project implementing learning games for mathematics that is collecting hundreds of thousands of data trials weekly. We show that the framework allows findings from lab-scale, classroom-scale, and internet-scale experiments to inform each other in a rapid, complementary feedback loop.

"1. INTRODUCTION. Web-based software is creating an explosive growth in the use of randomized controlled experiments in education, due to the relative ease with which users can be randomly assigned to different experimental conditions. Scientists are beginning to recognize the coming data surge and developing new ways of analyzing data at ""internet scale."" The vastly increased scale of subject populations online can produce a categorically different mode of experimentation in education. For this reason, we propose a new experimental framework that takes advantage of rapid internet-scale experimentation, while retaining the control of lab-scale and classroom-scale experiments. Randomized controlled trials are regularly used to drive design decisions on the internet. In its simplest form, A/B testing is a form of experimentation where one of two advertisements are randomly delivered to each incoming site visitor. This allows advertisers to determine which advertisement results in improved outcomes (such as a greater click-through rate) [3]. Multiple tools exist to support website optimization, including the free Google Site Optimizer that supports both A/B tests and multi-variable testing. Recently, free-to-play online game companies, such as Zynga, have made use of large-scale optimization experiments with their large number of online players. By randomly assigning players to hundreds of different game design configurations, they can optimize the game design to maximize the conversion of players to paying customers [7]. 2. Internet Scale Research in Education. Internet-scale research introduces new potential methods in Educational Research. For instance, optimization experiments like Response Surface Methods, are a common applied research method for improving industrial process outcomes. These experimental designs showed early promise for improving educational outcomes [5], but because the designs would have required many hundreds of students, they were expensive and impractical. Internet-scale research can now support these optimization experiments, along with these other experimental advantages: Increased number of conditions. With tens of thousands of “user-subjects,” internet-scale research studies present the opportunity for researchers to run dozens—even hundreds—of different experimental conditions simultaneously. This easily contrasts with lab or field-scale studies, where available resources and subject pools typically constrain experimental designs to fewer than 8 experimental conditions. Furthermore, with fewer conditions, experiments can be conducted within days, rather than months. Ability to measure “true” task engagement. Internet-scale research is also uniquely suited for measuring task engagement. Because the researcher typically lacks control over participants (they can quit far more easily than in lab or classroom experiments), the internet is an ideal setting for investigating user motivation. If players assigned to condition A play significantly longer than players in condition B (i.e., were engaged in the task for longer), then condition A can be said to be more engaging than condition B. The ability to measure and compare engagement makes it possible to measure how different design elements and configurations affect player engagement. Increase in external validity. A third advantage of internet-scale research is the high external validity—experiments are conducted with actual “real-world” users. 
Increase in external validity. A third advantage of internet-scale research is its high external validity: experiments are conducted with actual "real-world" users. While the lack of control over subjects can result in noisy data, this noise is useful for preventing the over-fitting of predictive models that are constructed for use "in the wild."

Greater access to all users. A fourth advantage of internet-scale research is that informed consent is not required when users are anonymous. Even with educational exemptions to informed consent, parental opt-out forms can still pose a barrier to many field-based educational studies. While researchers could potentially make use of informed consent (and thus obtain non-anonymous data), anonymous data collection is likely to remain a characteristic of most large internet-scale research.

Of course, the lack of information about participants is also a key drawback of internet-scale research. Broadly speaking, internet-scale studies cannot collect rich information about participants. Therefore, these studies are unlikely to be suitable when research questions require demographic data, detailed pre/post tests, participant observation, talk-aloud protocols, or any kind of psychophysiological measure. Finally, the lack of participant control means that internet-scale studies may not be appropriate if repeated participation over time is required. Given these drawbacks, it is clear that traditional lab-based experiments and structured field trials still provide valuable data that internet-scale experiments cannot. However, there is much to be gained from internet-scale studies. The Super Experiment Framework (SEF) seeks to illustrate how different scales of experimentation can productively inform one another.

The SEF, seen in Figure 1, is split into three general experimental components that are roughly delineated by scale. Lab-scale experiments are smaller, highly controlled studies that take place in a lab or a single classroom, generally not exceeding 50 participants. School-scale experiments are formal experiments that take place in multiple classrooms or schools, consisting of hundreds to thousands of participants. Internet-scale experiments are delivered informally online to thousands to millions of participants.

Figure 1. The Super Experiment Framework, showing how each of the component scales informs the others.

In the SEF, each component provides an experimental level that can be used to answer specific questions that might be difficult or impossible to answer using one of the other components. Further, the various components can be used to expand or validate the findings of the other components. A feedback loop can also be used within the framework, where internet-scale experiments identify areas of focus for lab-scale experiments, whose findings can then be validated in school-scale experiments. An overview of each of the SEF components can be seen in Table 1.

School-scale and lab-scale experiments typically recruit subjects and then randomly assign them to different experimental conditions as part of a single experiment. Internet-scale research, however, creates situations where multiple experiments draw randomly from the same pool of subjects. Just as a single experiment contains multiple experimental conditions, the SEF contains multiple experiments. Because the different experiments are derived from the same pool of random assignment, experimental conditions that are not part of the same experiment may still be compared to one another, if desirable. While there may be few immediate benefits to this comparison, the super experiment is a unique characteristic of internet-scale research.
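This shared pool can be pictured as a single random-assignment step that routes each new anonymous player into one of several concurrently running experiments, and then into a condition within that experiment. The sketch below illustrates such a router; the experiment names, traffic weights, and function are our own invented example, not the infrastructure described in this paper.

    import random

    # Hypothetical registry of concurrently running experiments. Each
    # experiment receives a share of incoming traffic and has its own conditions.
    EXPERIMENTS = {
        "tickmark_guides": {"weight": 0.5,
                            "conditions": ["none", "halves", "thirds", "fourths", "tenths"]},
        "adaptive_sequencing": {"weight": 0.3,
                                "conditions": ["random_order", "bkt_adaptive"]},
        "feedback_style": {"weight": 0.2,
                           "conditions": ["text_only", "text_plus_animation"]},
    }

    def assign(player_id: str):
        """Route one anonymous player to an experiment, then to a condition.

        Seeding on the player id keeps the assignment stable if the same
        player returns, without storing any identifying information.
        """
        rng = random.Random(player_id)
        names = list(EXPERIMENTS)
        weights = [EXPERIMENTS[n]["weight"] for n in names]
        experiment = rng.choices(names, weights=weights, k=1)[0]
        condition = rng.choice(EXPERIMENTS[experiment]["conditions"])
        return experiment, condition

    print(assign("anon-8c1f24"))  # e.g. ('tickmark_guides', 'halves')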
Therefore, the use of the term "super experiment" in the Super Experiment Framework simply refers to the broad network of information flow between different scales of experimentation: from the lab scale to the school scale to the internet scale.

Table 1. Components of the Super Experiment Framework.

3. IMPLEMENTATION EXAMPLE

The need for the SEF arose through our work creating online games for learning. The number of potential experiments was large, and the opportunity to field the games at each of the scales identified in the SEF created the need for a feedback loop: many experiments executed at internet scale narrow down the potential experiments to test at the more controlled school scale.

"Battleship Numberline" (BSNL), an online educational game, benefits from the Super Experiment Framework. Designed to improve number sense among elementary and middle school students, BSNL provides practice estimating numbers on a number line within four content domains: whole numbers, fractions, decimals, and measurement [4]. The game narrative involves defending Numbaland Island from invading robot pirates by firing projectiles at their ships and submarines. BSNL involves two basic modes: naming numbers and placing numbers. In the naming mode, players type the number that corresponds to the location of an enemy ship positioned on a number line between two marked endpoints. In the placement mode, the player is given the numeric location of a hidden submarine (e.g., "Submarine spotted at 1/3") and must click on the location that they believe corresponds to the number. After the player has typed a number or clicked on the number line, a projectile drops vertically from the top of the screen to the designated location on the number line. Animation and text-based feedback communicate the player's accuracy after every round.

A primary goal of our research has been to understand how different game design factors affect player learning and engagement. To investigate these factors systematically, we implement them as flexible XML-based parameters that can be determined at game runtime. We are then able to create online experiments that randomly assign new players to different game sequences. During gameplay, BSNL generates an online data log of the task context (the XML parameters above) along with data describing the player's performance on each opportunity. For each item, we log the player's reaction time, their accuracy, and a binary field indicating whether the player was successful. Logs are then imported into the PSLC DataShop [2], which allows for secondary analysis of player performance and learning. The hit-rate measure is essential for enabling DataShop to plot learning curves of error rate over time. By labeling different items in the game with different knowledge components (e.g., reducible fractions, unit fractions), we can plot learning curves for each knowledge component. Learning curves can also be described in terms of fluency [1], where we plot the reduction of reaction time over opportunities played. In addition to these measures of learning and performance, we investigate player engagement through two measures: the total number of items played and the total amount of time spent playing. These two metrics correspond to our construct of intrinsic motivation, or player engagement.
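To make the logging scheme concrete, the sketch below shows how per-opportunity trial records of the kind described above might be aggregated into the two engagement metrics and an error-rate learning curve. The record fields and names are hypothetical stand-ins; the actual BSNL log schema is not reproduced here.

    from collections import defaultdict

    # Hypothetical trial records in the spirit of the BSNL logs: one row per
    # opportunity, with reaction time (s), a binary hit flag, and a knowledge
    # component (KC) label. Field names are our own, not the actual schema.
    trials = [
        {"player": "p1", "kc": "unit_fractions", "opportunity": 1, "rt": 4.2, "hit": 0},
        {"player": "p1", "kc": "unit_fractions", "opportunity": 2, "rt": 3.1, "hit": 1},
        {"player": "p2", "kc": "unit_fractions", "opportunity": 1, "rt": 5.0, "hit": 0},
        {"player": "p2", "kc": "unit_fractions", "opportunity": 2, "rt": 2.8, "hit": 1},
    ]

    # Engagement: total items played and total time spent, per player.
    items = defaultdict(int)
    time_spent = defaultdict(float)
    for t in trials:
        items[t["player"]] += 1
        time_spent[t["player"]] += t["rt"]
    for p in items:
        print(f"{p}: {items[p]} items, {time_spent[p]:.1f}s total")

    # Learning curve: error rate at each opportunity count, per KC.
    curve = defaultdict(list)
    for t in trials:
        curve[(t["kc"], t["opportunity"])].append(1 - t["hit"])
    for (kc, opp), errors in sorted(curve.items()):
        print(f"{kc}, opportunity {opp}: error rate = {sum(errors) / len(errors):.2f}")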
The number of potential parameter settings in BSNL makes it a powerful tool for answering many research questions, but at the same time the number of possible settings makes it difficult to decide which settings to test in traditional lab or school settings. For this reason, it is an ideal candidate for use in the SEF. Next, we show how the results of experiments at one scale inform new experiments at a different scale.

Lab scale informing school scale. Using a lab experiment to inform a field trial at a school is one of the most common experimental designs, and it remains an important part of the SEF. We performed a lab-scale experiment, which is now being validated at the school scale. This experiment was conducted at a small Catholic liberal arts university. Although the college is co-educational, its focus is on women's education, and 89% of the participants were women. Participants were 18 students in an eight-week first-year seminar course, which met once per week. Students chose this seminar for its focus on mathematics games. Over 5 weeks, we administered a short (typically one-minute) paper-and-pencil pretest, asked students to play a specific fluency game for approximately half an hour, and then gave a posttest identical in content to the pretest. In all but the first week, the pretest was preceded by a delayed posttest, which was a repeat of the posttest from the previous week's materials. In four of the five experiments, significant improvement was shown on the delayed posttest, and three of the five showed immediate results. Effect sizes were also quite large, ranging from 0.4 to 2.4, indicating that these results are not only significant but substantial.

Prior to the first experiment, students were given a survey about their confidence in mathematics (containing questions like "I am sure that I can learn math.") and about test anxiety (containing questions like "I am so nervous during a test that I cannot remember facts that I have learned."). The two scales were mixed in a 16-item form, and students were asked to rate each statement from 1 ("strongly disagree") to 5 ("strongly agree"). Student confidence increased significantly, t(14) = -3.2, p < .01, d = 0.4, but there was no significant change in test anxiety, t(14) = -3.1, n.s. Due to the success of this lab-scale experiment, a similar school-scale experiment is now being conducted in multiple college classrooms over an entire semester. Unlike at the lab scale, the researchers are not present in these classrooms, but we expect to see similar results.

School scale informing internet scale. BSNL was designed based on an existing body of literature investigating number line estimation in the laboratory [6]. The game was playtested with 8 elementary school students to refine usability issues in the design. Following this, a school-scale study was conducted with 119 students in grades 4-6. Students showed significant improvement in hit rate from the first to the second opportunity (see Figure 2), and students demonstrated significant improvements in the estimation of fractions on a number line after 20 minutes of gameplay. Moreover, 82% of players (74% of females, 92% of males) reported that they wanted to play the game again [4]. The data from these classroom studies were imported into the PSLC DataShop to test various knowledge component (KC) models. We identified a KC model based on the various regions of the number line. This KC model was then used to produce a Bayesian Knowledge Tracing (BKT) adaptive sequencing algorithm, sketched below.
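For context, Bayesian Knowledge Tracing maintains, per knowledge component, the probability that a student has mastered the KC, updating it after each observed success or failure and using it to decide what to present next. The following is a minimal sketch of the standard BKT update with illustrative parameter values; it is not the actual BSNL sequencer, whose fitted parameters and selection policy are not given here.

    # Minimal Bayesian Knowledge Tracing (BKT) update, with illustrative
    # parameters. P_INIT: P(known at start); P_LEARN: P(unknown -> known after
    # an opportunity); P_SLIP: P(wrong | known); P_GUESS: P(right | unknown).
    P_INIT, P_LEARN, P_SLIP, P_GUESS = 0.3, 0.1, 0.1, 0.2

    def bkt_update(p_known: float, correct: bool) -> float:
        """Posterior P(known) after one observed response, then a learning step."""
        if correct:
            num = p_known * (1 - P_SLIP)
            den = num + (1 - p_known) * P_GUESS
        else:
            num = p_known * P_SLIP
            den = num + (1 - p_known) * (1 - P_GUESS)
        posterior = num / den
        return posterior + (1 - posterior) * P_LEARN

    # A simple adaptive policy: practice the least-mastered KC next.
    mastery = {"unit_fractions": P_INIT, "reducible_fractions": P_INIT}
    mastery["unit_fractions"] = bkt_update(mastery["unit_fractions"], correct=True)
    next_kc = min(mastery, key=mastery.get)
    print(mastery, "-> next item drawn from:", next_kc)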
This algorithm was then tested online against a randomly sequenced level. Preliminary results suggest that the BKT adaptive sequence did not result in significantly greater player engagement than the random sequence.

Figure 2. Average improvement from the first opportunity to the second opportunity, by item presented. The clear patterns of difficulty are used to generate knowledge component models in DataShop.

Internet scale informing school scale or lab scale. Internet-scale experiments can be useful for documenting the difficulty of different task configurations. This is useful in the field of EDM, as it allows for the generation of knowledge component models: different tasks are said to require different knowledge components if and only if the tasks result in different performance rates or learning curves. Therefore, by assessing the difficulty of instances over a broad task design space, we can understand how the task design space maps to various KC models.

For example, Rittle-Johnson, Siegler and Alibali found that tick marks supported the estimation of decimals on a number line [6]. To replicate and extend this work, we randomly assigned online players to 6 different conditions in both the decimal and whole number domains. Players encountered tick marks dividing the number line into tenths, fourths, thirds, or halves (midpoint), or no tick marks at all. An additional two conditions examined the interaction of an adaptive sequencing algorithm with tick marks at the midpoint. An overview of the experiments and conditions can be seen in Table 2. Over 80,000 internet users participated in the experiment; an experiment with this many conditions would be difficult to replicate in a lab or classroom. This broad investigation of the effects of guides enabled us to observe two unusual outcomes. First, there was an apparent interaction effect between our adaptive sequencing condition (termed "ITS") and the midpoint guides. Second, the tenths guides apparently increased player engagement in the decimal domain but decreased engagement in the whole number domain. These insights have led us to execute similar lab-scale experiments to replicate and better understand these specific results, for instance with a factorial analysis of the kind sketched after Table 2.

Table 2. List of experiments running concurrently, with a total of 64 conditions.
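As an illustration of how such an interaction might be tested, the sketch below fits a two-way model of engagement against sequencing condition and midpoint guides. The data frame, column names, and sample values are hypothetical, and a two-way ANOVA is only one reasonable choice of test; our statistical tooling is not prescribed by the framework.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Hypothetical per-player engagement data: total trials played, with the
    # sequencing condition (random vs. ITS) crossed with a midpoint-guide flag.
    df = pd.DataFrame({
        "trials_played": [22, 31, 18, 45, 27, 52, 19, 24, 38, 41, 16, 29],
        "sequencing":    ["random", "its"] * 6,
        "midpoint":      ["yes"] * 6 + ["no"] * 6,
    })

    # Two-way ANOVA: does the effect of adaptive sequencing on engagement
    # depend on whether midpoint guides are present?
    model = ols("trials_played ~ C(sequencing) * C(midpoint)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))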
4. CONCLUSIONS AND FUTURE WORK

Technology is changing the way we conduct experiments. Data are arriving faster, at larger volume, and at finer grain, and the traditional experimental paradigm no longer takes full advantage of them. Instead of focusing eScience efforts on analysis alone, we have created a framework that exploits internet-scale experiments while still producing valid findings in real classrooms. The main contribution of this work is the development of the Super Experiment Framework, which incorporates a feedback loop allowing experiments at different scales to inform each other. This has become possible, and even necessary, with the use of the internet to collect large amounts of experimental data. Internet scale allows for optimization experiments that would be too expensive to run at the field level. This is truly applied educational research that, as we have shown, provides insights that can inform more controlled lab- or school-scale experiments.

We also described our initial implementation of the SEF in a large project with broad scope and many interesting research questions. Traditional "one-way street" experiments, moving only from lab to school, are slow to produce findings. Our work shows how utilizing all three scales of experimentation leads rapidly to real, implementable insights.

What makes the framework possible is the accessibility of internet-scale experiments. The key barrier to internet-scale educational research is attracting large numbers of users. Research projects rarely invest in the high-quality software design and usability that are usually necessary to achieve widespread adoption. Once this quality is achieved, however, large numbers of users can be reached through collaborations with one of the many internet portals that aggregate educational content (e.g., Brainpop.com). Another challenge is instrumenting software to generate data logs that measure player performance, learning, and engagement. Log files should capture not only correctness information but also the amount of time players spend on an activity and the number of opportunities attempted. A third challenge is configuring the software to support experimental designs. This involves abstracting design variables in the software's design space so that different instances of the software can be created quickly; for instance, we use XML to define game levels at runtime. These configurations can then serve as different experimental conditions that are randomly deployed to online users.

Finally, one unusual new challenge in internet-scale research is the efficiency of subject-pool utilization. While lab- or school-scale researchers expend significant effort recruiting enough subjects to achieve statistical significance, internet-scale researchers increasingly face the challenge of using tens of thousands of subjects efficiently. Certain types of experimentation may result in inconsistent user experiences that reduce overall participation. Some challenges will be particular to individual experiments. For instance, in our online experiments we observe strong seasonal effects of weekends and school holidays, when the number of players is greatly reduced. This suggests that certain experimental comparisons should be sensitive to the time period of the study, not merely the number of subjects. Many of these challenges can be mitigated by validating the results of internet-scale experiments with controlled classroom experiments. As shown in the implementation example, we continue to run experiments at each scale based on findings from the other scales. This feedback loop will continue as we strive to optimize the games to maximize learning. We believe this framework will rapidly lead to significant discoveries that are replicable at each of the scales.

5. ACKNOWLEDGMENTS

We would like to thank the Pittsburgh Science of Learning Center, the DataShop staff, the Next Generation Learning Challenge, Carlow University, and Pellissippi State University for supporting this research.
