We consider the problem of predictive and causal modeling of data collected by courseware in online education settings, focusing on graphical causal models as a formalism for such modeling. We review results from a prior study, present a new pilot study, and suggest that novel methods of constructing variables for analysis may improve our ability to infer predictors and causes of learning outcomes in online education. Finally, several general problems for causal discovery from such data are surveyed along with potential solutions.
"1 Introduction. Scientists and engineers at the Apollo Group are developing an Individualized Learning Platform (ILP) for online education, the broad overview of which is illustrated by Fig. 1 [1]. The ILP is being constructed using insight from domain experts in cognitive and learning sciences while deploying a data-driven Intelligence Engine that takes input data or “signals†from the ILP and provides appropriate guidance to administrators, faculty, and students to better customize and individualize the online learning experience. A wide variety of information is provided to and recorded by the ILP, including information about learner and faculty context, aspects of curriculum, and so on. Coupling insight from educational theory with the Intelligence Engine will allow the Apollo Group to enhance learner satisfaction while improving learning outcomes and supporting other institutional goals. Ann Brown’s influential work [2] calls for a designbased empiricism, and the reader will ï¬nd Fig. 1 only slightly adapted from a diagram in that work. The ILP is designed around several core principles, one of which is that guidance be evidence-based [1]. This paper focuses on a candidate methodology to achieve this core objective, focusing on the discovery of causes of positive learning outcomes in the online education environment. The learning management system of the ILP will track student progress and activity in online courses. A central challenge is to determine the predictors, and especially causes, of student learning outcomes given these records of their activities and interactions. Fig. 1. Illustration of Apollo Group’s Individualized Learning Platform, reproduced from [1]. Predictive models are useful for purposes such as identifying students likely to withdraw from courses or otherwise have negative learning outcomes; such models rely on discovering “symptoms†in student behaviors and activity to predict likely outcomes. If we can identify, from passive observation, students “at risk†for negative learning outcomes, instructors can “flag†students to target existing resources toward them to rectify problems a student may be having. However, “symptoms†and predictors of learning outcomes need not identify causes of learning outcomes. When we acquire causal knowledge, we acquire the ability to predict values of variables post-intervention. Traditional statistical methods focus on predictive tasks, allowing us to forecast or classify from observed attributes of a unit (e.g., student) but not to reliably do so after manipulation of the environment (e.g., online courseware, methods of instruction, etc.). If we are able to identify causes, we can better design interventions to drive learning gain and other positive outcomes for students, knowing that post-intervention these changes will drive better learning outcomes. Further, such knowledge can lead to the development and engineering of better online learning platforms and environments. However, there are several hurdles to overcome to achieve such insight. We focus on the complexity of data collected in the online environment and begin to address the dearth of literature focusing on transforming log-style, transactional data collected in online courseware for use with causal discovery procedures (1). In the next section we sketch an extant framework for causal discovery from non-experimental datasets. In Section 3, we survey a past application of this framework in the online education domain. 
We outline the multi-faceted complexity of data collected in online education environments in Section 4. In Section 5, we describe a pilot study of data from an online graduate economics course and suggest in Section 6, based on the results of the pilot study, that we need to construct new measures of student behavior from underlying "raw" variables. In Section 7 we outline three remaining general problems for causal discovery in the online education domain, and we provide concluding remarks in Section 8.

(1) A noteworthy exception for educational data is [3], though their analysis is not directed specifically at causal discovery.

2 Causal Discovery from Observational Data

The data collected in the log files and databases that underlie online courseware are non-experimental, historical data. As a result, we are rarely in a position to learn causal relations in the paradigmatic manner of the sciences: namely, from randomized experimental data. Despite natural differences in courses from offering to offering and from instructor to instructor, we as investigators cannot reach into the past to intervene and experiment with courseware or other aspects of the online education experience.

Over the past twenty years, philosophers, statisticians, and computer scientists have pursued a large research program developing methods for the discovery of causal relations from observational datasets. Much of this research has focused on causal interpretations of probabilistic graphical models, specifically directed acyclic (2) graphs (DAGs) with associated probability distributions, called Bayesian networks ([7], [8]). Within this formalism, variables are represented by nodes in a graph, with edges between nodes representing direct causal relationships between variables.

(2) Feedback cycles (in time) can be modeled within the Bayes nets framework by, for example, deploying variables indexed by time. Some literature ([4], [5], [6]) focuses on the discovery of cyclic graphical models, though work on this topic is underdeveloped compared to the Bayes nets formalism.

Consider the graph of Fig. 2, modeling qualitative, hypothetical causal relationships among attributes of students in an imaginary online course. Several attributes might be particularly salient for the non-traditional students to whom online education and degree programs are appealing. We model relationships between hypothetical measures of employment, size of family (familySize), time obligations not related to a student's education (obligations), time spent studying, motivation, length of messages in an online discussion forum (messageLength), academic ability, and final exam score in a course (final). In our hypothetical model, student final exam performance has two direct causes: student ability and studying. Further, the model represents relationships among the determinants of a student's studying behavior.

Fig. 2. Graphical representation of hypothetical causal relationships for students in an online course.

If two crucial (usually reasonable) assumptions hold, namely the Causal Markov Axiom and the Causal Faithfulness Condition [7] (3), then the causal structure encoded in the Bayesian network graph implies a set of probabilistic (conditional) independence relations between the variables. The Causal Markov Axiom asserts (4) that, assuming there are no unmeasured common causes of the variables we consider, a variable is probabilistically independent of its non-descendants (non-effects) conditional on its direct causes.
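To make the Markov condition concrete, consider a minimal sketch in Python. The edge list below is our assumed reading of the hypothetical model of Fig. 2, included purely for illustration; the code enumerates, for each variable, the independence the axiom asserts.

```python
# A minimal sketch of the Causal Markov Axiom over an assumed edge list
# approximating the hypothetical model of Fig. 2 (the exact edges are an
# illustrative assumption, not a claim about the figure).
edges = [
    ("familySize", "obligations"), ("employment", "obligations"),
    ("obligations", "studying"), ("motivation", "studying"),
    ("motivation", "messageLength"),
    ("ability", "final"), ("studying", "final"),
]
nodes = sorted({v for e in edges for v in e})
parents = {n: {a for a, b in edges if b == n} for n in nodes}
children = {n: {b for a, b in edges if a == n} for n in nodes}

def descendants(n):
    """All nodes reachable from n via directed edges (its effects)."""
    seen, stack = set(), [n]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

# The axiom: each variable is independent of its non-descendants,
# conditional on its direct causes (parents).
for n in nodes:
    nondesc = set(nodes) - descendants(n) - {n} - parents[n]
    if nondesc:
        print(f"{n} _||_ {sorted(nondesc)} | {sorted(parents[n])}")
```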
The assumption of Faithfulness asserts that all probabilistic independencies actually observed occur only because of the absence of a direct causal relation. That is, conditional independence between variables does not occur by accident (via parameter settings that cancel out, for example).

(3) There is a substantial philosophical literature about the Causal Markov Axiom and the Causal Faithfulness Condition (e.g., [9], [10], [11], [12], [13], [14], [15]). I pass over this controversy, as these assumptions are standard in the causal learning framework deployed here.

(4) Assuming it is possible to represent the underlying causal structure as a directed acyclic graph.

Fig. 3. Hypothetical illustration of causal relations that could lead to a faithfulness violation.

To illustrate how a violation of this assumption might occur, consider in Fig. 3 a slight modification of the hypothetical causal relations among three variables from Fig. 2, in which the association represented by each arrow is positive or negative. Suppose we posit that increased family size has a negative impact on employment (a student is likely to work less as the size of his or her family increases) but that employment and family size both contribute to increased non-educational time obligations. The negative effect of familySize on employment combined with the positive effect of employment on obligations may, given appropriate (perhaps unlikely) parameter values representing the strengths of the causal relations, exactly "cancel out" the positive direct effect of familySize on obligations. Such "canceling" parameter values could lead us to believe that familySize and obligations are independent, despite the fact that there is a direct causal relation between the two. This judgment of independence despite a direct causal relation is a violation of faithfulness. (A numerical sketch of such a cancellation appears below.)

While a causal Bayesian network implies conditional independencies, this "graphs → independencies" mapping is many-one: for any particular graph G, there will typically be other graphs, though not all graphs, that imply the same (conditional) independencies and so are observationally indistinguishable from G. We are all familiar with the old maxim that "correlation does not imply causation." For example, if verbosity in online message forums (messageLength) and studying are correlated, this can be explained by messageLength → studying, by studying → messageLength, by messageLength and studying sharing a common cause (as they do in Fig. 2, motivation), or by a combination of these explanations. Multiple graphs can imply the same observed correlations and/or independencies. We can use observational data to learn a set of (observationally indistinguishable) causal structures: namely, exactly the set of possibilities that could have produced the observed pattern of independencies in the data. Causal Bayesian network structure learning algorithms, e.g., the PC algorithm [7] and GES ([16], [17]), will, under suitable assumptions, identify (the set of observationally equivalent graphs containing) the correct DAG in the large sample limit.
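Returning to the cancellation example of Fig. 3, the following minimal sketch simulates a linear-Gaussian version of that structure. The coefficients are hypothetical, chosen precisely so that the direct path and the indirect path cancel (0.4 + (-0.8 × 0.5) = 0).

```python
# A minimal numerical sketch of the faithfulness violation of Fig. 3,
# assuming a linear-Gaussian model with illustrative coefficients chosen
# so that the direct effect of familySize on obligations (+0.4) exactly
# offsets the indirect path through employment (-0.8 * 0.5 = -0.4).
import numpy as np

def residualize(y, x):
    """Residual of y after OLS regression on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(0)
n = 200_000
family_size = rng.normal(size=n)
employment = -0.8 * family_size + rng.normal(size=n)
obligations = 0.4 * family_size + 0.5 * employment + rng.normal(size=n)

# Despite the direct edge familySize -> obligations, the marginal
# correlation is (approximately) zero: an "unfaithful" independence.
print(np.corrcoef(family_size, obligations)[0, 1])  # ~ 0.0

# Conditioning on employment reveals the direct dependence:
r_f = residualize(family_size, employment)
r_o = residualize(obligations, employment)
print(np.corrcoef(r_f, r_o)[0, 1])                  # clearly nonzero
```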
Two rough intuitions illustrate the basic principles of search for graphical causal models from conditional independence relations. The first concerns "screening off" relations whereby, to take a simple three-variable example, two variables, say messageLength and studying, are correlated but become independent when we condition on a third variable, motivation; this conditional independence tells us that messageLength and studying are not directly causally related. Assuming there are no unmeasured common causes of messageLength, studying, and motivation, this conditional independence is explained by three possible causal structures: messageLength → motivation → studying; studying → motivation → messageLength; or messageLength ← motivation → studying. If we lack background knowledge, these three graphs are indistinguishable from observational data. However, if we assume students' motivation to be inherent, or at least temporally prior to their enrollment in a program and behavior in a course, then we can infer that motivation is a common cause of messageLength and studying.

The second intuition has us consider two independent variables that share a common effect. Suppose that a student's level of motivation and non-educational, time-consuming obligations are independent. We expect each of these student attributes to share a common effect in the amount of time a student devotes to study. Unconditionally, an instructor cannot infer anything about a student's motivation level from the knowledge that the student has many time-consuming obligations outside of the course in which they are enrolled; the instructor can, however, make inferences about a student's motivation when the student (honestly) reports how much they study. If we know that a student is studying a lot while juggling many obligations, we infer something about the student's motivation level, namely that it is high. We can similarly infer, from a student's report that they are highly motivated and yet are not studying as much as they would like, that they are likely dealing with many outside obligations. In both cases, when we condition on a common effect, two otherwise independent variables come to provide information about each other. In the graph of Fig. 2, this is represented by what is called a "collider," where arrows from motivation and obligations meet at studying (motivation → studying ← obligations). Assuming (however unlikely) that we omit no common causes, there are no other graphical structures that explain this pattern of conditional independencies (and dependencies). That is, we can orient edges into a "collider" when such circumstances arise as we search over conditional independence relations in a larger dataset. Since we also assume that graphs are acyclic, having oriented colliders, we can often orient further edges in ways that avoid creating "colliders" where they were not discovered via tests for conditional independence. Thus, we can often orient many edges and so make causal inferences from observational data alone. A constraint-based algorithm such as PC simply systematizes search over (sets of) variables in a dataset to determine the conditional independence relations that hold among the variables and produces the set of graphical structures that imply those relations.
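Both intuitions can be checked numerically. The sketch below assumes a hypothetical linear-Gaussian simulation of the relevant fragment of Fig. 2 (coefficients are illustrative assumptions) and exhibits screening off and collider-induced dependence via partial correlations.

```python
# A minimal sketch of both intuitions: motivation is a common cause of
# messageLength and studying, and studying is a common effect (collider)
# of motivation and obligations. All coefficients are illustrative.
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing each on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
n = 200_000
motivation = rng.normal(size=n)
obligations = rng.normal(size=n)                      # independent of motivation
message_length = 0.7 * motivation + rng.normal(size=n)
studying = 0.6 * motivation - 0.5 * obligations + rng.normal(size=n)

# Intuition 1: screening off. messageLength and studying are correlated,
# but conditioning on their common cause renders them independent.
print(np.corrcoef(message_length, studying)[0, 1])         # nonzero
print(partial_corr(message_length, studying, motivation))  # ~ 0.0

# Intuition 2: colliders. motivation and obligations are independent,
# but conditioning on their common effect induces dependence.
print(np.corrcoef(motivation, obligations)[0, 1])          # ~ 0.0
print(partial_corr(motivation, obligations, studying))     # nonzero
```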
3 A Simple Causal Model of Outcomes in an Online Learning Environment

The work below is certainly not the first to deploy methods for causal discovery in the online education domain. Scheines et al. [18], for example, focused on a set of variables relevant to students in an online causal and statistical reasoning course, including measures of:

- student background knowledge (pre: a measure of pretest abilities derived from GRE items),
- behavior in online courseware (print: a measure of the extent to which students print out online course material, and volqs: a measure of the number of interactive, voluntary understanding questions attempted within the courseware), and
- learning outcomes (quiz: an average of quiz scores over several course modules, and final: final exam grade).

They then used several causal Bayes net learning algorithms to develop a path analytic model (Fig. 4) for these variables, and found interesting links between the background variable, their behavioral variables, and learning outcomes.

Fig. 4. Linear path analytic model of student behavior and outcomes in an online causal and statistical reasoning course [18]; marginally significant edges are dashed.

Examining this model, student final exam performance is well predicted by the extent to which students "check their understanding," and printing out reading material is negatively associated with these self-checks. Controlling for other possible mediators, they still find a negative (though only marginally significant) effect of greater printing behavior on final exam score. They cautiously suggest interpreting this effect as due to differing study habits: students who print out material are less likely to engage the voluntary, interactive questions while studying, whereas students who do not print out material may be more likely to do so. Student behavior with respect to printing course material may also be indicative of other study habits, though those habits were unmeasured in this analysis. Thus, we find a fruitful deployment of causal discovery methods to identify behavioral and background attributes of students in the online environment that are predictive of (and, more cautiously, causally related to) learning outcomes.

4 The Complexity of Collected Data and Variable Construction

Having briefly explored basic principles of causal discovery, we consider a second, more distinctive challenge to discovering causal models, one that arises from the complexity of the data collected by online courseware systems. Most online courseware collects an enormous amount of data about a multitude of different phenomena, which leads to very high-dimensional datasets. The variables collected fall into three rough categories:

1. purely observed variables with (relatively) straightforward interpretations or meanings,
2. measured indicators of underlying "latent" phenomena, and
3. "raw" variables that require some form of construction to be interpretable or meaningful.

The first two categories are well treated in the literature on causal discovery, as well as in multivariate analysis in general. Further, latent variable modeling is an active area of research in the social science methodology (e.g., psychometrics) community; we briefly discuss procedures for the discovery of latent variable models later in this work. There is, however, little literature dealing with the third category of variables in a principled way with respect to causal discovery. Natural and social scientists construct variables frequently, but the approach taken is usually based either on significant, richly detailed background theory or on ad hoc guesswork.
Consider, for example, weather forecasting, in which prognostications are frequently made in terms of "high" pressure and "low" pressure weather systems. While these systems cover large geographic regions, their features (strength, size, speed, etc.) are constructed from a multitude of directly observed barometric pressure readings spread over those regions. Meteorologists' high and low pressure systems are instances of constructed variables, while the individual, localized barometric pressure readings are the raw variables from which such constructions arise.

In general, a host of situations call for principled, data-driven methods for constructing variables from underlying "raw" data. In many cases, we measure many variables, but it is not clear just what the causal variables of interest should be. While latent variable modeling is one way to potentially reduce the dimensionality of data, in some situations it is more appropriate to seek dimensionality reduction methods whereby we construct new measured variables as deterministic functions of "raw" measured variables. This difference between latent variable modeling (i.e., group 2 above) and the construction of variables via deterministic functions of "raw" variables is partly illustrated in Fig. 5.

Fig. 5. Illustrative example of the difference between latent variable modeling and deterministic variable construction. The larger rectangle envelops a discovery and estimation problem, the smaller rectangle a construction problem.

The larger rectangle of Fig. 5 illustrates a discovery and estimation problem within the framework of latent variable models. When deploying a latent variable model, the modeler must first decide (or discover) the appropriate causal structure relating latent variables to their manifest effects and then estimate the parameters ("factor loadings") quantifying the nature of these causal relationships. Conditional on the latent variable, each of its noisy, measured (or manifest) indicators (X1, X2, and X3) is independent of the other measures. Whether this condition is tested or assumed, it is usually called the assumption of "local independence."

Contrast this estimation problem with the heuristic illustration of a variable construction problem in the smaller rectangle of Fig. 5. Here, we call X1, X2, and X3 our "raw" variables and deterministically construct a new variable called scale from them. Since we are using a latent variable model to motivate the illustration, there are no direct connections between X1, X2, and X3, but this need not be the case in general. Whether or not the "raw" variables are unconditionally independent, they remain or become dependent when we condition on the deterministically constructed new variable. Consider the situation in which we construct scale as the sum of only X1 and X2, and assume X1 and X2 to be unconditionally independent. Given the information that scale takes on the value 10 (conditioning on scale) and that X2 takes on the value 7, we know the value of X1 to be 3. Thus, conditioning on scale, X2 provides us information about X1, so the two components of scale are conditionally dependent; the sketch below illustrates this induced dependence numerically. In situations in which latent variable models are deployed, scales like this are often constructed as well; this is just one special case of the general problem at hand.
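Because scale is here a deterministic function of its components, the induced dependence is extreme. A minimal sketch, assuming standard-normal X1 and X2 purely for illustration:

```python
# A minimal sketch of the induced dependence: X1 and X2 are independent,
# but conditioning on the deterministic construction scale = X1 + X2
# makes them (in this linear case, perfectly negatively) dependent.
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing each on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(2)
x1 = rng.normal(size=100_000)
x2 = rng.normal(size=100_000)
scale = x1 + x2

print(np.corrcoef(x1, x2)[0, 1])    # ~ 0.0: unconditionally independent
print(partial_corr(x1, x2, scale))  # -1.0: perfectly dependent given scale
```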
Of course, in general, not just any constructions will do. The problem we face is to reduce a set of "raw" variables {R1, ..., Rn} to some smaller set of variables {C1, ..., Ck}, via deterministic functions {f1, ..., fk} of the "raw" variables, in order to achieve some objective or goal. In the online education domain, we focus on predicting and identifying causes of learning outcomes as assessed by exam scores, course grades, or perhaps even another constructed variable (5) incorporating several aspects of learning outcomes. This search problem is clearly intractable in general; the search space must be constrained by some combination of background knowledge and a guiding objective function. Background knowledge may be rather general: for example, we might know that the relevant constructed variables will be linear combinations of the raw variables. Other forms of background knowledge may be domain- or application-specific, such as a specification of which "raw" variables are relevant for particular constructed variables and which may be disregarded. Objective functions, similarly, may take a multitude of forms. We might seek functions of raw variables that lead to the best prediction of a particular target variable. Alternatively, we might seek functions of raw variables that yield the greatest amount of causal knowledge with respect to some target variable. Any number of other objective functions may be appropriate in any number of situations, but note that the "best" constructed variables can change depending on the objective function. Once some sensible combination of background knowledge and objective function has been specified, we will (hopefully) have a space of functions that is searchable. Thus, we can search for variable constructions for particular purposes. We later consider a specific example of message forum data from an online learning environment to flesh out some possibilities for this program of research.

(5) In this work we principally focus on constructing predictors and causes of a given target variable, rather than constructing the target variable itself; the latter problem is briefly discussed later.

5 A Simple Pilot Study

One might plausibly wonder whether variable construction is actually required for successful prediction. We therefore first demonstrate that predictions about a set of students enrolled in an online, several-month graduate economics course can be improved through the use of constructed variables. We note at the outset, however, that an unqualified causal interpretation of the resulting DAG is tricky at best, though further research to find more plausible or appropriate "causal constructions" is ongoing. Nevertheless, a (graphical) representation of the probabilistic dependence structure for these variables can significantly improve predictions.

We focus on variables in the three rough semantic categories given in Table 1.

Table 1. Description of measured and constructed variables included in the pilot study.

Our categorization provides a rough time ordering. Background (including demographic) variables are those upon which we cannot in principle intervene but that might prove useful for predictive and/or classification purposes. Behavioral variables measure aspects of students' interaction with online courseware and are vital to the purpose of discovering the behavioral causes of student learning outcomes. These are also the variables most likely to require construction to be meaningfully interpreted. Learning outcome variables are assessed at specific times within the course in our example.
Two individual assignments are graded during the course, while an individual final exam is assessed at the end of the course. Final course grades are calculated from both the individual assessments and assessments of a student's work with a group of other students.

Data from a sample of 815 students are provided to the PC algorithm (6) along with time-ordering background knowledge. The algorithm returns a set of DAGs that imply the conditional independence relationships judged present in the data via statistical tests. One DAG is chosen, and a path analytic model is estimated according to that structure. This model is provided as Fig. 6 (7). Both the structure of the model and the estimated parameters characterize the qualitative and quantitative relationships among the variables under consideration.

(6) The algorithms deployed in this work are all implemented in the freely distributed Tetrad IV suite of algorithms, available at http://www.phil.cmu.edu/projects/tetrad.

(7) The model of Fig. 6 is judged to fit the data by a relevant statistical test comparing the implied covariance matrix of the model with the sample covariance matrix [19].

Fig. 6. Estimated linear path analytic model from our pilot study. Rectangles are placed around the two learning outcomes on which we focus.

As we are especially concerned with discovering the predictors and causes of learning outcomes, we focus on two particular learning outcome variables. The first is student final course grade (grade points). Here the model provides us with something of a "sanity check" of our method. Among the variables directly connected to grade points are those which constitute the basis on which the instructor assigns the final grade, including both assignment scores and the final exam score. Other influences are GPA and both message counts: the instructor's number of private messages to the student as well as the count of the student's public and group messages. If we take GPA to be a proxy for general student ability in online courses of this sort (which we implicitly do by treating GPA as a background variable rather than an outcome), then this seems a sensible picture of the predictors of final course grade. However, the final course grade may not be our best target for determining the causes of learning outcomes. After all, the same instructor provides the grades for the assignments as well as the final assessment via the course grade, and the final course grade is really (in part) just a function of these components. Further, the final course grade includes assessments of a student's group work, so an independent, objective assessment of individual learning outcomes would be helpful. This we find in our second learning outcome variable of interest, final exam points. Students in this sample had different instructors, but all took a final exam, provided by a textbook publisher, that was independently graded. This gives us a relatively clean, objective instrument for assessing learning outcomes with respect to the material of this online economics course. However, the set of variables directly connected to final exam points is relatively small. We find that the unmediated predictors of a student's final exam score are sex, GPA, and average score on other MBA course final exams. This may support our use of the latter two variables as proxies for ability, but we find no direct connections between this independent learning outcome assessment and the behavioral variables considered in this analysis.
This, of course, does not mean that behavior and learning outcomes are unrelated. Perhaps we have simply not constructed and included the appropriate behavioral variables; we must explore the possibility that we have failed to appropriately "carve up" our behavioral raw data. One crucial way that student behavior is captured in this environment is through the messages that students post in an online forum. We need to carefully consider ways in which we can construct variables out of these data. Our first pass included the ad hoc constructions studentPublicGroupMessageCount and instructorPrivateMessageCount; we did not find significant, unmediated links to final exam score, though some interesting relationships between demographic features, message behavior, and other variables were discovered. Given the richness and importance of student and instructor interactions via these messages, we choose forum message data as the illustrative example of the problem of variable construction. (8)

(8) viewChapterCount is also an ad hoc construction, from separate logs in the online courseware, that could just as easily be the target of investigation for principled variable construction and search.

6 The Construction of New Variables

Real, deployed online courseware collects data for every forum message posted by students in a given course. We focus here on only a handful of message characteristics to motivate the problem of variable construction search. For each message we know:

- the message creator,
- the message timestamp,
- the message content,
- the forum in which the message was posted, and
- whether the course facilitator judged the message "substantive" or not.

These messages can be organized by student (9) (as message creator, excluding messages posted by course instructors), and raw variables can be created that correspond to the message attributes. In a course with 50 students, the most prolific of whom posted 100 messages, this scheme results in 400 "raw" variables. A schema of the resulting data set is shown in Table 2.

(9) The reader might sensibly inquire why we choose to organize message data by student (as opposed to, for example, organizing the data by message). This choice provides an organization in which most variables in the data are roughly independently and identically distributed (i.i.d.), which is necessary for the causal discovery techniques we deploy. The data would not be i.i.d. were they organized by message, with a variable representing the message creator.

Table 2. Hypothetical table of forum message "raw" variables organized by student.

The form of the data (binary, real-valued, text, etc.), as well as background knowledge, informs the space of functions over which we might search. A plausible objective function is to optimize predictions, for each student, of measured learning outcomes, such as score on a final exam or course grade. We seek useful, meaningful constructed variables to incorporate into causal and/or predictive models. Numerous potential constructed variables arise even out of our simple toy example. A simple example involves word counts from the content fields for each student. Let C denote the indices of content fields in our dataset (C = {2, 6, 10, 14, 18, ..., 398}), and let X_{s,i} denote the value of field i for student s. We may then construct, for example, an average word count per posted message:

averageWordCount_s = ( Σ_{i ∈ C} wordCount(X_{s,i}) ) / ( Σ_{i ∈ C} I(X_{s,i}) )    (1)

where wordCount is a function that counts the words in a field of text, and I is the indicator function (taking value 1 when the content field is not empty, 0 otherwise).
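A minimal sketch of construction (1) follows, assuming a per-student row laid out with four fields per message in the order timestamp, content, forum, substantive flag; the layout, like Table 2 itself, is hypothetical.

```python
# A minimal sketch of construction (1): average word count per posted
# message, assuming each student's row stores four fields per message in
# the order (timestamp, content, forum, substantive). Field indices are
# 1-based here, to match the index sets C, F, and S in the text.
def average_word_count(row):
    content_indices = range(2, 399, 4)               # C = {2, 6, ..., 398}
    fields = [row[i - 1] for i in content_indices]   # 1-based -> 0-based
    posted = [f for f in fields if f]                # I(...): nonempty only
    if not posted:
        return 0.0
    return sum(len(f.split()) for f in posted) / len(posted)

# Hypothetical row for a student who posted two messages:
row = [None] * 400
row[1] = "Does anyone understand problem 2?"  # content of message 1 (index 2)
row[5] = "Thanks, that clears it up."         # content of message 2 (index 6)
print(average_word_count(row))                # (5 + 5) / 2 = 5.0
```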
Letting F denote the set of indices for variables that identify the particular forum in which a student posted (i.e., F = {3, 7, 11, ..., 399}), we can reproduce a variable from our pilot study:

studentPublicGroupMessageCount_s = Σ_{i ∈ F} I(X_{s,i} ∈ {public, group})    (2)

Letting S denote the set of indices for variables containing the binary "substantive message" flag, which takes value 1 when a message is substantive and 0 otherwise (i.e., S = {4, 8, 12, ..., 400}), another simple example is:

substantiveMessageCount_s = Σ_{i ∈ S} X_{s,i}    (3)

We might consider many alternatives; any number of functions can be deployed just on the content and timing of messages. Guided by background knowledge and the form of the data, we iteratively search over potential variable constructions and judge them via the resulting models in which they are used. We seek variable constructions, and models incorporating them, that maximize our ability to predict a student's final exam score and to infer causes of learning outcomes. Given the often sparse background knowledge connecting educational theory to causal inference from online courseware data, as well as the dimensionality, granularity, and complexity of those data, we are forced not only to search over potential causal structures that explain the data but also to search for the very variables that take part in that modeling. The search for more, and better, variable constructions is ongoing; a minimal sketch of one such search loop follows.
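As a closing illustration, the sketch below mirrors constructions (1)-(3) on hypothetical per-student message data and scores each candidate by cross-validated predictive accuracy against a final exam target. Every name and data-generating choice is an assumption for illustration, and the scoring objective is purely predictive; a full treatment would add causal objectives as well.

```python
# A minimal sketch of the construction-search loop over hypothetical
# per-student message data. Candidates mirror constructions (1)-(3);
# each is scored by cross-validated R^2 for predicting final exam score.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_students, n_messages = 50, 100

# Hypothetical raw message attributes per student and message:
words = rng.poisson(30, size=(n_students, n_messages))      # word counts
posted = rng.random((n_students, n_messages)) < 0.3         # message exists
forum_public = rng.random((n_students, n_messages)) < 0.5   # public/group forum
substantive = posted & (rng.random((n_students, n_messages)) < 0.4)

n_posted = posted.sum(axis=1)
candidates = {
    "averageWordCount": (words * posted).sum(axis=1) / np.maximum(n_posted, 1),
    "publicGroupMessageCount": (posted & forum_public).sum(axis=1).astype(float),
    "substantiveMessageCount": substantive.sum(axis=1).astype(float),
}

# Toy target in which substantive engagement matters (an assumption):
final_exam = 2.0 * candidates["substantiveMessageCount"] + rng.normal(0, 5, n_students)

def score(x, y):
    """Objective: mean cross-validated R^2 of a one-variable linear model."""
    return cross_val_score(LinearRegression(), x.reshape(-1, 1), y,
                           cv=5, scoring="r2").mean()

for name, values in candidates.items():
    print(name, round(score(values, final_exam), 3))
# Keep the best-scoring construction(s) for use in subsequent causal modeling.
```

In practice, the candidate set and the objective would be supplied by the background knowledge and goals discussed in Section 4, rather than fixed in advance as here.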