Mining Free-form Spoken Responses to Tutor Prompts

How can an automated tutor assess children’s spoken responses despite imperfect speech recognition? We address this challenge in the context of tutoring children in explicit strategies for reading comprehension. We report initial progress on collecting, annotating, and mining their spoken responses. Collection and annotation yield authentic but sparse data, which we use to synthesize additional realistic data. We train and evaluate a classifier to estimate the probability that a response mentions a given target.

Table 1: Predictors used in the logistic regression model.

To combine this information, we use binomial (or binary) logistic regression, which estimates the probability of an event Y as a logistic function of a set of input predictors X1, X2, …, Xn. In our case, Y = 1 iff a target occurs in an utterance, and X1, …, X5 are the five predictor variables in Table 1. The logit (i.e., the logarithm of the odds) of the target occurring is modeled as a linear function of the Xi, as shown in Equation 1:

logit(Pr(occur)) = ln[Pr(occur) / (1 − Pr(occur))] = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5    (1)

Here Pr(occur) is the probability that the target occurs in the utterance, β0 is the intercept, and β1, …, β5 are the respective regression coefficients for the predictors in Table 1. The regression coefficient for each predictor describes the change in the logit associated with a unit change in that predictor. A positive (negative) β means that an increase in the predictor increases (decreases) the probability of the outcome. To make the different β's comparable, we first normalize the input predictors to range from 0 to 1, so that the absolute value of each β measures the impact of that predictor relative to the others. Given Pr(occur) for a target, we decide whether the target occurs by comparing Pr(occur) to a threshold, e.g. 0.5: we decide yes if Pr(occur) exceeds the threshold, and no otherwise.

We use a logistic regression model for several reasons. First, it is compact to represent, fast to compute, and easy to interpret. Second, unlike linear regression, it does not assume normally distributed variables. Third, rather than a binary judgment as to whether the target occurs, it outputs a probability that a tutor could use to decide more judiciously which feedback to provide. For example, if the tutor thinks the student said the target but is not very confident, it should hedge its reply rather than praise an answer that may well be wrong. Finally, logistic regression outperformed the alternatives we compared it to: in cross-validation tests, it achieved higher precision, recall, and AUC (described in Section 4) than a Naïve Bayes classifier or a J48 decision tree.

We used Weka 3.5.7 (from weka.sourceforge.net) to train the logistic regression model on the 9891 synthetic examples. As noted earlier, the class distribution of the synthetic training data is skewed, with 9547 negative examples but only 344 positive examples for the 21 targets defined. In contrast, the 64 held-out authentic utterances are more balanced, comprising 30 positive instances and 34 negative instances. Differences in class distribution between training data and test data can hurt classifier performance, for instance by biasing the classifier against a class that is rare in the training set but common in the test set. To address this problem, we used Weka's cost-sensitive classification mechanism to balance the training data so that its distribution of positive and negative instances resembles the distribution of the authentic data. Table 2 shows the resulting β parameter estimates for our five predictors.

Table 2: Parameter estimates of the logistic regression model.

As Table 2 shows, all predictors are positively correlated with the odds that the target occurs, but acoustic confidence is the strongest predictor. Although one might expect long responses to be likelier to contain the target than short responses, the UttDur predictor is very weak, probably because we measured it by the size of the audio recording. This recording includes the tutor prompt in the background, so its size reflects the combined duration of the prompt and the student's utterance.

4 Evaluation

We tested our logistic regression model on both synthetic and authentic data. We used 10-fold cross-validation on the synthetic training data, and we also evaluated the model on the 64 authentically labeled utterances we had held out as test data. We compared against a majority-class baseline model, which simply predicts the most common class for all instances. Table 3 compares model performance on both data sets.

We evaluate the classifiers on several metrics. Overall accuracy is the fraction of cases classified correctly, i.e. (# TP (true positive) + # TN (true negative)) / # total cases, so it reflects the class distribution. The TP rate, also called sensitivity or recall, is the fraction # TP / (# TP + # FN) of actual positive cases correctly classified as positive. The FP (false positive) rate is the fraction # FP / (# TN + # FP) of actual negative cases misclassified as positive. Its complement, called specificity, measures the fraction of actual negative cases correctly classified as negative. All these metrics depend on the probability threshold for classifying a case as positive – namely 0.5 for our model.

Cross-validation of the majority-class baseline shows very high accuracy and a zero FP rate because the synthetic data is highly skewed toward negative examples; its accuracy is much lower on the authentic data. More importantly, such a classifier is useless because it cannot detect any mention of the target: its TP rate is 0. In contrast, the logistic regression model is much more sensitive to positive examples.

Table 3: Model performance under different testing options.

In practice, for the probabilistic output Pr(occur) of the logistic regression model to be useful, we need to turn the probabilities into discrete decisions so as to provide tutorial feedback accordingly. For example, if the tutor is very sure that the target did not occur, it should give corrective feedback; but if it is not sure, then a hedged reply is probably preferable. With this intuition, we adopt a preliminary division of Pr(occur) into 3 disjoint regions, based on two threshold values th and tl (0 < tl < th < 1):

• Yes: confident that the target occurred in the utterance (Pr(occur) ≥ th);
• No: confident that the target did not occur in the utterance (Pr(occur) ≤ tl);
• Unsure: neither (tl < Pr(occur) < th).

These thresholds control the tradeoff between coverage and precision. The higher the value of th, the fewer Yes decisions the tutor will make, but the more confident it can be of those decisions (assuming we have a reasonable model). On the other hand, the tutor will hedge more of its feedback, presumably making it less helpful to students. To describe this tradeoff, Table 4 shows model coverage and precision on the set of 64 authentic responses for various threshold values. In the table, Pr(Yes) and Pr(No) denote the probability of outputting a Yes and a No decision, respectively. Precision is the proportion of Yes (No) decisions that are in fact correct, i.e., positive (negative) examples. For example, with high_threshold = 0.9 the tutor will decide only about 14% of the time that Yes, the student mentioned the target – but roughly 89% of these decisions will be correct. By dropping this threshold to 0.5, it can decide Yes more than twice as often – almost 30% of responses – and still be right about 84% of them.
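
To illustrate how these thresholds translate into decisions and into the coverage and precision figures of Table 4, here is a small assumed helper in the same spirit (Python, with invented probabilities and gold labels), not the authors' evaluation code:

    # Sketch only: map Pr(occur) to Yes / No / Unsure and measure how often,
    # and how reliably, each definite decision is made.
    def decide(pr_occur, t_low=0.1, t_high=0.5):
        if pr_occur >= t_high:
            return "Yes"     # confident the target occurred
        if pr_occur <= t_low:
            return "No"      # confident the target did not occur
        return "Unsure"      # hedge the tutor's feedback

    def coverage_precision(probs, labels, t_low=0.1, t_high=0.5):
        """Coverage Pr(Yes), Pr(No) and the precision of each decision."""
        yes = [lab for p, lab in zip(probs, labels) if p >= t_high]
        no = [lab for p, lab in zip(probs, labels) if p <= t_low]
        pr_yes, pr_no = len(yes) / len(probs), len(no) / len(probs)
        prec_yes = sum(lab == 1 for lab in yes) / len(yes) if yes else None
        prec_no = sum(lab == 0 for lab in no) / len(no) if no else None
        return pr_yes, prec_yes, pr_no, prec_no

    # Invented Pr(occur) values and gold labels (1 = target actually occurred).
    probs = [0.92, 0.63, 0.48, 0.30, 0.07, 0.02]
    labels = [1, 1, 0, 1, 0, 0]
    print([decide(p) for p in probs])
    print(coverage_precision(probs, labels))

Sweeping t_low and t_high over a range of values on the held-out responses yields the kind of coverage/precision tradeoff summarized in Table 4.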

Table 4: Model coverage and precision with different threshold values.

Table 4 provides guidance both about where to set the threshold values and about how definitively to phrase tutor feedback. For example, it indicates that precision for Yes decisions is roughly the same (81%–89%) for thresholds from 0.5 to 0.9, so the tutor may as well set high_threshold at 0.5 (or possibly even lower) in order to decide Yes more often, but its feedback must reflect that the student response probably contains the target but may well not. For example, the tutor might refrain from confirming the answer as correct, but still treat it as correct in updating its student model.

In contrast, precision for No decisions is much more sensitive, ranging from 71% to 100% as low_threshold varies from 0.5 down to 0.1 – but with coverage ranging from over 70% down to below 11%. So the tradeoff between coverage and precision differs for the No case. If our authentic training data is representative, setting low_threshold to 0.1 will avoid any false rejections, allowing definitively phrased corrective feedback. However, at this threshold value, the tutor will decide No less than 11% of the time, even though the target will be absent about half the time. On the other hand, a value of 0.5 will let the tutor decide No for 70% of student responses, but only 71% of those decisions will be correct. In this case, tutor feedback must be phrased to avoid characterizing the student response as wrong.

5 Contributions and future work

This paper formulates the general problem of extracting reliable, tutorially useful information from children's free-form spoken responses despite imperfect speech recognition, so as to assess their comprehension and select appropriate tutor feedback. We focus on the simpler but common and useful case of estimating the probability that an utterance mentions a given target concept. We describe efficient methods to collect authentic student data labeled by expert tutors, and to expand it into a much larger set of synthetic yet realistic data. We present a logistic regression model to estimate the probability of a target by combining features of the target and utterance with the acoustic confidence output by a speech recognizer. We cross-validate the accuracy of the resulting probability estimates on synthetic data, and evaluate it on a smaller held-out set of authentic data.

Concept mention is just one useful feature for tutors to detect. We need to extend it to handle synonyms, but we have already extended it (in work omitted here to save space) from the single-target problem addressed in this paper to the multiple-target problem of deciding whether an utterance mentions any, all, or none of N given targets. Another useful feature is the distinction between confident and tentative responses [6, 7]. Other distinctions in our expert tutor's annotations include correct vs. incorrect, vague vs. detailed, and answered easily vs. with difficulty. Future work includes using these distinctions to update student models and guide tutor decisions.
