In this paper we present an evaluation of new techniques for automatically detecting sentiment polarity (Positive or Negative) in students' responses to Unit of Study Evaluations (USE). The study compares a categorical model and a dimensional model, both making use of five emotion categories: Anger, Fear, Joy, Sadness, and Surprise. Joy and Surprise are mapped to the Positive polarity, whereas Anger, Fear, and Sadness are mapped to the Negative polarity. We evaluate the performance of the category-based and dimension-based emotion prediction models on 2,940 textual responses. In the former model, WordNet-Affect is used as a linguistic lexical resource and two dimensionality reduction techniques are evaluated: Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF). In the latter model, ANEW (Affective Norms for English Words), a normative database of affective terms, is employed. Despite using generic emotion categories and no syntactic analysis, the NMF-based categorical model and the dimensional model perform above the baseline.
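As a minimal sketch of the category-to-polarity mapping just described (the identifiers below are illustrative, not taken from the paper's code):

```python
# Illustrative mapping of the five emotion categories onto binary polarity,
# as stated above; names are hypothetical, not the authors' implementation.
EMOTION_TO_POLARITY = {
    "Joy": "Positive",
    "Surprise": "Positive",
    "Anger": "Negative",
    "Fear": "Negative",
    "Sadness": "Negative",
}

def polarity_of(emotion: str) -> str:
    """Collapse one of the five emotion categories into a binary polarity."""
    return EMOTION_TO_POLARITY[emotion]

assert polarity_of("Joy") == "Positive"
assert polarity_of("Fear") == "Negative"
```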
The USE questionnaire comprises twelve statements:

1. The learning outcomes and expected standards of this unit of study were clear to me.
2. The teaching in this unit of study helped me to learn effectively.
3. This unit of study helped me develop valuable graduate attributes.
4. The workload in this unit of study was too high.
5. The assessment in this unit of study allowed me to demonstrate what I had understood.
6. I can see the relevance of this unit of study to my degree.
7. It was clear to me that the staff in this unit of study were responsive to student feedback.
8. My prior learning adequately prepared me to do this unit of study.
9. The learning and teaching interaction helped me to learn in this unit of study.
10. My learning of this unit of study was supported by the faculty infrastructure.
11. I could understand the teaching staff clearly when they explained.
12. Overall I was satisfied with the quality of this unit of study.

Eleven items (I1-I11) focus on students' experience and one item (I12) on student satisfaction. Students indicate the extent of their agreement with each statement on a 5-point Likert scale: 1 - strongly disagree, 2 - disagree, 3 - neutral, 4 - agree, and 5 - strongly agree. Below each statement there is a space asking students to explain their rating. Question 4 has a different sentiment structure and was therefore removed from this study.

The USEs of subjects taught by two academics, collected over a period of six years, were used to create the dataset. After removing responses to question 4, the dataset contains a total of 909 questionnaires (each with 11 ratings). Out of the 9,999 possible, students provided 3,008 textual responses (each expected to be an explanation of a rating), a textual response rate of 30.1%. From these we removed internal references (e.g. 'see above') and meaningless text (e.g. '?').

The textual data has two characteristics that may significantly affect the classifiers. First, the sentences are hand-written in an informal style, containing spelling errors, abbreviated non-dictionary words, and hard-to-read text. The lack of proper grammar makes it extremely challenging to use part-of-speech (POS) tagging or other computational linguistic approaches. Examples include: "Computers in labs too slowk no lecture notes" (spelling mistakes and broken grammar) and "tutes were overcrowded, stopping teacher / student interaction" (non-standard words). For these reasons, the techniques used in the experiments are based on the bag-of-words assumption (so word order is not used), and we do not apply POS tagging, which would require relatively correct grammar.

Table 1. Number of comments and sample comments for each sentiment.

4 Experiments and Results.

In all approaches we remove stop words and use stemming, and the Text to Matrix Generator (TMG), a Matlab toolkit [15], is used to generate the term-by-sentence matrix. A minimal sketch of this preprocessing appears below.
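As a rough illustration of this pipeline, the following Python sketch builds a term-by-sentence count matrix with stop word removal and stemming. It is a hypothetical stand-in using scikit-learn and NLTK, not the TMG/Matlab implementation used in the study.

```python
# A minimal sketch of the bag-of-words preprocessing, assuming scikit-learn
# and NLTK are available; the actual study used the TMG toolkit in Matlab.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base = CountVectorizer(stop_words="english")   # built-in English stop list
tokenize = base.build_analyzer()               # lowercases, tokenizes, drops stop words

def stemmed_analyzer(doc):
    # Bag-of-words: word order is discarded; each surviving token is stemmed.
    return [stemmer.stem(tok) for tok in tokenize(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)

comments = [  # invented examples in the style of the USE responses
    "Computers in labs too slow, no lecture notes",
    "tutes were overcrowded, stopping teacher / student interaction",
]
X = vectorizer.fit_transform(comments)  # sentence-by-term counts
term_by_sentence = X.T                  # transpose to term-by-sentence, as in TMG
print(term_by_sentence.shape)           # (number of terms, number of sentences)
```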
The following five approaches are implemented in Matlab: one categorical model with two variants, corresponding to the two dimensionality reduction methods (LSA and NMF); one dimensional method; and, for evaluation purposes, a Majority Class Baseline (MCB) and Keyword Spotting (KWS). Two similarity comparison methods are implemented for each model.

• Majority Class Baseline (MCB): a classifier that always predicts the majority class, which in this dataset is Positive across all sentiment classifications.
• Keyword Spotting (KWS): a naïve approach that counts the presence of obvious affect words such as "frustrating" and "satisfaction", extracted from WordNet-Affect for the five emotion categories.
• CLSA: LSA-based categorical classification.
• CNMF: NMF-based categorical classification.
• DIM: dimension-based estimation.

Five emotion categories are used (Anger, Fear, Joy, Sadness, and Surprise), with Joy and Surprise assigned to the positive class and Anger, Fear, and Sadness to the negative class. The negative emotion Disgust is excluded because it is similar to Anger and would bias the sentiment classes. Likewise, ratings of strongly agree and agree are mapped to positive, and strongly disagree and disagree to negative. The number of sentences for each rating and sentiment used in our experiment is shown in Table 1, together with sample comments from the annotated corpus.

Table 2 shows the precision, recall, and F-measure values obtained by the five approaches for the automatic classification of the three sentiments. The highest results are marked in bold for each individual class. We do not report accuracy because the categories are imbalanced (see Table 1): the accuracy metric does not provide adequate information in this setting, whereas precision, recall, and F-measure can effectively evaluate classification performance on imbalanced datasets [16].

Table 2. Sentiment identification results.

As can be seen from the table, the performance of each approach depends on the sentiment category. For the positive class, which has the largest number of sentences, MCB and CNMF achieve the best detection performance in terms of recall and F-measure, while DIM achieves a distinctly higher precision than all the other classifiers. DIM gives the best results for the negative class. For the neutral class, KWS shows the best performance with respect to recall and F-measure, while CNMF clearly outperforms the others in precision.

Figure 1 shows the distribution of the USE comments in the 3-dimensional and 2-dimensional sentiment space.

Figure 1. Distribution of the USE dataset in the 3-dimensional (left) and 2-dimensional (right) sentiment space. Each 'x' denotes the location of one comment according to its valence, arousal, and dominance.

A notable aspect of the USE data is that there are some inconsistencies between students' ratings and their written responses, illustrated with examples in Table 3. For instance, the third row is unambiguously negative, yet the student rated it as neutral. All approaches are therefore handicapped in recognizing sentiments by this peculiarity of the data. Another factor that makes automatic classification difficult is that none of the classifiers is specific to the education domain; we speculate that the mediocre performance of the methods is partly due to poor coverage of the features found in educational text.

Table 3. Sample feedbacks from misclassified results. (Positive values are those rated 4 or 5, neutral 3, and negative 1 or 2.)

Table 4 shows the overall precision, recall, and F-measure of MCB, KWS, CLSA, CNMF, and DIM from two averaging perspectives: micro-averaging and macro-averaging. The notable difference between the two is that micro-averaging gives equal weight to every sentence, whereas macro-averaging gives equal weight to every category, as illustrated in the sketch below.
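To make the difference concrete, here is a small, self-contained illustration of the two averaging schemes; the gold and predicted labels are invented, and scikit-learn's metrics stand in for the Matlab evaluation code.

```python
# Micro- vs. macro-averaged precision, recall, and F-measure on toy labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["pos", "pos", "neg", "neu", "pos", "neg"]  # hypothetical gold sentiments
y_pred = ["pos", "neu", "neg", "neu", "pos", "pos"]  # hypothetical predictions

# Micro-averaging pools all sentences before computing the metrics,
# so every sentence carries equal weight.
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"micro: P={p:.2f} R={r:.2f} F={f:.2f}")

# Macro-averaging computes the metrics per category and then takes their
# unweighted mean, so every category carries equal weight regardless of size.
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"macro: P={p:.2f} R={r:.2f} F={f:.2f}")
```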
From this summary table, we can see that MCB, KWS, and CLSA perform less effectively, with somewhat lower evaluation scores than CNMF and DIM. Under macro-averaging, CNMF is superior to the other classifiers in precision, while DIM surpasses the others in recall and F-measure. Under micro-averaging, on the other hand, DIM has the best precision and CNMF performs best on F-measure. Overall, CNMF and DIM vie with each other in precision, recall, and F-measure, and the best F-measure under each averaging scheme is obtained by either CNMF or DIM. In all experiments, KWS is inferior to CNMF and DIM, as well as to CLSA. This result implies that keyword spotting techniques cannot handle sentences that evoke strong emotions through underlying meaning rather than through affect keywords. We can also infer that the models with non-negative factors (CNMF and DIM) are well suited to dealing with text collections. In summary, the NMF-based categorical model and the dimensional model show the better sentiment recognition performance overall.

The most frequent words used by students to describe aspects of their experience include terms such as labs, lecturer, lectures, students, tutors, subject, and work. When we remove these terms, the words most frequently used to describe positive experiences include: good (n=263), helpful and helped (n=183), online (n=79), and understand (n=49). Those used to describe negative experiences include: hard (n=72), understand (n=67), and time (n=47). Neutral experiences contain a combination of both. These word lists are obtained from CNMF and DIM because, as noted above, these two classifiers have the better overall performance. Stemming was not used for this analysis since, in this particular corpus, it might hide important differences, such as that between 'lecturer' and 'lecture'.

Table 4. Overall average results.

5 Discussion.

This paper described a dataset of ratings and textual responses from student evaluations of teaching, and evaluated sentiment analysis techniques for automatically rating the textual responses as positive, negative, or neutral using the students' ratings as labels. In particular, we compared the performance of a categorical model and a dimensional model, each of which makes use of a different linguistic resource. The results highlight that the NMF-based categorical model and the dimensional model perform better than the other approaches. Moreover, despite the lack of a set of emotion categories tailored to this task, the two emotion lexicons (WordNet-Affect and ANEW) promise to be useful in such sentiment classification tasks.

While the two models and two lexicons are promising for identifying sentiment, challenges remain. We believe that the affective expressivity of text rests on more complex linguistic features, such as morphological features. Hence, we intend to draw on Natural Language Processing (NLP) techniques to recognize fine-grained emotions in future work. Future work will also include extending the corpora with more student evaluations, which should yield more reliable results. The categorical model should be evaluated with a set of emotion categories better grounded in the educational research literature; we suspect that the literature on motivation would be particularly useful.
With regard to the use of normative databases in the dimensional model, we are aware that the terms in ANEW are not the best suited to the vocabulary that students use to describe their experiences, but we are not aware of any more appropriate database.

Acknowledgement. This project was partially funded by a TIES grant from the University of Sydney.