"Simulated environments for learning are becoming increasingly popular to support experiential learning in complex domains. A key challenge when designing simulated learning environments is how to align the experience in the simulated world with real-world experiences. Social media resources provide user-generated content that is rich in digital traces of real-world experiences. People's comments, tweets, and blog posts in social spaces can reveal interesting aspects of real-world situations or can show what a particular group of users is interested in or aware of. This paper examines a systematic way to analyze user-generated content in social media resources to provide useful information for learning simulator design. A hybrid framework exploiting Machine Learning and Semantics for social group profiling is presented. The framework has five stages: (1) retrieval of user-generated content from the social resource; (2) content noise filtration, removing spam, abuse, and content irrelevant to the learning domain; (3) deriving individual social profiles for the content authors; (4) clustering of individuals into groups of similar authors; and (5) deriving group profiles, where interesting concepts suitable for use in simulated learning systems are extracted from the aggregated content authored by each group. The framework is applied to derive group profiles by mining user comments on YouTube videos. The application is evaluated in an experimental study within the context of learning interpersonal skills in job interviews. The paper discusses how the YouTube-based group profiles can be used to facilitate the design of a job interview skills learning simulator, considering: (1) identifying learning needs based on digital traces of real-world experiences; and (2) augmenting learner models in simulators based on group characteristics derived from social media."
"1. INTRODUCTION. 1.1 Facilitating Learning Simulator Design. Simulated environments for experiential learning, or learning simulators, create a practical and social context in which new skills can be learned, applied and mastered. These environments are increasingly popular as a means to turn experience into knowledge, and are being applied in a variety of domains and learning contexts. A key success factor for learning simulators is the ability to connect learners to the real job-world, helping them to recognize what they need to learn. However, existing learning simulators suffer from the lack of such ability. This is because they incorporate a limited understanding of the learner based on skills and knowledge acquired only within the simulated world and disconnected from the learners’ real job experiences. The poor understanding of the learner needs is mainly due to the limited scope of the initial learner model, where the simulator scope for observing the learner is constrained within the particular application. Such a limited learner model leads to missing key learning aspects. These aspects may include learning domain concepts within a real world situation, which the learner, or people who share similar characteristics with him, are either aware of, or need to know more about. For example, the interpersonal skills and body language signals that job applicants should be aware of during job interviews. Deriving user profiles that include these learning aspects may help in facilitating the design of the learning simulators in two main aspects: 1. The derived profiles can provide a means for the trainers or the content providers to identify key learning needs for the simulator learners. 2. The learning domain concepts derived from the profiles can form a rich resource for augmenting the limited learner model to overcome the classic ‘cold start’ problem in user adaptive environments. 1.2 Motivation: Mining Social Media to Support Learning. 
Social media stand for a new culture of participation on the Web. In the last decade, people have become increasingly involved in contributing and shaping content on the Web. In social media, people may comment on multimedia objects like YouTube videos and Flickr images, share their thoughts on micro-blogging systems like Twitter, publish their bookmarks on Delicious, or use services like CiteULike to organize scientific publications they are interested in. With the increasing popularity of social media resources, it is likely that people leave authentic digital traces of their profiles and real-life experiences in the domain of their interests on social media sites [1]. Mining such digital traces promises to be very beneficial for various applications [5]. The motivation for this work arises from the challenge of mining social media resources to bridge the gap between the simulated world in learning simulators and actual experiences in the real job-world, and to better understand the learner’s needs in the context of the learning domain of interest and within the real-world community the learner belongs to. This challenge leads to several research questions that we would like to answer with this work, namely: - How can the digital traces in social spaces be mined to derive profiles of user groups? - Can the derived group profiles be used to identify key learning needs for potential learners? - Can the derived group profiles be used to augment the learner model with useful learning domain concepts? Figure 1. Generic Framework to Derive Group Profiles from Social Media Resources to Facilitate Learning Simulators Design. 
To answer the questions, this work presents a novel hybrid framework by combining knowledge representation as provided by semantics with knowledge discovery as exploited by machine learning, for mining the user-generated content retrieved from social media resources to derive profiles of user groups that can be further exploited to facilitate the design of learning simulators. The framework is perceived as a key contribution to mining social media in learning analytics to support learning simulators design. The rest of the paper is organized as follows: In the next Section, we introduce a novel, semantically-enriched machine learning framework that derives social group profiles from social media to facilitate design of learning simulators. In Section 3, the framework is instantiated within a case study to derive group profiles from mining user comments on YouTube videos. In Section 4, an experimental study is conducted to evaluate the application of our framework within the YouTube-based content. In Section 5, we position our work in the relevant literature on mining social media to support learning. The paper concludes in Section 6, pointing at requisites for replicating the study and future work. 2. DERIVING GROUP PROFILES FROM SOCIAL MEDIA: GENERIC FRAMEWORK. To address our first research question, we introduce a hybrid framework that integrates supervised machine learning represented by classification, unsupervised machine learning represented by clustering, and semantics. Figure 1 depicts the generic framework processes and the flow of data processing. The processes are described as follows: 1. User-Generated Content Retrieval: The digital traces that users create on the selected social media resource are retrieved using a content search process. Search is tailored toward retrieving content that falls within the context of the learning domain of interest. This is dependent of the social media resource selected. 
For example, to retrieve the user-created discussions on job interviews from YouTube, a collection of videos on job interviews can be selected by domain experts (e.g., expert job interviewers) and the comments on these videos can be retrieved using the YouTube API. 2. Semantically-enriched Classification Model: Because social media content is full of noise, such as spam, abusive content, and content that is irrelevant to the learning domain of interest, we introduce a supervised machine learning process that classifies the content as relevant or noisy and selects the relevant content for subsequent processing. The process employs a semantically expanded Bag of Words (BoW) and a mathematical content scoring model described in [3] to score and label the training data set used for building the classification model that filters the noisy social media content. 3. Semantic-driven Individual User Profiling: Using the filtered user-generated content, this process constructs a social user profile for every content author by merging all the content written by the author into a term weight vector representation. A semantically-enriched filtration layer consisting of concepts relevant to the learning domain is used to represent the content author. Using this layer, the domain-relevant terms that the author used in his content are extracted and included in his social profile. Furthermore, any demographic information that the author has entered in his social media user profile is retrieved and integrated into his user profile. 4. Group Clustering Model: An unsupervised machine learning process is exploited to cluster the social user profiles generated in process 3 into groups of users based on the similarity of their term weight vector representations. Clustering validity measures are used to determine the number of distinct groups that best represents the users based on the content they wrote. 5. 
Group Profiling: The learning domain concepts that the content authors who belong to each derived group are interested in or aware of are extracted from the cluster centroid of that group. In addition, statistical analysis is performed on the demographic information of the authors. The profile of each group is derived using the extracted domain-relevant concepts and statistical distributions of the authors’ demographics. Figure 2. Instantiated Framework to Derive Group Profiles from User Comments on YouTube Videos to Facilitate Design of Simulated Environments for Learning. Interpersonal Skills in Job Interviews. In the next section, we illustrate how the generic framework can be exemplified within the context of a specific social media resource, namely the YouTube video sharing site. For this, we present a YouTube-driven instantiation of the framework to derive social group profiles from YouTube-based corpus. 3. DERIVING GROUP PROFILES FROM YOUTUBE: INSTANTIATED FRAMEWORK. YouTube has become the most successful Internet website providing a new generation of short video sharing service. Since its establishment in early 2005, more than 100 million videos are being watched every day on YouTube, making it to rank second in traffic among all the websites in the Internet by the survey of Alexa [6]. YouTube provides several social tools for community interaction, including the possibility to comment on published videos. The analysis of comments constitutes a potentially interesting data source to mine for obtaining implicit knowledge about users, videos, categories and community interests [17]. YouTube allows the batch retrieval of these comments with its public API. In addition to comments, YouTube enables registered users to create individual user profiles on the site. A YouTube user profile contains information about a user, such as the user's age, gender, country of residence, hobbies, occupation, or favorite books, music and movies. 
Any personal information that appears in a user profile feed will have been entered by that user for publication on YouTube. The YouTube Data API allows you to retrieve user profiles. Motivated by the data acquisition and analysis opportunities that mining YouTube corpus can bring, we present a YouTube-driven instantiation of our framework to derive group profiles purely based on YouTube corpus. The derived group profiles are expected to provide two main functions: (i) help training professionals to identify key learning needs that can be considered in the storyboarding design of learning simulators, and (ii) identify learning domain concepts that can be augmented in the model of a learner who shares similar demographics with all users in one group. Figure 2 illustrates the components of this YouTube-driven instantiation. To demonstrate the process flow of the framework instantiation, we aim to derive group profiles that facilitate the design of learning simulators that train users to be aware of the various interpersonal communication skills for the job interview. Here, users can be either inexperienced job interviewers, or job applicants. Both job interviewers and job applicants need to know more about the different interpersonal communications during the interview session in order to improve their job interviewing skills, or increase their chances of getting hired after conducting the job interviews, respectively. The framework process flow consists of 5 main stages: Stage 1 - Video Selection: Uploaded YouTube videos about Job Interviews are selected for the retrieval of their public comments. Selection of the videos can be based on Domain expert selection, or keyword-based search of the YouTube API. The YouTube API provides many search ‘filters’ to improve the relevance of the search results. This includes filtering search results to show only videos that match a given set of categories and/or user-defined keywords. 
Each video can have many keywords but can only be associated with one YouTube category. Stage 2 - Comment Retrieval: All the public comments on the videos selected in Stage 1 are retrieved using the YouTube API. Stage 3 - Filtration of Noisy Comments: Using the semantically-enriched machine learning technique described in [3], a YouTube noisy comment filtration component is implemented to predict the noisy comments and filter them out from the retrieved comment set. The predictive model is trained using a corpus retrieved from YouTube and classified into two classes, relevant and noisy, using a novel mathematical scoring model and a semantically expanded Bag of Words that consists of concepts highly relevant to the job interview domain. Noisy comments are those that (i) contain too few job interview-related concepts; (ii) are totally irrelevant to Job Interviews; or (iii) are spam or abusive comments. The remaining comments are those that contain a considerable rate of Job Interview-related concepts. Table 1 shows four example comments on the left that have been predicted as noisy by the noise filtration component. The first three do not comment on the job interview video being watched, whereas the fourth one is spam. The machine learning-based component agreed with this assessment, classifying them as noise. The two comments on the right clearly contain concepts that are relevant to the job interview activity. Therefore, the component classified them as relevant. Table 1. YouTube Comments, with their Scores and Labels Determined by the Machine Learning Component. Stage 4 - Retrieving YouTube User Profiles: Given the subset of comments that have passed the noise filtration component, the YouTube Data API is used for each comment in order to retrieve the demographic characteristics of the comment author. Firstly, given the comment entry element in the comments feed, the comment author username is retrieved for that comment. 
Secondly, given the comment author username, the demographic characteristics are retrieved from the YouTube user profile, which contains information that the user lists on his YouTube profile page. The author’s age, gender, and location are retrieved. Stage 5a - Clustering-based Group Profiles: Groups of comment authors are derived based on content similarity of the video comments they write. A semantically-enriched clustering algorithm is employed to derive the group profiles. The objective of the derived groups is to support trainers in identifying key learning needs to embed in the design of the learning simulator storyboarding design. The algorithm is further described in Section 3.1. Stage 5b - Demographic-based Group Profiles: Groups of comment authors can also be constructed based on user predefined demographics. The objective of the derived group is to identify job interview-related concepts that users who share specific demographics with a potential learner are interested in or aware of. These identified concepts can be used to augment the model of that learner. For example, given two adult female learners who live in the United States and Great Britain, stage 5b in Figure 2 shows synthetic key Job Interview Concepts derived from the comments of female adults who live in US & GB. The demographic-based group profiling method is further described in Section 3.2. 3.1 Clustering-based Group Profiling. The methodology of the Clustering-based Group Profiling consists of the following steps: 3.1.1 Semantic filtration of the Comments Content. The textual content of each comment output by the noise filtration component is represented by the terms that exist in a semantically enriched Job Interview-related Bag of Words (BoW). This BoW is derived from an experimental study documented in [9]. 
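The reduction of a comment to its Bag-of-Words terms can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny `JOB_INTERVIEW_BOW` set, the function name `reduce_comment`, the regular-expression tokenizer, and the underscore-joined bigram matching are all assumptions made for illustration, whereas the actual BoW is the semantically enriched one derived in [9].

```python
import re

# Hypothetical stand-in for the semantically enriched Bag of Words;
# the real BoW is derived from the study cited as [9].
JOB_INTERVIEW_BOW = {
    "interview", "interviewer", "handshake",
    "eye_contact", "body_language", "nervous",
}


def reduce_comment(comment, bow=JOB_INTERVIEW_BOW):
    """Return the BoW terms found in a comment, in order of appearance.

    Bigrams (joined with an underscore) are tried before single words,
    so "eye contact" is kept as the single concept "eye_contact".
    """
    tokens = re.findall(r"[a-z']+", comment.lower())
    kept = []
    i = 0
    while i < len(tokens):
        bigram = "_".join(tokens[i:i + 2]) if i + 1 < len(tokens) else None
        if bigram in bow:
            kept.append(bigram)
            i += 2
        elif tokens[i] in bow:
            kept.append(tokens[i])
            i += 1
        else:
            i += 1  # term is outside the BoW, so it is dropped
    return kept
```

Under these assumptions, a comment such as "I always get nervous about eye contact with the interviewer" would be reduced to the terms `nervous`, `eye_contact`, and `interviewer`, and every other word would be discarded.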
Pre-processing the experimentally-controlled user comments includes two main steps: Text Preprocessing: includes NLP techniques for text analysis using the Antelope NLP framework2, i.e. sentence splitting, tokenization, Part of Speech tagging and syntactic parsing using the Stanford parser for linguistic analysis. This enables the extraction of a structured form text representation to empower further analysis using semantics. Semantic Enrichment: representing Ontology-based word sense disambiguation and linguistic semantic text expansion. The first filter applied concerns the selection of specific lexical categories implemented within the WordNet3 Lexicon English language thesaurus to directly exclude non-significant terms for the job interview activity. For the words remained, the Suggested Upper Merged Ontology (SUMO) [7] has been exploited to provide direct mappings of WordNet English word units. WordNet Lexicon queries were then performed to retrieve synonyms, antonyms and word lexical derivations to expand the word set. Furthermore, DISCO [13] has been exploited to retrieve distributionally similar words from the Wikipedia corpus, and the filters discussed above have been applied, i.e. lexical category and SUMO concept mapping. Table 2. Sample YouTube comment in original form (top) and reduced form (bottom). Figure 3. Text Clustering Process in RapidMiner to Cluster the YouTube Comment Authors into User Groups based on Learning Domain Concept Similarity in Their Comments. Each relevant YouTube comment is represented by only the terms that exist in the semantically-enriched Bag of Words. Table 2 depicts a sample YouTube comment in its original form (top) as well as in its reduced, annotated form (bottom). 3.1.2 Representing Authors by Their Filtered Comments. 
Given the semantically-filtered YouTube comments derived according to the method Section 3.1.1 and the author usernames retrieved in Stage 4, each comment author is represented by a term vector that consists of an aggregation of all the comments he wrote. Each derived term vector corresponds to one author. Table 3 depicts a term vector representation of one comment author, which is derived from combining two different annotated comments. The original two comments are shown on the right side of Table 3. Table 3. Term vector representation of Comment Authors. As can be seen in table 3, the vector terms derived from each comment are depicted in a color similar to the original comment. This vector representation of the author allows us to infer the overall awareness of that author in the job interview domain. For example, the author’s first comment indicates that the author is already aware of body language aspects in business, whereas his second comment indicates that the video enabled him to become aware of proper handshaking. The overall awareness can be seen by the author’s term vector representation. 3.1.3 Clustering Comment Authors. A RapidMiner4 process has been built to perform text clustering of the comment authors, where each unique author is represented by a term vector derived according to the method described in Section 3.1.2. Figure 3 shows the flow of operators used in the text clustering process. The functionality of each operator in the RapidMiner process is described as follows: - Read Database: Reads the term vector dataset from the database into the RapidMiner process. - Set Role: Sets the role of the author username as a row identifier and assigns the term vector attribute as an input for text clustering. - Extract Docs: Builds a text document representation for each term vector from the input dataset. This step is required for text clustering in RapidMiner. 
- Pre-Processing: Applies a number of text pre-processing sub-operators to the input text documents. These include: (i) tokenization, (ii) transforming to lowercase, (iii) stop word removal, (iv) filtering out terms that are too short, and (v) generating bigram phrases. - Clustering: Clusters the comment authors using the Feature Weighting K-Means Algorithm, which has been previously used to perform text clustering with good performance results [11]. The optimum value of K (the number of clusters) is selected after performing a clustering evaluation strategy in RapidMiner. The clustering evaluation strategy is described in Section 4.2. - Centroids: Extracts the cluster centroids from the derived clusters. Each centroid is a vector of the TFIDF weights of the terms in that cluster. These are written to a CSV file using the Write CSV operator. - Select Attribs: Extracts the cluster membership (cluster number) of each comment author. This is then written back to the database using the Write Database operator. 3.1.4 Deriving the Group Profiles. A profile for each group of comment authors is derived in this step, where each profile consists of the following elements: - Percentage of the comment authors in the group. - List of the job interview-related terms the authors of the group are interested in or aware of. These concepts are retrieved from the cluster centroid elements having maximum TFIDF weights. - Percentages of the gender distribution. - Percentages of the age groups in years (11-20, 21-30, 31-40, etc.). - Percentages of the location (country) distribution. - Sample comments written by authors who belong to the group. These comments are selected according to their relevance scores. The scores are computed based on the mathematical model presented in [3]. 
The top n comments having the maximum relevance scores are shown in the profile, where n can be a fixed number (e.g., 1 comment) or a proportion of the comments written by the group authors (e.g., 10%). Figure 4. Statistical Distribution of Three Demographic Properties for the YouTube Comment Authors. 3.2 Demographic-based Group Profiling. As discussed in Section 1, the initial learner models in simulated learning environments do not contain sufficient information about the learner. This makes it difficult for learning simulators to adapt well to the learner’s needs. On the other hand, users who share demographic characteristics may have similar interests or be aware of the same concepts. For example, people in certain age groups may be interested in the same kinds of songs. Collaborative filtering techniques in recommender systems are based on a similar concept, where users are recommended items based on how similar their preferences are to each other [20]. To address the limited scope of the initial learner model, a group profile for the YouTube users who share demographic characteristics with the learner can be derived by aggregating all the comment author vectors that meet specific demographic criteria. Then, the key job interview concepts that the users in this group are interested in can be identified as the n terms having the maximum TFIDF weights. These concepts can then be added to the model of the learner. 4. EXPERIMENTAL STUDY. To evaluate the framework instantiation in YouTube, an experimental study has been conducted. 
In the following subsections, we describe the YouTube data used in the study and the clustering validation techniques employed to determine the number of groups derived by the group clustering component of the framework, and we present sample group profiles, demonstrating with examples how the derived group profiles can be used to identify key learning needs and augment initial learner models. 4.1 Data Description. Prototypical versions of the clustering-based and demographic-based group profiling components in the framework have been implemented to illustrate how the framework can be exploited to facilitate the design of a simulated environment for learning interpersonal skills and good practices in job interviews. Table 4 provides a statistical description of the experimental data. Table 4. Overview of the YouTube Corpus. Seventeen YouTube videos that show teaching examples of interpersonal skills in job interviews have been selected by training professionals who are experts in the job interview domain. As can be seen from Table 4, most of the public comments (68%) on those videos are irrelevant to the job interview domain, even though the videos themselves are highly relevant to job interviews. This shows that filtering out noisy comments from high-traffic social media resources, such as YouTube, is important for removing the comments that are not valuable in deriving group profiles that describe the comment authors’ awareness of job interview domain concepts. This noise removal produces better clusters and reduces the computational time of the clustering process. Figure 4 depicts the gender distribution (a); the user age groups in years (b); and the 10 most frequent countries the authors are located in (c). These demographic statistics suggest that most comment authors who write relevant comments about job interview videos on YouTube are adult males who are located in Western countries. 
However, the figures also show that female users, elderly users, and users who live in different parts of the world, such as Turkey, Canada, Asia, and Australia, write relevant comments on YouTube job interview videos. Moreover, it would be interesting to compare the distribution of the demographic characteristics for each derived group with a generic profile for the overall YouTube community, provided such a generic profile exists. 4.2 Clustering Validation. In order to select the right number of clusters for the clustering-based group profiling to derive the YouTube groups, 9 feature weighting k-means clustering models are trained using the input dataset, where each model has a unique number of clusters (k). For each model trained, two cluster validity measures implemented in RapidMiner are computed: (i) Davies Bouldin index [8], and (ii) Cluster density performance measure. The Davies Bouldin index is the ratio of the sum of within-cluster scatter to between-cluster separation. This ratio is smaller for a model whose clusters are more compact and farther from each other than those of other models. The cluster density performance calculates the average distance between the YouTube comment authors in each cluster and multiplies the result by the number of comment authors in that cluster minus 1. The Euclidean distance is used as the distance measure. The smaller the density value is, the more similar the comment authors in the cluster are to each other, and thus the more compact the cluster is. Figure 5 shows the Davies Bouldin index (top) and average cluster density values (bottom) for the 9 feature weighting k-means clustering models, with (k) ranging between 2 and 10. Figure 5. Davies Bouldin Indices (top) & Cluster Density Values (bottom) for 9 Clustering Models. Figure 5 (top) shows that the model having 9 clusters produces the minimum Davies Bouldin index. 
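The two validity measures can be illustrated with a small, dependency-free Python sketch. It is an assumption-laden illustration rather than the RapidMiner implementation used in the study: `davies_bouldin` follows the standard definition of the index (the mean, over clusters, of the worst scatter-to-separation ratio), and `cluster_density` follows the verbal description given above for a single cluster. Both function names are hypothetical, and RapidMiner's own normalization and sign conventions may differ.

```python
import math
from collections import defaultdict


def _centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index with Euclidean distance.

    Requires at least two clusters with distinct centroids.
    Lower values indicate compact, well separated clusters.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    ids = sorted(clusters)
    centroids = {c: _centroid(clusters[c]) for c in ids}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        c: sum(math.dist(p, centroids[c]) for p in clusters[c]) / len(clusters[c])
        for c in ids
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in ids if j != i
        )
        for i in ids
    ]
    return sum(worst_ratios) / len(ids)


def cluster_density(points):
    """Average pairwise Euclidean distance in one cluster, times (n - 1)."""
    n = len(points)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    average = sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)
    return average * (n - 1)
```

To choose k in the manner described, one would fit one clustering model per candidate k, compute both measures for each model, and prefer the k with the smallest Davies Bouldin index while checking where the density values stop improving noticeably.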
Because RapidMiner negates the density values, the best cluster density values are those closest to zero. As can be seen in Figure 5 (bottom), the cluster density values become closer to zero as the value of (k) increases. However, the improvement in density becomes insignificant for the models beyond k = 8 clusters. Considering the two measures together suggests that clustering the YouTube comment authors into 8, 9, or 10 clusters can derive groups of users who are relatively close to each other inside each group and far from the users in other groups. In the next subsection, we base our analysis on the model that derives 9 groups, as suggested by the Davies Bouldin index results. 4.3 Example Group Profiles and Usage. 4.3.1 Identification of Learning Needs. We illustrate with an example how clustering-based group profiles can be used by training professionals to identify learning needs for groups of learners who meet the demographic criteria of the users who belong to the example groups. Table 5 shows the details of an interesting sample group which contains around 10% of the comment authors. The key job interview-related terms in this group suggest that the users who have been associated with this group by the framework express anxiety (e.g., “anxious”, “nervous”, “worried”). Other bigram phrases detected by the framework, such as “good_words”, “question_interviewer”, and “job_experience”, suggest that these topics could be the reason for the anxiety expressed in the comments. In the GUI, the identified concepts can be used by training professionals to browse the comments written by the users who are associated with this group to better understand the reason behind their anxiety. We simulated that by showing sample comments written by authors who are associated with this group by the clustering algorithm. 
As the first sample comment reads, the author is a potential job applicant who is worried that the interviewer might ask him about his limited previous job experience. Hence, he seeks advice on proper answers (good words) that he can give to respond well to such a question during his job interview. Similarly, the second comment author is nervous because he is not sure what good questions he may ask the interviewer if he is asked to do so during the interview. The trainers can identify these learning needs by displaying the identified learning concepts and using them to browse and read the authors’ comments that are linked to these concepts. Furthermore, based on the distribution of the demographic characteristics of the group users, the trainer may link these learning needs to adult applicants who live mostly in the US and Europe, as the age group and location distributions suggest for this group. 4.3.2 Learner Model Augmentation. As described in Section 3.2, group profiles may also be derived by aggregating all the individual user profiles of the comment authors, represented by their semantically-filtered comment term vectors, whose demographic characteristics, such as their age groups and locations, match a potential learner for a learning simulator. The learning domain concepts extracted from the group are added to the model of the learner to overcome the cold start problem, providing the learning simulator with more learning domain-relevant information about the learner. To evaluate the effect of having different demographic properties on the learning domain concepts, we assume an artificial example of three adult learners with different known locations: Great Britain (GB), United States (US) and Asia. 
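The aggregation described in Section 3.2 can be sketched in a few lines of Python. This is a hedged illustration, not the framework's code: the function name `demographic_group_profile`, the dictionary layout of an author profile (`age`, `country`, `terms`), and the choice to aggregate by summing per-term weights are all assumptions, since the text does not specify how the author vectors are combined.

```python
from collections import Counter


def demographic_group_profile(authors, matches, n=5):
    """Return the n highest-weighted terms among the matching authors.

    authors: iterable of dicts with a "terms" mapping of term -> weight
             (hypothetical layout; weights assumed to be precomputed).
    matches: predicate selecting authors who meet the demographic criteria.
    """
    totals = Counter()
    for author in authors:
        if matches(author):
            totals.update(author["terms"])  # adds weights term by term
    return [term for term, _ in totals.most_common(n)]
```

A caller would pass a predicate such as `lambda a: a["country"] == "GB" and a["age"] >= 18` to obtain the concepts for adult learners in Great Britain; the returned terms are the candidates for augmenting that learner's initial model.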
Given their ages and locations as input to the framework, three demographic-based group profiles are derived by aggregating the term vectors of the individual YouTube comment authors who meet the demographic properties of each of the learners. Table 6 depicts the profiles derived for the three different demographic properties. As can be seen, the key learning concepts that the learners are interested in or aware of differ considerably from one group to another. Differences are illustrated in three job interview aspects: - Body Language Signals: The body language signal that adults in GB frequently talk about is “eye contact”. In the comments written by US users, on the other hand, other body parts used in body language signals, such as “fingers” and “hands”, appear frequently. Concepts extracted from content written by users in Asia do not suggest that they are interested in or aware of body language signals. - Emotions: US adult job candidates are more inclined to express anxious emotional states when talking about job interviews. This can be sensed from the “nervous” concept appearing only in the group of US adults. GB adults, on the other hand, tend to show more confidence by using terms like “hope” and “helpful”. - Interests: Asian users show more interest in watching interview and job hunting guides than users in the US and GB. Interest by users in Asia and the US in the financial aspect can also be seen in relevant terms such as “money” (Asia, US) and “pay” (Asia). US users show more interest in “companies” and “education” in addition to money. Both GB and US users tend to mention the “interviewer” more frequently than the interviewee, as opposed to users in Asia, who tend to mention the “candidate” more frequently. Mentions of people, as can be seen in terms like “people” and “girl”, are more apparent in comments written by authors who live in the US. Table 5. A Sample Clustering-based Profile for a YouTube Group having 10% of Comment Authors. Table 6. Demographic-based Group Profiles. 5. RELATED WORK. In this Section, we position our work within the literature in three main relevant aspects, namely: (i) Augmenting user models for personalization and adaptation in simulated environments, (ii) Mining user-generated content in social media for learning support and learner modeling, and (iii) Exploiting machine learning techniques to mine social media content. Augmenting user models by deriving characteristics of other similar users for personalization and adaptation in digital environments is well founded in the literature. [16] presented a machine learning framework that addresses the lack of interoperability between different multiple electronic systems "