
Clustering by Usage: Higher Order Co-occurrences of Learning Objects

InProceedings

In this paper, we introduce a new way of detecting semantic similarities between learning objects by analyzing their usage in a web portal. Our approach does not rely on the content of the learning objects or on the relations between the users and the learning objects but on usage-based relations between the objects themselves. The technique we apply for calculating higher order co-occurrences to create semantically homogenous clusters of data objects is taken from corpus driven lexicology where it is used to cluster words. We expect the members of a higher order co-occurrence class to be similar according to their content and present the evaluations of that assumption using two teaching and learning systems.

"1. INTRODUCTION. In this paper we present a new way to cluster learning objects into semantically homogenous groups without considering their content or additional semantic metadata. We do so by just taking the usage of the learning objects, i.e. the interaction of the learners with the objects into account. For this purpose, we borrow the technique of calculating higher order co-occurrences from corpus linguistics where it is used to cluster semantically similar words and apply it to the usage of learning objects. In linguistics, first order co-occurrences of a word can be calculated by taking the context of that word into account, e.g. the sentences in which it occurs. Second order co-occurrences can be calculated by taking the co-occurrences of the first order co-occurrences into account and so forth. For example: drink and beer are co-occurrences, as well as drink and wine. Therefore, beer and wine are co-occurrences in the first order co-occurrence class of drink, this means beer and wine are second order co-occurrences. Higher order co-occurrence classes of words tend to be semantically homogenous; this is to say they are similar according to specific attributes, e.g. their direct hypernym. While in linguistics the basic unit is a word that is used in sentences, we transfer this approach to usage data and consider learning objects as basic units that are used in sessions. Therefore, the context of a learning object consists of all sessions the object was accessed in and a session consists of all objects that were accessed in that session. Thus, if two learning objects were used in the same session, they are said to be co-occurrences. In this paper we address the question whether higher order co-occurrence classes of learning objects become semantically homogenous, like the analogue classes of words. 
This is a non-trivial question; it is by no means necessary that the semantic convergence of words in higher order co-occurrence classes can also be observed for entire learning objects. However, the possibility to derive semantic similarity from usage is highly promising for diverse applications of information retrieval in learning settings and thus the question is worth solving. We evaluate the potential of this approach using the usage data collected in the systems MACE [1] and Travel Well [2]. The rest of the paper is structured as follows: In section 2 we briefly report on the related work in the area of clustering semantically similar data objects such as documents and pictures. In section 3 we introduce higher order co-occurrences in corpus linguistics and describe in section 4 how we adapt this idea to calculate higher order co-occurrences of learning objects. We then present the corresponding investigations using the MACE and the Travel Well dataset as test-beds in order to illustrate our ideas in section 5 and give a summary and an outlook in section 6. 2. RELATED WORK. Traditionally, representations describing the content of a data object are needed to find semantically similar objects. There exist several approaches to create such profiles and append semantic features to different types of data objects. Some approaches are based on the manual creation of such data, others on the automatic extraction of semantic features. A common approach to automatically extract semantic features from text documents is based on the idea that the content of a text can be represented by a list of characteristic keywords. Thus, by extracting keywords one can construct a shallow semantic representation of texts. 
A commonly applied measure for extracting keywords is the TF-IDF [3] measure which is based on the assumptions that on the one hand, the more often a term occurs in a document the more representative it is, but that on the other hand, the more often a term occurs in the entire collection of all documents the less relevant it is to discriminate documents. That is, keywords shall both be representative and discriminative. However, a lot of learning object collections do not only contain text documents but also images, videos or audio files. Thus, additional automatic content extraction methods for different media types are required. Pictures for instance can be analyzed using content-based image retrieval (CBIR) methods [4] which take into account the actual content of an image represented by the features color, shape, texture or similar content-related features. The information is extracted using automatic image processing algorithms. Since the extracted information is on a very low level, the results mostly contain only limited semantic information not matching the user’s search queries. Given an image of a specific person, CBIR methods can detect that a person is shown in the picture but can usually not identify who it is. With images showing complex scenes the identification and extraction of semantic features often fails [5]. An alternative to the automatic extraction of semantic features is manually created information. Traditional environments like libraries use the expertise of librarians or archivists to create metadata about their resources. These metadata typically contain information such as title, author and further classifications. Since this procedure requires manual creation of the metadata for all resources, it is lengthy and time consuming and also requires a lot of expertise. Further, maintenance of the data is problematic as well. 
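The TF-IDF idea mentioned above can be illustrated with a minimal sketch. It assumes already tokenized documents and uses one common weighting variant (relative term frequency times the logarithm of the inverse document frequency); the function name `tf_idf` and the toy documents are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            # Frequent in this document -> representative;
            # frequent in the whole collection -> not discriminative.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical toy collection.
docs = [["brick", "wall", "brick"], ["glass", "wall"], ["steel", "wall", "frame"]]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection the term occurring in every document receives weight zero, while a term repeated within a single document receives the highest weight.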
One way to avoid such problems can be the use of social metadata like ratings, tags or comments about learning objects which are created by a community. The data can be used to create different views on the resources, e.g. by filtering out content which is tagged with a specific keyword or only displaying content which is frequently used or highly rated. Tags in particular provide an effective way to represent user interests and help the user to find documents about a specific topic [6]. Thus, this approach can effectively be used for different media types like images, videos or audio files where it is still difficult to automatically extract the appropriate semantic features [7]. A disadvantage of using such social metadata is that it has to be added by a community and thus often contains ambiguous or synonymic tags which make a comparison difficult. Further, it is not assured that each tag is assigned correctly which can lead to wrong results. Therefore, it seems to be preferable to find a new way to cluster semantically similar objects without considering the content but only their usage. Collaborative filtering approaches [8] employ this method by taking the relations between users and objects into account using implicit and explicit feedback, e.g. if a user bought a product or listened to a song. However, collaborative filtering approaches are not suitable to cluster semantically similar data objects but instead try to overcome semantic niches. Our approach does not rely on the relations between users and objects but on the relations between the objects themselves, i.e. whether they are used in similar contexts. First approaches for clustering objects based on their interrelations were mostly conducted in the area of web mining. Rongfei et al. [9] create object vectors containing the most significant co-occurrences of the respective objects. Thereafter a DBSCAN algorithm is used to cluster the objects based on these vectors. 
Smith and Ng [10] follow a related approach. They use the sessions an object occurred in to describe it. First, the sessions are clustered into transaction groups. An object (which is represented by a URL) is then characterized in terms of the transaction groups, e.g. a URL was called in seven sessions that belong to transaction group A and in five sessions that belong to group B. The object vectors are then used as input for a self-organizing map for clustering them. In this paper we present an approach where not only the first order co-occurrences of an object are taken into account but higher order co-occurrences as well to cluster semantically related objects. 3. BACKGROUND OF HIGHER ORDER CO-OCCURRENCE CLUSTERING. 3.1 Co-occurrences in Corpus Linguistics. If you want to know the meaning of a word, it is a good strategy to look it up in a dictionary where in some cases you might find a helpful definition. However, in most cases a definition (if there is one at all) might not be sufficient to correctly understand the word’s meaning as the word's context is often needed for clarification. The words strong and powerful, for example, have highly related meanings. However, we can say strong tea while we cannot say powerful tea. Powerful drug though is acceptable [11]. Definitions of the words’ meanings will most probably not cover such differences. Therefore, dictionaries usually give contexts in which a word typically occurs to illustrate the actual word usage. Context is considered to be significant for the meaning of a word. Firth [12] says “you shall know a word by the company it keeps”. (See [13] for an overview on the linguistic tradition related to Firth.) The company a word keeps – its co-occurring words – contributes to its meaning. Two words might just co-occur by accident. However, the co-occurrence might also be relatively frequent and thus statistically significant. 
Statistically significant co-occurrences reveal close relationships between the co-occurring words or their meanings, respectively: they are used to detect multi-word expressions (New York), idioms (kick the bucket [14]) or constructions with a milder idiomatic character (international best practice [11]). When calculating the co-occurrences of a word, one can consider different definitions of how near two words must be to co-occur, depending on the purpose of the analysis. It is possible to only consider the direct neighbors of a word, as in the examples above, to use a static frame of words or to consider the whole sentence. The significant co-occurrences form the co-occurrence class of the respective word. For example, the co-occurrence class of the word dog is made up of the words bark, growl and sniff among others. It can then be examined whether words significantly co-occur in co-occurrence classes. These words again form another co-occurrence class, namely a higher order co-occurrence class. For instance, feed and dog are co-occurrences, as well as feed and cat. Therefore, dog and cat are second order co-occurrences. After some iterations the elements in the higher order co-occurrence classes become stable and semantically homogenous. Heyer et al. [15] show this for the co-occurrences of IBM, among other words. Their investigations are based on text corpora collected for the portal wortschatz.uni-leipzig.de (a portal on German vocabulary). The first co-occurrence class is rather heterogeneous, containing words like computer manufacturer, stock exchange, global and so on. After some iterations of computing higher order co-occurrence classes, however, the classes become more homogenous and stable. The co-occurrence class of tenth order only contains names of other computer-related companies like Microsoft, Sony etc. 3.2 Significance of Co-occurrences. 3.2.1 Calculation of Significance Values. 
When calculating the significance of a co-occurrence, its frequency is not sufficient as a measure; the marginal frequencies of the individual words also have to be taken into account. For example, the bigram is to is one of the most frequently recurring word pairs in the Brown corpus with 260 occurrences. However, the word is occurs about 10,000 times and the word to occurs about 26,000 times in the corpus, which contains 1 million words altogether. Therefore, even if words were sequenced randomly we would expect them to co-occur about 260 times, and they are not considered significant co-occurrences [14]. We assume that the same holds true for the distribution of objects in sessions and apply the same measures as in corpus linguistics to calculate significant co-occurrences of learning objects. There are several measures that can be used to calculate how strongly words or objects attract each other. These measures can be divided into measures of effect size that calculate how much the observed co-occurrence frequency exceeds the expected frequency (e.g. MI, Dice, odds ratio) and measures of significance that measure how unlikely the null hypothesis is that the words or objects are independent (e.g. z-score, t-score, simple-ll, chi-squared, log-likelihood). For more details see [16] where more than 30 different measures are discussed. In the following, we present a measure of significance that is based on the Poisson distribution. We choose this measure as it was already successfully applied in corpus linguistics for calculating higher order co-occurrences of words, e.g. by Heyer et al. [15]. The comparison by Bordag of the performance of different significance measures for co-occurrences, namely Dice coefficient, Mutual Information measure, Lexicographer's Mutual Information, t-score, z-score, two log-likelihood based measures and two Poisson distribution based measures, supports this choice [17]. 
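The is/to example can be checked with a few lines of code, and a Poisson-based significance value can be sketched alongside it. The sketch writes a and b for the number of contexts containing each item, k for the number of contexts containing both, and n for the number of all contexts. The exact formula used in the paper is not reproduced in this text, so `poisson_significance` below is an assumption: it uses one formulation found in the corpus-linguistics literature, the negative logarithm of the Poisson upper-tail probability normalized by log n. All function names are hypothetical, and treating the corpus size as the number of contexts for a bigram is a simplification.

```python
import math


def expected_cooccurrences(a, b, n):
    """Expected number of joint contexts if both items were independent."""
    return a * b / n


def poisson_tail(k, lam, max_terms=10_000):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, k + max_terms):
        # Work in log space to avoid overflow in lam**i and i!.
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        if i > lam and term < total * 1e-16:
            break  # remaining terms no longer change the sum
    return min(total, 1.0)


def poisson_significance(a, b, k, n):
    """Assumed form: -log P(X >= k) / log n with lambda = a * b / n."""
    tail = poisson_tail(k, expected_cooccurrences(a, b, n))
    if tail == 0.0:
        return math.inf  # probability underflowed: extremely unlikely by chance
    return -math.log(tail) / math.log(n)


# Figures from the Brown corpus example in the text.
expected_is_to = expected_cooccurrences(10_000, 26_000, 1_000_000)
sig_is_to = poisson_significance(10_000, 26_000, 260, 1_000_000)

# Hypothetical rare pair that co-occurs far more often than chance predicts.
sig_rare = poisson_significance(50, 40, 20, 1_000_000)

print(expected_is_to)  # 260.0
print(sig_is_to, sig_rare)
```

Under these assumptions the frequent pair is to scores close to zero because its observed count matches the expected count, whereas the hypothetical rare pair scores much higher.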
Furthermore, a formal proof justifying the assumption of a Poisson distribution for co-occurring words in a corpus, if the frequency of most words is much smaller than the corpus size, is given in [18]. Thus, following Heyer et al. we define the significance for the learning objects A and B based on the Poisson distribution as: (FORMULA_1), where a = frequency of contexts containing A, b = frequency of contexts containing B, and n = frequency of all contexts. Under the assumptions λ > 2.5 and k > 10, with k being the frequency of all contexts containing A and B together, formula (1) can be simplified: (FORMULA_2). 3.2.2 Detection of a Suitable Threshold. There are various ways of deciding whether a co-occurrence is significant or coincidental, e.g. by ranking or by using a threshold. Ranking means that the co-occurrences are sorted by their significance values and only the best n co-occurrences are selected. When using a threshold, only co-occurrences with a significance value higher than the threshold are selected. However, there is no standard scale of measurement to draw a clear distinction between significant and non-significant co-occurrences [14]. Therefore the calculation of a suitable n or a suitable threshold (depending on the approach) is an exploratory investigation. 4. HIGHER ORDER CO-OCCURRENCE CLUSTERS. Sessions are used as input for the calculation of higher order co-occurrences. If timestamps are available, a session is made up of all learning objects accessed by a user without a break longer than an hour between two accesses. After a break of at least an hour, a new session starts. Currently, there are no further lower or upper limits for the size of a session. If only a date is stored for an action, a session comprises all activities of a user on one day. Duplicate learning objects in the sessions are deleted. For example, in the session the object A was accessed once and the object B was accessed twice. 
Without deletion, we would consider two co-occurrences between A and B that arose from one context and this would distort the further calculations. After the sessions are created, they are taken as input to generate significant first order co-occurrences. For calculating the significance of the co-occurrences we use the formula given in section 3.2.1. All significant co-occurrences of a learning object together form its first order co-occurrence class. The first order co-occurrence classes of all learning objects serve as input for the calculation of the second order co-occurrence classes which then form the input for the calculation of the significant third order co-occurrence classes and so forth. To clarify this with an example, let’s take the following sessions S1, S2, and S3 as input: S1 = S2 = S3 = . The calculation of the first order co-occurrences results in the following (temporary) classes for A, B, C, D, E, and F. The total frequencies of the objects and their co-occurrences are given in brackets, e.g. A and D co-occur 2 times, whereas B occurs 2 times and D 3 times in total. The given frequencies are used to calculate the significance values for each co-occurrence based on the Poisson distribution (see section 3.2.1). For demonstration we use exemplary significance values in this example. (For successfully applying the presented measure of significance, a large collection of sessions is required.) The significance values of the co-occurrences are given in brackets. Given a threshold of 1.3, the final co-occurrence clusters can be calculated by deletion of all co-occurrences with a significance value lower than the threshold. First order co-occurrence class for A: First order co-occurrence class for B: First order co-occurrence class for C: First order co-occurrence class for D: First order co-occurrence class for E: First order co-occurrence class for F: The generated classes can now be used to calculate the second order co-occurrences. 
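The step from sessions to first order classes, and from one order to the next, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it replaces the Poisson significance with a plain co-occurrence count threshold (`min_count`) so that a tiny example stays readable, and the function names, sessions and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_classes(contexts, min_count=2):
    """Map each object to the objects it shares at least `min_count` contexts with."""
    pair_counts = defaultdict(int)
    for context in contexts:
        # set() drops duplicate accesses so that one context counts once per pair.
        for a, b in combinations(sorted(set(context)), 2):
            pair_counts[(a, b)] += 1
    classes = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_count:  # stand-in for the significance threshold
            classes[a].add(b)
            classes[b].add(a)
    return {obj: frozenset(members) for obj, members in classes.items()}


def higher_order_classes(sessions, min_count=2, max_order=10):
    """Iterate until the classes stop changing; return the classes and their order."""
    classes = cooccurrence_classes(sessions, min_count)  # first order
    order = 1
    while order < max_order:
        # The classes of the current order serve as contexts for the next order.
        next_classes = cooccurrence_classes(classes.values(), min_count)
        if next_classes == classes:
            break  # stable
        classes = next_classes
        order += 1
    return classes, order


# Hypothetical sessions: A, B, C, D are used together; E only appears with A.
sessions = [["A", "B", "C", "D"], ["A", "B", "C", "D"], ["A", "E"], ["A", "E"]]
classes, order = higher_order_classes(sessions)
print(order, {obj: sorted(members) for obj, members in classes.items()})
```

In this toy input, E is a first order co-occurrence of A but disappears in the second order, because E does not share any first order class with A.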
This is done the same way as calculating first order co-occurrences but taking the first order co-occurrence clusters as input and not the sessions. This leads to the following second order co-occurrences for A, B, C, D (please note that the second order co-occurrence classes for E and F are empty, so they are not shown here). The total frequencies of the co-occurrences and exemplary significance values are given in brackets. Using again the threshold of 1.3, the following second order co-occurrence classes arise: Second order co-occurrence class for A: Second order co-occurrence class for B: Second order co-occurrence class for C: Second order co-occurrence class for D: These classes can now be used as input for the calculation of third order co-occurrences and so forth. The calculation stops when the classes become stable, i.e. they do not change anymore in further iterations. 5. CLUSTERING RESULTS. 5.1 Testbeds. 5.1.1 MACE. The MACE (Metadata for Architectural Contents in Europe) project relates digital learning resources about architecture, stored in various repositories, with each other across repository boundaries to enable new ways of finding relevant information [19]. While interacting with the MACE portal, users are monitored and their activities are recorded as CAM (Contextualized Attention Metadata, [20], [21]) instances. Activities include search, access and metadata provision activities like tagging and rating. The CAM instances used for the evaluation were collected from September 2009 until April 2010 and comprise at least a timestamp as well as a user identifier and an item identifier. The actions considered for a learning object to be part of a session are goToPage, i.e. the user leaves the MACE portal to access the original learning object, and getMetadataForContent, i.e. the user accesses the metadata of a learning object at the MACE portal, e.g. its ratings or user tags. 
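The session definition from section 4 (a new session after a break of at least one hour, duplicates removed) can be applied to events that carry a timestamp, a user identifier and an item identifier. The following sketch assumes events are given as `(user_id, timestamp, item_id)` tuples; this tuple layout, the function name `build_sessions` and the sample events are assumptions for illustration and do not reflect the actual CAM schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_sessions(events, max_gap=timedelta(hours=1)):
    """Group (user_id, timestamp, item_id) events into per-user sessions.

    A new session starts after a break of at least `max_gap`. Each session
    is a set, so repeated accesses to the same object are stored only once.
    """
    by_user = defaultdict(list)
    for user, timestamp, item in events:
        by_user[user].append((timestamp, item))

    sessions = []
    for user_events in by_user.values():
        user_events.sort(key=lambda event: event[0])
        current, last_timestamp = set(), None
        for timestamp, item in user_events:
            if last_timestamp is not None and timestamp - last_timestamp >= max_gap:
                sessions.append(current)
                current = set()
            current.add(item)
            last_timestamp = timestamp
        if current:
            sessions.append(current)
    return sessions


# Hypothetical events; identifiers and times are illustrative only.
events = [
    ("u1", datetime(2009, 9, 1, 10, 0), "A"),
    ("u1", datetime(2009, 9, 1, 10, 20), "B"),
    ("u1", datetime(2009, 9, 1, 10, 30), "B"),  # duplicate access
    ("u1", datetime(2009, 9, 1, 12, 0), "C"),   # break of 90 minutes
    ("u2", datetime(2009, 9, 1, 10, 5), "A"),
]
sessions = build_sessions(events)
print(sessions)
```

For datasets that only store a date, passing dates as timestamps together with `max_gap=timedelta(days=1)` yields one session per user and day, assuming `datetime.date` values are used consistently.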
For the calculation of the higher order co-occurrence clusters, we were thus able to take 46,641 events into account that took place in 2,449 sessions. On average, about 11 learning objects were accessed per session. Overall, 3710 distinct learning objects were accessed [1]. MACE stores the metadata representations of the learning objects on a central server. The representations are based on the MACE application profile which in turn is based on the Learning Object Metadata (LOM) standard [22]. The MACE application profile comprises several categories that are used to specify a learning object in more detail, such as the general category where basic information about the learning object is stored and the annotation category where comments about a learning object's usefulness for education and the comments’ origins can be stored. We calculate the metadata-based similarity of all pairs of learning objects to get a reference value for evaluating the higher order co-occurrence clusters. Each learning object holds one or more titles and descriptions and is marked with the learning resource types it comprises, e.g. a text containing a figure is marked with the learning resource types narrative text and figure. As these terms belong to a controlled vocabulary, they are comparable. MACE also offers users and domain experts the possibility of editing parts of the metadata, namely tags, classifications and competencies. Tags are free text and can be assigned to learning objects by logged-in users. Classifications and competencies are each defined in a controlled vocabulary and can only be set by domain experts. The classification vocabulary is a taxonomy consisting of 2884 terms. The competency vocabulary contains 107 terms to describe the suitability of learning objects for the acquisition of special competencies, e.g. Knowledge of internal environment control and Understanding interaction between technical and environmental issues. 
To compare the MACE learning objects, document vectors describing them are generated by considering the following assortment of available information: titles, descriptions, learning resource types, user tags, classifications and competencies. Before calculating the metadata similarity, the titles and descriptions are pre-processed. After removing stop words the remaining words are stemmed using the Snowball Stemmer [23]. The metadata-based similarity is then calculated using the cosine similarity, i.e. measuring the similarity between two vectors by calculating the cosine of the angle between them [24]. 5.1.2 Travel Well. The dataset [2] was collected on the Learning Resource Exchange (LRE) portal that makes open educational resources available from more than 20 content providers in Europe and elsewhere. These learning resources exist in multiple languages and conform to a variety of national and local curricula. The registered users, mostly primary and secondary teachers, come from a number of different European countries. The dataset contains information about the rating and tagging behavior of 98 registered users over a period of six months (August 2008 – February 2009). For each user activity, the date, user id, item id and the tag or the rating, respectively, are stored. As there is no timestamp but only the date, a session comprises all activities conducted by a user in one day. Overall, 14248 events took place in 255 sessions where each session comprises 55 distinct learning objects on average. 75 users rated 1698 unique objects on a scale of 1 to 5 for usefulness; each of these objects was rated 1.3 times on average. Additionally, 79 users tagged 1838 unique objects with 12041 tags in total; consequently each object was assigned 6.5 tags on average. Information that is available about the users and learning objects is e.g. 
mother tongue, spoken languages, and subjects the user is interested in as well as title, metadata provider, language, classification keywords, and intended end user age for the learning objects. Similar to the MACE dataset, we calculate the metadata-based similarity of all object pairs to get a reference value for evaluating the higher order co-occurrence clusters. We do so by taking the classification keywords and the tags into account. Since an item cannot be tagged more than once with the same keyword, we created a binary vector for each item and used the Tanimoto coefficient [25] for calculating the metadata-based similarity. 5.2 Usage-based Clustering: Results. Before clustering, we need to decide how to distinguish whether a co-occurrence is significant or not. This can be done by ranking and choosing the first n co-occurrences or by using a threshold. Since in the given datasets some learning objects can have more learning objects they are similar to than others, we decided to use a threshold. The calculation of a suitable threshold is an exploratory investigation. One possible indication for the quality of a threshold without considering the content is the cluster size and the number of clusters. Another way to test the quality of a threshold is to manually check some clusters for their semantic consistency. Additionally, even if only available for some learning objects, semantic metadata describing the objects can be used to automatically train a suitable threshold. Here, we used the MACE dataset including semantic metadata to find the best fitting threshold to create meaningful and semantically consistent clusters, which is 1.55. The use of higher thresholds resulted in a lot of very small clusters and the use of lower thresholds resulted in a few small and a few very big clusters (more than 1000 learning objects). Additionally, we used the semantic metadata to manually and automatically validate this choice. 
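The two metadata-based reference measures named for the test-beds, cosine similarity for the MACE document vectors and the Tanimoto coefficient for the binary Travel Well vectors, can be sketched as follows. Vectors are represented as sparse dictionaries and binary vectors as sets of keywords; these representations, the function names and the sample values are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def tanimoto(a, b):
    """Tanimoto coefficient of two binary vectors given as sets of keywords."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical metadata for two pairs of learning objects.
cos = cosine_similarity({"brick": 1.0, "wall": 1.0}, {"wall": 1.0, "glass": 1.0})
tan = tanimoto({"math", "geometry", "triangle"}, {"math", "geometry", "circle"})
print(cos, tan)
```

For binary vectors the Tanimoto coefficient reduces to the number of shared keywords divided by the number of distinct keywords assigned to either object.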
We then applied this threshold directly to the Travel Well dataset to test whether the threshold is transferable or must be retrained for each environment. 5.2.1 MACE. Using 1.55 as threshold, the higher order co-occurrence classes became stable after the fifth iteration. This means that in further iterations the classes did not change anymore. The calculations resulted in 184 clusters that contain on average 63 learning objects. Figure 1. Cluster size distribution for MACE. The smallest cluster contains three learning objects and the biggest one 719 learning objects. About 78% of the clusters contain at most 30 learning objects and only 17% of the clusters contain more than 100 learning objects (see Fig. 1). Figure 2. Learning object distribution for MACE. The higher order co-occurrence based clustering is not a hard clustering. Therefore, a learning object can belong to more than one cluster. However, about 70% of the learning objects belong to at most 3 clusters and there is no learning object that belongs to more than 9 clusters (see Fig. 2). 5.2.2 Travel Well. As for MACE, the usage classes became stable after the fifth iteration using the threshold of 1.55. The 1838 learning objects were clustered into 100 clusters that contain on average 142 learning objects, see Fig. 3. Figure 3. Cluster size distribution for Travel Well. The smallest cluster contains 6 learning objects and the biggest one 420. It is noticeable that the clustering of the Travel Well dataset, where each learning object is contained in about 7 clusters on average (see Fig. 4), resulted in clearly bigger clusters than the MACE clustering. This is because the sessions in Travel Well, which contain 55 objects on average, are considerably larger than in MACE, where a session only contains 11 learning objects on average. 
The Travel Well sessions might be this large because there are no timestamps in the dataset but only the date, and therefore all activities conducted by a user in a day form one session regardless of factors such as potential learning breaks. Figure 4. Learning object distribution for Travel Well. 5.3 Evaluation. The following evaluation answers two questions: (a) Do all the learning objects of the same clusters have a significantly different relatedness than learning objects randomly drawn from the set of all learning objects? (b) If so, does the clustering lead to a significantly lower or higher relatedness and what is the relation of the significantly improved or worsened clusters? These issues will be resolved by applying two statistical methods, namely the Kruskal-Wallis-Test [26] for the first question and a test for the Student's t-distribution [27] for the second one. At first glance the ANOVA [29] seems appropriate to check whether the clustering has an overall effect. Unfortunately, this approach must be dropped as the ANOVA has preconditions on the data that were not met. For one, semantic relatedness is not normally distributed as a Kolmogorov-Smirnov Test revealed [28]; secondly the homogeneity of variances within clusters is not given. Semantic relatedness tends to be power-law distributed. Fig. 5 and Fig. 6 which show the distribution of the pair-wise metadata-based similarities in MACE and Travel Well support this claim. Figure 5. Distribution of the pair-wise metadata-based similarities in MACE. For these reasons, the non-parametric alternative to the ANOVA, the Kruskal-Wallis-Test [26], was chosen. Kruskal-Wallis is based on ranked data and does not make such strong assumptions as an underlying normal distribution or homogeneity of variances. Figure 6. Distribution of the pair-wise metadata-based similarities in Travel Well. As for the second question and approach, i.e. 
whether the clustering leads to an improvement of the semantic relatedness, it is assumed that the means of the relatedness values within clusters are at least t-distributed around the overall mean value of the population of MACE or Travel Well objects, respectively. This is a known fact based on the central limit theorem, according to which the means of samples tend to be normally distributed around the overall mean of the population if the sample size is equal to or greater than 30. For smaller samples the means are Student-t-distributed. The corresponding values can be easily computed according to formula 3: (FORMULA_3). In this case we take advantage of the fact that the specific values of the population, i.e. mean and standard deviation, can directly be computed and thus do not have to be estimated. This leads to formula 4: (FORMULA_4). Here, the estimated standard error is replaced by the computed standard error for the population. To check whether several clusters deviate significantly from the mean of the population the relevant t-value has to be calculated and must then be checked against the t-value needed for significance. As this is not a standard t-test, alpha-errors that would bias the results by chance do not occur [30]. 5.3.1 Kruskal-Wallis-Test. The Kruskal-Wallis-Test evaluates whether the medians of certain groups’ ranked data (dependent variable) differ systematically or not. Therefore, the test distinguishes between the entire set of objects and the cluster set of the higher order co-occurrence clustering approach, i.e. it is tested whether the semantic relatedness values of the respective clustering – all partition sets taken together – differ significantly from the overall median of relatedness of the entire MACE or Travel Well set, respectively. The respective null hypothesis (H0) can be formulated as such: H0: The learning objects are taken from the same population. Higher order co-occurrence based clustering does not have an effect on semantic relatedness. 
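The rank-based statistic behind the Kruskal-Wallis-Test can be sketched in plain Python. This is the textbook H statistic with average ranks for ties and the usual tie correction, not the authors' evaluation code; the function name and the sample values are hypothetical. The resulting H is compared against a chi-squared distribution with (number of groups minus 1) degrees of freedom.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples (lists of numbers)."""
    # Pool all values and remember which group each value came from.
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for _, g in pooled[i:j]:
            rank_sums[g] += average_rank
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j
    h = 12 / (n_total * (n_total + 1)) * sum(
        rank_sum ** 2 / len(values) for rank_sum, values in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0


# Hypothetical relatedness values: within-cluster pairs versus random pairs.
h_separated = kruskal_wallis_h([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
h_identical = kruskal_wallis_h([[0.1, 0.2], [0.1, 0.2]])
print(h_separated, h_identical)
```

Completely separated samples yield a large H, while samples with identical values yield an H of zero.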
Table 1 displays the result of the tests. It shows that the clustering does have a significant impact on semantic relatedness. Table 1. Kruskal-Wallis-Test. For both datasets, H0 must be rejected and H1 accepted: H1: The learning objects are not taken from the same population. The higher order co-occurrence clustering has an effect on semantic relatedness. 5.3.2 Test for Student-t Distribution. The Kruskal-Wallis-Test tests whether the entire clustering has an effect on semantic relatedness. What remains is to compare the individual cluster sets with the overall set of all objects. As in our case the population of all MACE or Travel Well objects, respectively, is available, no additional post-hoc tests must be conducted. To compare individual cluster sets with the overall set, it can simply be tested whether the cluster means of semantic relatedness differ significantly from the overall mean of the population. To this end, the null hypothesis for each single cluster set with the same sample size is formulated: H0: The mean semantic relatedness of a specific cluster set does not differ from the mean semantic relatedness of the whole population of learning objects. H1: The mean semantic relatedness of a specific cluster set systematically differs from the mean semantic relatedness of the whole population. The Student t-test described above was applied to every cluster set of the clustering to evaluate H0 and H1. We conducted a two-sided test at the 5% significance level for both datasets, see Table 2. Table 2. t-distribution test. The semantic density of the clear majority of cluster sets in the MACE dataset as well as in the Travel Well dataset is significantly higher than the semantic density of the overall population (where semantic density is defined as the mean value of semantic relatedness). This means that the chance that the included learning objects show a higher similarity than in a random draw is about 80% for MACE and 96% for Travel Well. 
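The per-cluster comparison against the population can be sketched as follows. Formula 4 is not reproduced in this text, so the sketch assumes the standard form in which the difference between cluster mean and population mean is divided by the population standard deviation over the square root of the cluster sample size. The function name and the sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev


def cluster_t_value(cluster_values, population_values):
    """Deviation of a cluster mean from the population mean in standard errors.

    Assumed form: (cluster mean - population mean) / (sigma / sqrt(n)), with
    mean and sigma computed directly from the whole population.
    """
    population_mean = fmean(population_values)
    population_sd = pstdev(population_values)  # population, not sample, deviation
    standard_error = population_sd / math.sqrt(len(cluster_values))
    return (fmean(cluster_values) - population_mean) / standard_error


# Hypothetical relatedness values for all pairs and for one cluster's pairs.
population = [0.0, 0.0, 1.0, 1.0]
cluster = [1.0, 1.0, 1.0, 1.0]
t_value = cluster_t_value(cluster, population)
print(t_value)
```

The resulting value is then compared with the critical t-value for the chosen significance level and the cluster's sample size; a positive value indicates a cluster that is semantically denser than the population.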
This is a very good result su"
