In this paper, we introduce a new way of detecting semantic similarities between learning objects by analyzing their usage in a web portal. Our approach does not rely on the content of the learning objects or on the relations between users and learning objects, but on usage-based relations between the objects themselves. The technique we apply for calculating higher order co-occurrences to create semantically homogeneous clusters of data objects is taken from corpus-driven lexicology, where it is used to cluster words. We expect the members of a higher order co-occurrence class to be similar with respect to their content and present evaluations of this assumption using two teaching and learning systems.
"1. INTRODUCTION. In this paper we present a new way to cluster learning objects into semantically homogenous groups without considering their content or additional semantic metadata. We do so by just taking the usage of the learning objects, i.e. the interaction of the learners with the objects into account. For this purpose, we borrow the technique of calculating higher order co-occurrences from corpus linguistics where it is used to cluster semantically similar words and apply it to the usage of learning objects. In linguistics, first order co-occurrences of a word can be calculated by taking the context of that word into account, e.g. the sentences in which it occurs. Second order co-occurrences can be calculated by taking the co-occurrences of the first order co-occurrences into account and so forth. For example: drink and beer are co-occurrences, as well as drink and wine. Therefore, beer and wine are co-occurrences in the first order co-occurrence class of drink, this means beer and wine are second order co-occurrences. Higher order co-occurrence classes of words tend to be semantically homogenous; this is to say they are similar according to specific attributes, e.g. their direct hypernym. While in linguistics the basic unit is a word that is used in sentences, we transfer this approach to usage data and consider learning objects as basic units that are used in sessions. Therefore, the context of a learning object consists of all sessions the object was accessed in and a session consists of all objects that were accessed in that session. Thus, if two learning objects were used in the same session, they are said to be co-occurrences. In this paper we address the question whether higher order co-occurrence classes of learning objects become semantically homogenous, like the analogue classes of words. This is a non-trivial question; it is by no means necessary that the semantic convergence of words in higher order co-occurrence classes can also be observed for entire learning objects. However, the possibility to derive semantic similarity from usage is highly promising for diverse applications of information retrieval in learning settings and thus the question is worth solving. We evaluate the potential of this approach using the usage data collected in the systems MACE [1] and Travel Well [2]. The rest of the paper is structured as follows: In section 2 we briefly report on the related work in the area of clustering semantically similar data objects such as documents and pictures. In section 3 we introduce higher order co-occurrences in corpus linguistics and describe in section 4 how we adapt this idea to calculate higher order co-occurrences of learning objects. We then present the corresponding investigations using the MACE and the Travel Well dataset as test-beds in order to illustrate our ideas in section 5 and give a summary and an outlook in section 6. 2. RELATED WORK. Traditionally, representations describing the content of a data object are needed to find semantically similar objects. There exist several approaches to create such profiles and append semantic features to different types of data objects. Some approaches are based on the manual creation of such data, others on the automatic extraction of semantic features. A common approach to automatically extract semantic features from text documents is based on the idea that the content of a text can be represented by a list of characteristic keywords. 
However, many learning object collections do not only contain text documents but also images, videos or audio files, so additional automatic content extraction methods for different media types are required. Pictures, for instance, can be analyzed using content-based image retrieval (CBIR) methods [4], which take into account the actual content of an image represented by features such as color, shape and texture. This information is extracted using automatic image processing algorithms. Since the extracted information is on a very low level, the results mostly contain only limited semantic information that does not match the user's search queries. Given an image of a specific person, CBIR methods can detect that a person is shown in the picture but can usually not identify who it is. With images showing complex scenes, the identification and extraction of semantic features often fails [5].

An alternative to the automatic extraction of semantic features is manually created information. Traditional environments like libraries use the expertise of librarians or archivists to create metadata about their resources. These metadata typically contain information such as title, author and further classifications. Since this procedure requires the manual creation of metadata for all resources, it is time-consuming and requires a lot of expertise; the maintenance of the data is problematic as well.

One way to avoid such problems is the use of social metadata such as ratings, tags or comments about learning objects, which are created by a community. These data can be used to create different views on the resources, e.g. by filtering out content that is tagged with a specific keyword or by only displaying content that is frequently used or highly rated. Tags in particular provide an effective way to represent user interests and help the user to find documents about a specific topic [6]. Thus, this approach can effectively be used for media types like images, videos or audio files, for which it is still difficult to automatically extract appropriate semantic features [7]. A disadvantage of such social metadata is that it has to be added by a community and thus often contains ambiguous or synonymous tags, which makes comparisons difficult. Further, it is not assured that each tag is assigned correctly, which can lead to wrong results.

Therefore, it seems preferable to find a way to cluster semantically similar objects without considering their content but only their usage. Collaborative filtering approaches [8] take the relations between users and objects into account using implicit and explicit feedback, e.g. whether a user bought a product or listened to a song. However, collaborative filtering approaches are not designed to cluster semantically similar data objects; instead, they try to overcome semantic niches.
Our approach does not rely on the relations between users and objects but on the relations between the objects themselves, i.e. whether they are used in similar contexts. First approaches for clustering objects based on their interrelations were mostly conducted in the area of web mining. Rongfei et al. [9] create object vectors containing the most significant co-occurrences of the respective objects; a DBSCAN algorithm is then used to cluster the objects based on these vectors. Smith and Ng [10] follow a related approach. They use the sessions an object occurred in to describe it. First, the sessions are clustered into transaction groups. An object (which is represented by a URL) is then characterized in terms of the transaction groups, e.g. a URL was called in seven sessions that belong to transaction group A and in five sessions that belong to group B. The object vectors are then used as input for a self-organizing map that clusters them. In this paper we present an approach where not only the first order co-occurrences of an object are taken into account but higher order co-occurrences as well, in order to cluster semantically related objects.

3. BACKGROUND OF HIGHER ORDER CO-OCCURRENCE CLUSTERING

3.1 Co-occurrences in Corpus Linguistics

If you want to know the meaning of a word, it is a good strategy to look it up in a dictionary, where in some cases you might find a helpful definition. In most cases, however, a definition (if there is one at all) might not be sufficient to correctly understand the word's meaning, as the word's context is often needed for clarification. The words strong and powerful, for example, have highly related meanings. However, we can say strong tea while we cannot say powerful tea; powerful drug, though, is acceptable [11]. Definitions of a word's meanings will most probably not cover such differences. Therefore, dictionaries usually give contexts in which a word typically occurs to illustrate the actual word usage. Context is considered to be significant for the meaning of a word: Firth [12] says "you shall know a word by the company it keeps". (See [13] for an overview of the linguistic tradition related to Firth.) The company a word keeps, i.e. its co-occurring words, contributes to its meaning.

Two words might co-occur just by accident. However, the co-occurrence might also be relatively frequent and thus statistically significant. Statistically significant co-occurrences reveal close relationships between the co-occurring words or their meanings, respectively: they are used to detect multi-word expressions (New York), idioms (kick the bucket [14]) or constructions with a milder idiomatic character (international best practice [11]). When calculating the co-occurrences of a word, different definitions of how near two words must be to co-occur can be used, depending on the purpose of the analysis: it is possible to consider only the direct neighbors of a word, as in the examples above, to use a static window of words, or to consider the whole sentence.

The significant co-occurrences form the co-occurrence class of the respective word. For example, the co-occurrence class of the word dog contains, among others, the words bark, growl and sniff. It can then be examined which words significantly co-occur within co-occurrence classes. These words again form another co-occurrence class, namely a higher order co-occurrence class. For instance, feed and dog are co-occurrences, as are feed and cat; therefore, dog and cat are second order co-occurrences. The sketch below illustrates this construction.
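As a minimal sketch of the idea (with invented toy contexts and without the significance filtering that section 3.2 adds):

```python
from collections import defaultdict

# Hypothetical sentence contexts, reduced to content words for brevity.
sentences = [{"feed", "dog"}, {"feed", "cat"}, {"dog", "bark"}, {"cat", "purr"}]

def cooc_classes(contexts):
    """Map every word to the set of words it co-occurs with."""
    classes = defaultdict(set)
    for ctx in contexts:
        for word in ctx:
            classes[word] |= ctx - {word}
    return classes

first = cooc_classes(sentences)            # first["feed"] == {"dog", "cat"}
# The first order classes themselves serve as contexts for the next order:
second = cooc_classes(list(first.values()))
print(second["dog"])                       # {"cat"}: a second order co-occurrence
```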
After some iterations, the elements in the higher order co-occurrence classes become stable and semantically homogeneous. Heyer et al. [15] show this for the co-occurrences of IBM, among other words. Their investigations are based on the text corpora collected for the portal wortschatz.uni-leipzig.de (Wortschatz is German for "treasury of words"). The first co-occurrence class is rather heterogeneous, containing words like computer manufacturer, stock exchange, global and so on. After some iterations of computing higher order co-occurrence classes, however, the classes become more homogeneous and stable; the co-occurrence class of tenth order only contains names of other computer-related companies like Microsoft, Sony etc.

3.2 Significance of Co-occurrences

3.2.1 Calculation of Significance Values

When calculating the significance of a co-occurrence, its frequency alone is not a sufficient measure; the marginal frequencies of the individual words also have to be taken into account. For example, the bigram is to is one of the most frequent word pairs in the Brown corpus, with 260 occurrences. However, the word is occurs about 10,000 times and the word to about 26,000 times in the corpus, which contains altogether one million words. Therefore, even if words were sequenced randomly we would expect these two to co-occur about 260 times, and is to is thus not considered a significant co-occurrence [14]. We assume that the same holds true for the distribution of objects in sessions and apply the same measures as in corpus linguistics to calculate significant co-occurrences of learning objects.

There are several measures that can be used to calculate how strongly words or objects are attracted to each other. These measures can be divided into measures of effect size, which calculate how much the observed co-occurrence frequency exceeds the expected frequency (e.g. MI, Dice, odds ratio), and measures of significance, which measure how unlikely the null hypothesis is that the words or objects are independent (e.g. z-score, t-score, simple-ll, chi-squared, log-likelihood). For more details see [16], where more than 30 different measures are discussed.

In the following, we present a measure of significance that is based on the Poisson distribution. We choose this measure as it has already been successfully applied in corpus linguistics for calculating higher order co-occurrences of words, e.g. by Heyer et al. [15]. Bordag's comparison of the performance of different significance measures for co-occurrences, namely the Dice coefficient, Mutual Information, Lexicographer's Mutual Information, t-score, z-score, two log-likelihood based measures and two Poisson distribution based measures, supports this choice [17]. Furthermore, a formal proof justifying the assumption of a Poisson distribution for co-occurring words in a corpus, provided the frequency of most words is much smaller than the corpus size, is given in [18]. Thus, following Heyer et al. we define the significance for the learning objects A and B based on the Poisson distribution as:

$$\operatorname{sig}(A,B) = \lambda - k \cdot \log \lambda + \log k! \qquad \text{with} \quad \lambda = \frac{a \cdot b}{n} \qquad (1)$$

where
a = frequency of contexts containing A,
b = frequency of contexts containing B, and
n = frequency of all contexts.

Under the assumptions $\lambda > 2.5$ and $k > 10$, with k being the frequency of all contexts containing A and B together, formula (1) can be simplified to:

$$\operatorname{sig}(A,B) \approx k \cdot (\log k - \log \lambda - 1) \qquad (2)$$
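A minimal Python sketch of this measure, assuming the reconstruction of formulas (1) and (2) above (the function name and the branch that switches between the exact and the simplified form are our own illustration):

```python
import math

def poisson_sig(a: int, b: int, n: int, k: int) -> float:
    """Poisson-based co-occurrence significance.
    a, b: number of contexts containing A (resp. B),
    n: total number of contexts, k: number of joint contexts."""
    lam = a * b / n              # expected joint frequency under independence
    if k > 10 and lam > 2.5:     # assumptions of the simplified formula (2)
        return k * (math.log(k) - math.log(lam) - 1)
    # exact formula (1); math.lgamma(k + 1) equals log(k!)
    return lam - k * math.log(lam) + math.lgamma(k + 1)
```

The higher the value, the less plausible it is that A and B appear together by chance.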
3.2.2 Detection of a Suitable Threshold

There are various ways of deciding whether a co-occurrence is significant or coincidental, e.g. by ranking or by using a threshold. Ranking means that the co-occurrences are sorted by their significance values and only the best n co-occurrences are selected. When using a threshold, only co-occurrences with a significance value higher than the threshold are selected. However, there is no standard scale of measurement that draws a clear distinction between significant and non-significant co-occurrences [14]. Therefore, determining a suitable n or a suitable threshold (depending on the approach) is an exploratory investigation.

4. HIGHER ORDER CO-OCCURRENCE CLUSTERS

Sessions are used as input for the calculation of higher order co-occurrences. If timestamps are available, a session is made up of all learning objects accessed by a user without a break of more than an hour between two accesses; after a break of more than an hour, a new session starts. Currently, there are no further lower or upper limits on the size of a session. If only a date is stored for an action, a session comprises all activities of a user on one day. Duplicate learning objects in the sessions are deleted: consider, for example, a session in which object A was accessed once and object B twice. Without deletion, we would count two co-occurrences between A and B that arose from a single context, which would distort the further calculations.

After the sessions are created, they are taken as input to generate the significant first order co-occurrences, calculated with the significance measure given in section 3.2.1. All significant co-occurrences of a learning object together form its first order co-occurrence class. The first order co-occurrence classes of all learning objects serve as input for the calculation of the second order co-occurrence classes, which in turn form the input for the calculation of the significant third order co-occurrence classes, and so forth. To clarify this with an example, consider three sessions S1, S2 and S3 as input; the sketch below runs the complete procedure on such sessions.
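The following sketch implements the complete pipeline on a hypothetical access log (the user ids, object ids, threshold value and the session contents S1 = {A, B}, S2 = {A, C}, S3 = {B, C} are invented for illustration):

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical access log: (user id, timestamp in seconds, learning object id).
LOG = [
    ("u1", 0,    "A"), ("u1", 600,  "B"), ("u1", 1200, "B"),  # -> S1 = {A, B}
    ("u1", 9000, "A"), ("u1", 9600, "C"),                     # -> S2 = {A, C}
    ("u2", 100,  "B"), ("u2", 700,  "C"),                     # -> S3 = {B, C}
]
GAP = 3600  # a break of more than an hour starts a new session

def sessionize(log):
    """Split each user's accesses into sessions at gaps of more than an
    hour; duplicates within a session vanish because sessions are sets."""
    by_user = defaultdict(list)
    for user, ts, obj in sorted(log):
        by_user[user].append((ts, obj))
    sessions = []
    for accesses in by_user.values():
        current, last_ts = [], None
        for ts, obj in accesses:
            if last_ts is not None and ts - last_ts > GAP:
                sessions.append(set(current))
                current = []
            current.append(obj)
            last_ts = ts
        sessions.append(set(current))
    return sessions

def sig(a, b, n, k):
    """Poisson-based significance, formulas (1) and (2)."""
    lam = a * b / n
    if k > 10 and lam > 2.5:
        return k * (math.log(k) - math.log(lam) - 1)
    return lam - k * math.log(lam) + math.lgamma(k + 1)

def significant_classes(contexts, threshold):
    """Significant co-occurrence class of every object, where a context
    is a set of objects; the threshold is exploratory (section 3.2.2)."""
    n = len(contexts)
    freq, joint = defaultdict(int), defaultdict(int)
    for ctx in contexts:
        for obj in ctx:
            freq[obj] += 1
        for pair in combinations(sorted(ctx), 2):
            joint[pair] += 1
    classes = defaultdict(set)
    for (x, y), k in joint.items():
        if sig(freq[x], freq[y], n, k) > threshold:
            classes[x].add(y)
            classes[y].add(x)
    return classes

# First order: the sessions themselves are the contexts.
classes = significant_classes(sessionize(LOG), threshold=0.1)
# Higher orders: the classes of order i are the contexts of order i + 1.
for _ in range(2):
    classes = significant_classes(list(classes.values()), threshold=0.1)
```

Once the classes stop changing between iterations, they can be read off as the usage-based clusters of learning objects.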