where v_ij is the jth component of vector v_i, normalized by the length of the tf-idf vector. As a result,
all ESA vectors have a length of 1. Similarity between terms is computed as the vector cosine between
term vectors.
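As a sketch of this computation: because ESA vectors are normalized to unit length, the cosine between two term vectors reduces to a dot product over their shared concept dimensions. The concept names and tf-idf weights below are hypothetical.

```python
import math

def normalize(vec):
    """Scale a tf-idf concept vector (dict: concept -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {c: w / norm for c, w in vec.items()}

def cosine(u, v):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return sum(w * v.get(c, 0.0) for c, w in u.items())

# hypothetical tf-idf concept vectors for two terms
cat = normalize({"Felidae": 3.0, "Pet": 1.0})
dog = normalize({"Canidae": 3.0, "Pet": 1.0})
print(round(cosine(cat, dog), 3))  # → 0.1 (overlap only on the "Pet" concept)
```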
Because Wikipedia is a rather large corpus with many unique terms, ESA approaches have used
pruning heuristics to reduce the space to a more manageable size [1],[2]. Pruning heuristics include
removing articles with fewer than 100 words, removing articles with fewer than five total links (inlinks
and outlinks), removing high frequency words (stop words), removing rare words, and transforming the
remaining words into their approximate uninflected root form (a process called stemming).
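The pruning heuristics above can be sketched as simple filter predicates. The word-count and link-count thresholds come from the text; the document-frequency cutoff `min_df` and the stop-word list are hypothetical parameters.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}  # abbreviated stop list (illustrative)

def keep_article(word_count, link_count):
    """Drop articles with fewer than 100 words or fewer than five total links."""
    return word_count >= 100 and link_count >= 5

def keep_word(word, doc_freq, min_df=3):
    """Drop stop words and rare words (min_df is a hypothetical cutoff)."""
    return word not in STOP_WORDS and doc_freq >= min_df
```

Stemming (reducing words to an approximate uninflected root) would typically be handled by an off-the-shelf stemmer such as the Porter stemmer rather than written by hand.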
2.3. Wikipedia Link Measure
Previous work has used the link structure of Wikipedia to derive a semantic similarity measure using
the Wikipedia Miner toolkit[3]. The basic premise of this approach is that every Wikipedia page
has pages that link to it (inlinks) as well as pages it links to (outlinks). The links are inline with the text,
meaning that a word or phrase in the text has been hyperlinked. The words/phrases themselves are called
anchors, and they can be viewed as synonyms for the target page, e.g., motor car, car, and automobile
link to the Wikipedia page Automobile. The Wikipedia Link Measure (WLM) uses anchors to map input
text to Wikipedia articles and then uses the inlinks and outlinks of these articles to derive the similarity
between words[4]. Because the meaning representation of each word is defined by the links between its
associated concept and other concepts, WLM matches our definition of a concept-concept level model.
Consider two articles, Football and Sport. For a particular link type, say inlinks, and these two
articles, we can place all other articles in Wikipedia into one of four categories as shown in Table 2.
The frequencies in Table 2 are hypothetical, but they serve to illustrate a common structure. First, the
number of links shared by two articles is likely to be small relative to the number of links each possesses
individually. Second, the number of links that neither article has is likely to be very large and
relatively close to the total number of articles.
Table 2. Hypothetical inlink overlap for Football and Sport.

|          |          | Sport    |           |
|          |          | Has Link | No Link   |
| Football | Has Link | 4        | 3450      |
|          | No Link  | 563      | 3 million |
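The four-way categorization behind Table 2 can be sketched with set operations over inlink sets; the inlink sets and article count below are tiny hypothetical values, not real Wikipedia data.

```python
def inlink_contingency(links_a, links_b, n_articles):
    """Place every article into one of four cells: links to both A and B,
    links to A only, links to B only, or links to neither (cf. Table 2)."""
    both = len(links_a & links_b)
    only_a = len(links_a - links_b)
    only_b = len(links_b - links_a)
    neither = n_articles - both - only_a - only_b
    return both, only_a, only_b, neither

# tiny hypothetical inlink sets (article IDs) for Football and Sport
football = {1, 2, 3, 4, 5}
sport = {4, 5, 6}
print(inlink_contingency(football, sport, 20))  # → (2, 3, 1, 14)
```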
Intuitively, two articles that share inlinks and outlinks are likely to be similar; however, links that are
common across many articles should be weighted less. This intuition is
captured in the WLM outlink metric, which weights each outlink o by log(|A|/|O|), the log of the
number of total articles in Wikipedia |A| divided by the number of articles that link to that page |O|.
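A minimal sketch of this weighting applied to a cosine over outlink vectors, assuming a hypothetical `inlink_count` mapping from each target page to the number of articles that link to it:

```python
import math

def wlm_outlink_similarity(out_a, out_b, inlink_count, n_articles):
    """Cosine over outlink vectors built on the union of the two articles'
    outlinks, each link o weighted by log(|A| / |O|)."""
    union = out_a | out_b
    def weight(o):
        return math.log(n_articles / inlink_count[o])
    va = [weight(o) if o in out_a else 0.0 for o in union]
    vb = [weight(o) if o in out_b else 0.0 for o in union]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical outlink sets and inlink counts for two articles
football_out = {"Ball", "Team"}
sport_out = {"Ball", "League"}
counts = {"Ball": 10, "Team": 100, "League": 100}
print(round(wlm_outlink_similarity(football_out, sport_out, counts, 1000), 3))  # → 0.8
```

Note how the shared link to the rarer target ("Ball", linked from only 10 of 1000 articles) dominates the score, while links to common targets contribute little.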
Weighted outlink vectors are constructed based on the union of outlinks for two articles, and the outlink
- ↑ Gabrilovich, E.; Markovitch, S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 1606–1611.
- ↑ Gabrilovich, E.; Markovitch, S. Wikipedia-based semantic interpretation for natural language processing. J. Artif. Int. Res. 2009, 34, 443–498.
- ↑ Milne, D.; Witten, I.H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, Chicago, IL, USA, 13–14 July 2008; AAAI Press: Chicago, IL, USA, 2008; pp. 25–30.
- ↑ Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. ACM Trans. Inf. Syst. 2002, 20, 116–131.