Page:The World Within Wikipedia: An Ecology of Mind.pdf/4

Information 2012, 3



However, traditional models such as LSA are based solely on language structure, and so they do not model the mutual influence between cognition and language. This is partly because the available environments for such models have been entirely linguistic, e.g., text dumps of books, newspapers, and other abundant sources of text. In contrast, the advance of the Internet has given rise to data sets that are created and organized in novel ways that reflect human conceptual/categorical organization. Wikipedia is the prototypical example of this new breed of cognitive-linguistic environment. It is read and edited daily by millions of users[1]. As an online encyclopedia, Wikipedia is structured around articles, each devoted to a specific concept. Wikipedia's structure is further augmented by hyperlinks between articles and by other kinds of pages, such as category pages, which provide a loose hierarchical structure, and disambiguation pages, which distinguish entries with identical or highly similar names. With Wikipedia as the cognitive-linguistic environment, a computational model that incorporates the mutual influence of conceptual/categorical organization and the structure of language should produce behavior closer to human behavior than a model without such mutual influence.


Several researchers have already used Wikipedia's structure in models that emulate human semantic comparisons[2] [3] [4]. In this paper we extend their work in two significant ways. First, rather than focus on a single type of structure, e.g., link structure or concept structure, we present a model that utilizes three levels of structure, word-word, word-concept, and concept-concept (W3C3), to more fully represent the cognitive-linguistic environment of Wikipedia. As we will show in the following sections, each of these levels independently contributes to an explanation of human semantic behavior. Second, in addition to the WordSimilarity-353[5] dataset commonly used by previous researchers working with Wikipedia, we apply the W3C3 model to a wider array of behavioral data, including word association norms[6], semantic feature production norms[7], and false memory formation[8]. Studies 1 to 4 examine how the W3C3 model manifests language structure and categorization effects across this wide array of behavioral data. Our analysis suggests that, at multiple levels of structure, Wikipedia reflects the aspects of meaning that drive semantic associations. More specifically, meaning is reflected in the structure of language, the organization of concepts/categories, and the linkages between them. Our results inform the internalist/externalist debate by showing just how much of the internal cognitive-linguistic structure used in these tasks is preserved externally in Wikipedia.


2. Semantic Models


In the following sections we present three approaches that, when applied to Wikipedia, extract models of semantic association at three different levels. The first model, the Correlated Occurrence Analogue to Lexical Semantics[9], operates at a word-word level. The second model, Explicit Semantic Analysis[10] [11], operates at a word-concept level. The third and final model, Wikipedia Link Measure[12], operates at a concept-concept level. We then describe a joint model (W3C3) that trivially combines these three models.
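The page does not say how the joint model combines the three levels. As one simple reading of "trivially combines", the sketch below averages three relatedness scores for a word pair. The function name `w3c3_score`, the equal weighting, and the example scores are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean


def w3c3_score(word_word: float, word_concept: float, concept_concept: float) -> float:
    """Combine three relatedness scores by unweighted averaging.

    Equal weighting is an assumption for illustration; the source text
    only says that the joint model "trivially combines" the three models.
    """
    return mean([word_word, word_concept, concept_concept])


# Hypothetical scores for a single word pair, invented for this example.
scores = {
    "word_word": 0.42,        # e.g., from a COALS-style model
    "word_concept": 0.30,     # e.g., from an ESA-style model
    "concept_concept": 0.60,  # e.g., from a WLM-style model
}

joint = w3c3_score(
    scores["word_word"],
    scores["word_concept"],
    scores["concept_concept"],
)
print(round(joint, 2))
```

An unweighted mean treats each level as equally informative; a weighted combination would need weights estimated from behavioral data, which the text here does not describe.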


2.1. Correlated Occurrence Analogue to Lexical Semantics


The Correlated Occurrence Analogue to Lexical Semantics (COALS) model implements a sliding window strategy to build a word-by-word matrix of normalized co-occurrences[13]. Because the

  1. Wikipedia. Wikipedia: Statistics. 2007. Available online: http://en.wikipedia.org/wiki/ (accessed on 8 February 2011).
  2. Gabrilovich, E.; Markovitch, S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 1606–1611.
  3. Gabrilovich, E.; Markovitch, S. Wikipedia-based semantic interpretation for natural language processing. J. Artif. Int. Res. 2009, 34, 443–498.
  4. Milne, D.; Witten, I.H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, Chicago, IL, USA, 13–14 July 2008; AAAI Press: Chicago, IL, USA, 2008; pp. 25–30.
  5. Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. ACM Trans. Inf. Syst. 2002, 20, 116–131.
  6. Nelson, D.L.; McEvoy, C.L.; Schreiber, T.A. The University of South Florida word association, rhyme, and word fragment norms, 1998. Available online: http://www.usf.edu/FreeAssociation/ (accessed on 12 June 2011).
  7. McRae, K.; Cree, G.S.; Seidenberg, M.S.; McNorgan, C. Semantic feature production norms for a large set of living and nonliving things. Behav. Res. Methods 2005, 37, 547–559; PMID: 16629288.
  8. Roediger, H.L.; Watson, J.M.; McDermott, K.B.; Gallo, D.A. Factors that determine false recall: A multiple regression analysis. Psychon. Bull. Rev. 2001, 8, 385–407.
  9. Rohde, D.; Gonnerman, L.; Plaut, D. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Unpublished manuscript, 2005.
  10. Gabrilovich, E.; Markovitch, S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 1606–1611.
  11. Gabrilovich, E.; Markovitch, S. Wikipedia-based semantic interpretation for natural language processing. J. Artif. Int. Res. 2009, 34, 443–498.
  12. Milne, D.; Witten, I.H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, Chicago, IL, USA, 13–14 July 2008; AAAI Press: Chicago, IL, USA, 2008; pp. 25–30.
  13. Rohde, D.; Gonnerman, L.; Plaut, D. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Unpublished manuscript, 2005.