Explicit Semantic Analysis (ESA) using Wikipedia

Semantic Relatedness Measure

The Explicit Semantic Analysis (ESA) method (Gabrilovich and Markovitch, 2007) computes the semantic relatedness (SR) between two arbitrary texts. This Wikipedia-based technique represents terms (or texts) as high-dimensional vectors in which each entry is the TF-IDF weight of the term in one Wikipedia article. The semantic relatedness between two terms (or texts) is the cosine similarity of the corresponding vectors.
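
The idea can be illustrated with a minimal Python sketch. This is not the WikipediaESA library's code or API: the four toy "articles", the function names (build_term_vectors, text_vector, relatedness) and the plain tf * log(N/df) weighting are assumptions made for illustration only, and a real setup would build the vectors from a full Wikipedia dump.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles (hypothetical data, not from a real dump).
ARTICLES = {
    "Cat": "cat feline pet whiskers cat",
    "Dog": "dog canine pet bark dog",
    "Pet": "pet cat dog owner home",
    "Car": "car vehicle engine wheel road",
}


def build_term_vectors(articles):
    """Map each term to a sparse vector {article title: TF-IDF weight}."""
    n_articles = len(articles)
    counts_by_article = {
        title: Counter(text.lower().split()) for title, text in articles.items()
    }
    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter()
    for counts in counts_by_article.values():
        doc_freq.update(counts.keys())

    vectors = {}
    for title, counts in counts_by_article.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_articles / doc_freq[term])
            if idf > 0.0:  # terms occurring in every article carry no signal
                vectors.setdefault(term, {})[title] = tf * idf
    return vectors


def text_vector(text, vectors):
    """Represent a text as the sum of the vectors of its known terms."""
    result = Counter()
    for term in text.lower().split():
        for title, weight in vectors.get(term, {}).items():
            result[title] += weight
    return result


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(title, 0.0) for title, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relatedness(text_a, text_b, vectors):
    """Semantic relatedness of two texts as the cosine of their vectors."""
    return cosine(text_vector(text_a, vectors), text_vector(text_b, vectors))


TERM_VECTORS = build_term_vectors(ARTICLES)

if __name__ == "__main__":
    print(relatedness("cat", "dog", TERM_VECTORS))
    print(relatedness("cat", "car", TERM_VECTORS))
```

In this toy corpus "cat" and "dog" share the "Pet" article and therefore get a non-zero relatedness, while "cat" and "car" share no article and score 0.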

WikipediaESA Demo 

You can try out Wikipedia-based ESA with the WikipediaESA demo web application. The demo lets you compute the semantic relatedness between any of the 88,537 most common German words.

WikipediaESA .NET application/library

You can download the C# source code of my WikipediaESA library (including a console application) below as a tar.gz archive. The program parses Wikipedia XML dumps and builds ESA term vectors from them. You can use the resulting binary term vector files together with the library (.NET assembly) to compute semantic relatedness in your own application. WikipediaESA is released as is under the GPL.

References

  • Gabrilovich, E. and Markovitch, S. (2007). "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis". Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.

Attachments:

  • WikipediaESA.tar.gz – WikipediaESA .NET application/library (C# source code) (93887 Bytes, modified 2008-11-16)
Created by Henning Jacobs