# ESA and LSA Disambiguation
This page describes the implementations of Explicit Semantic Analysis and Latent Semantic Analysis for DBpedia-Spotlight. Both of these techniques try to capture latent (non-explicit) associations between terms and documents in a corpus, albeit in quite different ways. Both techniques are also useful in eliminating noise from data and compressing large corpora.
## Explicit Semantic Analysis

The basic intuition behind Explicit Semantic Analysis is that Wikipedia pages represent concepts, and a document can be represented as a vector of Wikipedia concepts with weights indicating their importance or relatedness to the document. In the ESA implementation for DBpedia-Spotlight, we replace Wikipedia pages with DBpedia resources, but instead of using the text on each Wikipedia page, we use the aggregated context obtained by taking the context around each occurrence of the resource in Wikipedia.
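To make this concrete, here is a toy illustration in Scala (the resource names and weights are invented for the example, and the types are assumptions rather than Spotlight's actual classes):

```scala
// A document is represented as a sparse vector over DBpedia resources,
// weighted by how related each resource is to the document.
type Resource = String
type ConceptVector = Map[Resource, Double]

// A short text about German politics might come out as something like:
val doc: ConceptVector = Map(
  "Germany"   -> 0.81,
  "Bundestag" -> 0.64,
  "Berlin"    -> 0.22
)
```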
The tf-idf index that is the starting point for ESA is created by this Pig script.
This disambiguation method requires two special indexes:
- an Inverted Index (token --> resources) which stores each resource where the token (key) occurs, along with its tf-idf weight for that resource
- a Vector store (resource --> resources) which stores a vector for each resource, where the weights are obtained by taking the centroid of all of the token-->resources vectors from the context of that resource (provided by the inverted index).
If this is confusing, just think of a matrix with documents as rows and resources as columns, instead of tokens as columns as in a classic tf-idf matrix.
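A minimal sketch of how the vector store can be derived from the inverted index (the names and types are illustrative, not Spotlight's actual classes):

```scala
type Resource = String
type SparseVector = Map[Resource, Double]

// A resource's vector is the centroid of the token-->resources vectors
// of the tokens in its aggregated context, looked up in the inverted index.
def resourceVector(contextTokens: Seq[String],
                   invertedIndex: Map[String, SparseVector]): SparseVector = {
  val summed = contextTokens
    .flatMap(t => invertedIndex.getOrElse(t, Map.empty[Resource, Double]))
    .groupBy(_._1)
    .map { case (res, pairs) => res -> pairs.map(_._2).sum }
  // divide by the number of context tokens to turn the sum into a centroid
  summed.map { case (res, w) => res -> w / contextTokens.size }
}
```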
In practice, the EsaVectorStore quickly grows very large if no filtering is applied, so the indexes had to be filtered at several points to keep them at a manageable size. More on this in the full report.
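The simplest form such filtering can take is pruning each sparse vector to its k strongest entries; a sketch (the actual filtering points are described in the report):

```scala
// Keep only the k highest-weighted resources in a sparse vector.
def topK(vec: Map[String, Double], k: Int): Map[String, Double] =
  vec.toSeq.sortBy(-_._2).take(k).toMap
```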
### How to run the EsaIndexing example
First, download these directories. Decompress all compressed files and place both directories in the same location, maintaining the internal directory structure so that you don't have to change any paths in the code.
Clone this fork of dbpedia-spotlight-db:
```
git clone git@github.com:chrishokamp/dbpedia-spotlight-db
cd dbpedia-spotlight-db
git checkout esaAndLsa
mvn install -DskipTests=true
```
Now you can either load the project into your IDE (import it as a Maven project) or run it from the command line. You'll need at least 14 GB of heap (set `-Xmx14g`, e.g. via `MAVEN_OPTS`) for the test to run without paging.
From the command line:

```
cd index
mvn scala:run -DmainClass=org.dbpedia.spotlight.db.IndexEsa "-DaddArgs=/path/to/the/files/ESAandLSA/"
```
Note:

- Indexing is set to run with only the top 15 resources per term and only the top 60 resources per resource, so that the memory requirements stay manageable. This is just a demonstration and does not show the best performance of the system.
- Every 25,000 resources you'll also see a ranked list of the most important dimensions for that resource. This demonstrates another very useful aspect of ESA: the capacity to rank relevant dimensions in a human-readable format.
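That ranked list is essentially the `topK` sketch from above printed with its weights; hypothetical usage:

```scala
// resourceVec: Map[String, Double], e.g. produced by resourceVector above.
// Print the ten most important ESA dimensions, strongest first.
topK(resourceVec, 10).toSeq.sortBy(-_._2).foreach {
  case (res, weight) => println(f"$res%-40s $weight%.4f")
}
```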
## Latent Semantic Analysis (LSA)

LSA uses the Singular Value Decomposition (SVD) to greatly reduce the number of dimensions in a matrix by 'folding' similar dimensions together. However, one problem with pure LSA is that new documents and queries must also be folded into the reduced-dimensional space, which requires storing the V_t matrix of term vectors and multiplying each query vector by this matrix. For large corpora this quickly becomes infeasible, and it cancels out the main benefit of LSA: compressing very large matrices to a more manageable size.
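For reference, the standard folding step looks like this (conventional SVD notation with documents as rows and terms as columns; the orientation in the actual code may differ):

```latex
% Rank-k SVD of the document-term matrix A (documents as rows):
A \approx U_k \Sigma_k V_k^T
% Folding a new document d (a row vector over terms) into the reduced
% space requires the stored term matrix V_k and the singular values:
\hat{d} = d \, V_k \Sigma_k^{-1}
```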
One way around this is to conceptualize each resource as the centroid of its terms. This is very efficient in LSA, because all term vectors have exactly the same (low) dimensionality and are generally dense. After many failed attempts at scaling the vector-folding process up to the full corpus, the centroid approach was implemented as a test. The demo here uses the top ten dimensions from the SVD, but its performance is only slightly better than random.
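A minimal sketch of the centroid idea, assuming each term has a dense k-dimensional vector (a scaled row of V_t); the names and signature are illustrative:

```scala
// A resource's LSA vector is the centroid of the dense vectors of the
// terms occurring in its context. Terms missing from the model are skipped.
def lsaResourceVector(contextTerms: Seq[String],
                      termVectors: Map[String, Array[Double]],
                      k: Int): Array[Double] = {
  val centroid = new Array[Double](k)
  val present = contextTerms.flatMap(termVectors.get)
  for (v <- present; i <- 0 until k) centroid(i) += v(i)
  if (present.nonEmpty)
    for (i <- 0 until k) centroid(i) /= present.size
  centroid
}
```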
Note that the dimensions in V_t are scaled by Σ^(1/2). Mahout's SSVD has a command-line option to output V in this format, although the scaling can also be applied after the SVD.
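One way to see why this scaling is natural (a standard observation, not from the original page) is that splitting Σ evenly between the two factors still reproduces the original matrix:

```latex
A \approx U_k \Sigma_k V_k^T
  = \bigl(U_k \Sigma_k^{1/2}\bigr) \bigl(V_k \Sigma_k^{1/2}\bigr)^T
```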
The instructions for running the LSA demo are the same as above, except that the main class is `IndexLsa`:

```
cd index
mvn scala:run -DmainClass=org.dbpedia.spotlight.db.IndexLsa "-DaddArgs=/path/to/the/files/ESAandLSA/"
```