Core Lucene based

This page describes the Lucene-based version of the DBpedia Spotlight core.

There are three "sort of" indices:

1- spotter dictionary "index": holds a collection of name variations that we "mark" in text when we see them. In practice this is implemented as, for example, a prefix tree. We also have a few other implementations; see our Spotter Evaluations from late 2011.

2- candidate mapping index: holds a mapping from name variation (aka surface form) to URI, so that we know the possible meanings of each name variation. This has been implemented in two ways.

2.a) You can have a "CandidateIndex" based on Lucene's RAMDirectory, so that candidate mapping is very fast and near-matches are possible through fuzzy queries. Other in-memory solutions, such as hashmaps, would also work here, but we never got around to implementing them.

2.b) The second option is to store the surface forms in another field of the "ContextIndex", so that candidate mapping and disambiguation happen in one step. This is what we use for ICF.

3- disambiguation context index: also known as ContextIndex, it contains three main fields: URI, SURFACE_FORM and CONTEXT. We've also experimented with having URI_COUNT there.

Feel free to look inside our "dist" directory for a small example of index 1. As you probably know, you can use Luke to visualize the Lucene index.
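
To make the division of labour between the three indices more concrete, here is a minimal, self-contained Python sketch. It is not the Spotlight code and it does not use Lucene: the function names, the sample records and the scoring are all assumptions made for illustration. The spotter is a toy token-level prefix tree, the candidate mapping is a plain hashmap (the in-memory alternative mentioned in 2.a), and the ranking is simple token overlap with the CONTEXT field, used here as a stand-in for whatever similarity the real ContextIndex computes.

```python
from collections import Counter, defaultdict

END = None  # trie key marking a complete surface form (tokens are never None)


def build_spotter(surface_forms):
    """Index 1 (sketch): token-level prefix tree over known name variations."""
    root = {}
    for form in surface_forms:
        node = root
        for token in form.lower().split():
            node = node.setdefault(token, {})
        node[END] = form
    return root


def spot(root, text):
    """Mark the longest known surface form starting at each token position."""
    tokens, spots, i = text.lower().split(), [], 0
    while i < len(tokens):
        node, match, j = root, None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match = (node[END], j - i)  # remember the longest match so far
        if match:
            spots.append(match[0])
            i += match[1]  # skip past the matched tokens
        else:
            i += 1
    return spots


def build_candidates(pairs):
    """Index 2 (sketch): surface form -> set of candidate URIs, as a hashmap."""
    candidates = defaultdict(set)
    for surface_form, uri in pairs:
        candidates[surface_form.lower()].add(uri)
    return candidates


def build_context_index(records):
    """Index 3 (sketch): one record per URI with URI, SURFACE_FORM, CONTEXT."""
    return [
        {"URI": uri, "SURFACE_FORM": sf, "CONTEXT": Counter(ctx.lower().split())}
        for uri, sf, ctx in records
    ]


def disambiguate(surface_form, paragraph, candidates, context_index):
    """Rank candidate URIs by token overlap between paragraph and CONTEXT."""
    query = set(paragraph.lower().split())
    allowed = candidates.get(surface_form.lower(), set())
    scored = [
        (sum(doc["CONTEXT"][t] for t in query), doc["URI"])
        for doc in context_index
        if doc["URI"] in allowed
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [uri for _, uri in scored]


# Hypothetical sample data: (URI, surface form, context words).
RECORDS = [
    ("http://dbpedia.org/resource/George_Washington", "Washington",
     "first president of the united states general army"),
    ("http://dbpedia.org/resource/Washington,_D.C.", "Washington",
     "capital city of the united states federal district"),
]

spotter = build_spotter(sf for _, sf, _ in RECORDS)
candidates = build_candidates((sf, uri) for uri, sf, _ in RECORDS)
context_index = build_context_index(RECORDS)

text = "Washington led the army as president"
for surface_form in spot(spotter, text):
    print(surface_form, disambiguate(surface_form, text, candidates, context_index))
```

In this toy run the spotter marks "Washington", the candidate map returns both URIs, and the George Washington record ranks first because its context shares more tokens with the input text. The sketch keeps candidate mapping and disambiguation as separate steps, as in option 2.a; option 2.b would instead filter on the SURFACE_FORM field of the context records directly. Whitespace tokenisation is a simplification, so punctuation attached to a word would prevent a match.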
