Workflow
This page describes the workflows for DBpedia Spotlight across two major stages: training/indexing time (building the service) and execution time (running the service online). It gives a rough description of how things work now and a roadmap for GSoC 2012.
Generic flow
- DBpedia Extraction: create resource URIs, extract properties from infoboxes, extract categories, extract redirects and disambiguations, and compute the transitive closure of redirects
- Context Extraction: extract occurrences (paragraphs with wikilinks), articles and definitions.
- CandidateMap Extraction: built from titles, redirects, disambiguation pages and anchor links
- Computing Statistics: counts of URIs, surface forms, tokens and topics, and their co-occurrences (a counting sketch follows this list)
- Storage: load statistics into the chosen data storage
- Training Spotter: based on the statistics above, train an algorithm that selects the substrings of incoming text that should be disambiguated
- Training Disambiguator: based on the statistics above, train an algorithm that chooses the most likely URI based on surface form, context and topic
- Training Linker: based on the statistics above, and a test run, train a linker to detect NILs and to adjust to different annotation styles
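To make the Computing Statistics step concrete, here is a minimal counting sketch in Scala. It is not project code: the occurrence fields, the example data and the maximum-likelihood estimate of P(uri | sf) are only meant to illustrate the kind of counts the later training steps consume.

```scala
// Minimal sketch: counting (surface form, URI) pairs from wikilink occurrences.
// Field names and example data are illustrative, not the project's actual types.
object CandidateCounts extends App {

  // A wikilink occurrence: anchor text (surface form), linked resource, paragraph.
  case class Occurrence(surfaceForm: String, uri: String, context: String)

  val occurrences = Seq(
    Occurrence("Berlin", "Berlin", "Berlin is the capital of Germany ..."),
    Occurrence("Berlin", "Berlin_(band)", "Berlin released the single ..."),
    Occurrence("Berlin", "Berlin", "The Berlin Wall fell in 1989 ...")
  )

  // c(sf, uri): how often each surface form links to each resource.
  val pairCounts: Map[(String, String), Int] =
    occurrences.groupBy(o => (o.surfaceForm, o.uri)).map { case (k, v) => k -> v.size }

  // c(sf): total annotated uses of each surface form.
  val sfCounts: Map[String, Int] =
    occurrences.groupBy(_.surfaceForm).map { case (k, v) => k -> v.size }

  pairCounts.foreach { case ((sf, uri), n) =>
    val pUriGivenSf = n.toDouble / sfCounts(sf) // maximum-likelihood P(uri | sf)
    println(f"c($sf, $uri) = $n, P(uri|sf) = $pUriGivenSf%.2f")
  }
}
```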
Current flow as of v0.5
- DBpedia Extraction: single machine, with the DBpedia Extraction Framework (DEF); batch and streaming.
- categories are not used yet.
- Context Extraction: single machine, with ExtractOccsFromWikipedia. Batch only.
- CandidateMap Extraction: single machine, with ExtractCandidateMap. Batch only.
- Computing Statistics, and...
- ...Storage: both in one go with Lucene, based on IndexMergedOccurrences. Statistics: TF(t, uri) and DF(t, uri). Batch only.
- Training Spotter: use trained OpenNLP components, create a dictionary for lexicon-based spotters (LingPipe, etc.)
- Training Disambiguator: no training (ranking approach, based on Lucene)
- Training Linker: currently based on thresholding (discarding low-scored disambiguations), trained together with EvalDisambiguationOnly; to be separated out for 0.6.
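As a rough illustration of the thresholding linker, here is a minimal sketch. The class names and the threshold value are placeholders, not the values actually tuned with EvalDisambiguationOnly.

```scala
// Minimal sketch of a threshold-based linker: keep disambiguations whose score
// passes a confidence threshold, treat the rest as NIL. Names and the threshold
// value are placeholders.
object ThresholdLinker extends App {

  case class Annotation(surfaceForm: String, uri: String, score: Double)

  // The real threshold would be tuned on held-out data; 0.15 is just an example.
  def link(annotations: Seq[Annotation], threshold: Double = 0.15): Seq[Annotation] =
    annotations.filter(_.score >= threshold) // below the threshold => discard (NIL)

  val annotated = Seq(
    Annotation("Victoria", "Queen_Victoria", 0.42),
    Annotation("Victoria", "Victoria_(Australia)", 0.08)
  )

  link(annotated).foreach(a => println(s"${a.surfaceForm} -> ${a.uri} (score ${a.score})"))
}
```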
New flow for GSoC 2012:
- DBpedia Extraction: same (being improved separately from our projects)
- Flattening category hierarchy (Dirk)
- Context Extraction: map-reduce via Pig/Hadoop (Max/Chris)
- CandidateMap Extraction: map-reduce via Pig/Hadoop (Max/Chris)
- Computing Statistics:
- co-occurrences of (sf,uri,context): map-reduce via Pig/Hadoop (Chris)
- co-occurrences of category x (sf,uri,context,cat): map-reduce via Pig/Hadoop (Dirk)
- co-occurrences of entities: map-reduce via Pig/Hadoop (Hector)
- Storage: JDBM3, Mem (Jo), Lucene (Pablo, Max), other?
- Training Spotter: Train new spotter based on stats computed here
- Training Disambiguator: train disambiguators based on probabilities computed above (Pablo and Jo)
- Training Linker: find good decision functions (Pablo)
- Run spotting
- Run topic classification
- Run candidate selection
- Run disambiguation
- Run linking
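The five run-time steps above could be wired together roughly as follows. The traits, method names and stub implementations are illustrative, not the actual Spotlight interfaces.

```scala
// Minimal sketch of the run-time pipeline: spotting -> topic classification ->
// candidate selection -> disambiguation -> linking. All names are illustrative.
object PipelineSketch {

  case class SurfaceFormOccurrence(surfaceForm: String, text: String, offset: Int)
  case class ResourceOccurrence(surfaceForm: String, uri: String, score: Double)

  trait Spotter           { def spot(text: String): Seq[SurfaceFormOccurrence] }
  trait TopicClassifier   { def classify(text: String): Seq[String] }
  trait CandidateSelector { def candidates(sf: SurfaceFormOccurrence): Seq[String] }
  trait Disambiguator {
    def best(sf: SurfaceFormOccurrence, candidates: Seq[String], topics: Seq[String]): ResourceOccurrence
  }
  trait Linker            { def link(occ: ResourceOccurrence): Option[ResourceOccurrence] }

  def annotate(text: String,
               spotter: Spotter,
               topicClassifier: TopicClassifier,
               selector: CandidateSelector,
               disambiguator: Disambiguator,
               linker: Linker): Seq[ResourceOccurrence] = {
    val topics = topicClassifier.classify(text)          // run topic classification once per text
    spotter.spot(text).flatMap { sf =>                   // run spotting
      val cands = selector.candidates(sf)                // run candidate selection
      val best  = disambiguator.best(sf, cands, topics)  // run disambiguation
      linker.link(best)                                  // run linking (None = NIL, dropped)
    }
  }

  // Tiny demo wiring with stub components (illustrative only).
  def main(args: Array[String]): Unit = {
    val spotter = new Spotter {
      def spot(text: String) = Seq(SurfaceFormOccurrence("Berlin", text, text.indexOf("Berlin")))
    }
    val topics = new TopicClassifier { def classify(text: String) = Seq("geography") }
    val selector = new CandidateSelector {
      def candidates(sf: SurfaceFormOccurrence) = Seq("Berlin", "Berlin_(band)")
    }
    val disambiguator = new Disambiguator {
      def best(sf: SurfaceFormOccurrence, candidates: Seq[String], topics: Seq[String]) =
        ResourceOccurrence(sf.surfaceForm, candidates.head, 0.9)
    }
    val linker = new Linker {
      def link(occ: ResourceOccurrence) = if (occ.score >= 0.1) Some(occ) else None
    }
    println(annotate("Berlin is the capital of Germany", spotter, topics, selector, disambiguator, linker))
  }
}
```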
Dirk works on topic classification:
- setup:
- flattens the Wikipedia category hierarchy to the top 20-30 categories (topics)
- associates every DBpedia URI with the corresponding topics
- extracts one TSV file (or one Lucene index) per topic
- topic classifier training (updateable/streaming)
- execution:
- topic classification: input=text, output=topics (categories)
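A minimal sketch of what an updateable (streaming) topic classifier could look like, assuming a multinomial Naive Bayes over token counts; the actual classifier, its features and its training data may differ.

```scala
// Minimal sketch of an updateable (streaming) topic classifier: a multinomial
// Naive Bayes whose counts are incremented as labelled paragraphs stream in.
// Class priors are omitted for brevity; all names and data are illustrative.
object StreamingTopicClassifier extends App {
  import scala.collection.mutable

  // topic -> token -> count, grown incrementally
  private val tokenCounts = mutable.Map.empty[String, mutable.Map[String, Int]]
  // topic -> total token count
  private val topicTotals = mutable.Map.empty[String, Int].withDefaultValue(0)

  private def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // "Updateable/streaming" training: just increment counts per labelled example.
  def update(topic: String, text: String): Unit = {
    val counts = tokenCounts.getOrElseUpdate(topic, mutable.Map.empty[String, Int].withDefaultValue(0))
    tokenize(text).foreach { t =>
      counts(t) += 1
      topicTotals(topic) += 1
    }
  }

  // Classification: input = text, output = topics ranked by add-one smoothed log-likelihood.
  def classify(text: String): Seq[(String, Double)] = {
    val vocabSize = tokenCounts.values.flatMap(_.keys).toSet.size.max(1)
    tokenCounts.keys.toSeq.map { topic =>
      val counts = tokenCounts(topic)
      val logLik = tokenize(text).map { t =>
        math.log((counts(t) + 1.0) / (topicTotals(topic) + vocabSize))
      }.sum
      topic -> logLik
    }.sortBy(-_._2)
  }

  update("sports", "the team won the football match")
  update("politics", "the parliament passed the new law")
  println(classify("a football team played the final"))
}
```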
Jo works on a DB-backed core supporting the Entity-Mention generative model:
- setup:
- run from a jar -> load needed resources from streams instead of files
- reorganizes configuration
- loading count statistics into the database (counts come in TSV files)
- smoothing counts (e.g. truncate count<5, add 5, smooth at query time)
- execution:
- computing probabilities from smoothed counts and using them for disambiguation
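As a rough sketch of how smoothed counts turn into disambiguation probabilities under an entity-mention generative model, something along these lines could be computed. The count tables, the smoothing constant and the simplified normalizers below are illustrative, not the values loaded from the real TSV files.

```scala
// Minimal sketch of a generative entity-mention score from smoothed counts:
// P(uri | sf, context) is scored as P(uri) * P(sf | uri) * prod_t P(t | uri).
// All counts and constants are illustrative, not real Spotlight data.
object GenerativeDisambiguator extends App {

  val k = 5.0 // pseudo-count for add-k smoothing, in the spirit of "truncate count<5, add 5"

  // c(uri), c(sf, uri), c(token, uri) -- in the real system these come from the DB.
  val uriCounts   = Map("Berlin" -> 1000.0, "Berlin_(band)" -> 50.0)
  val sfUriCounts = Map(("Berlin", "Berlin") -> 800.0, ("Berlin", "Berlin_(band)") -> 40.0)
  val tokenUriCounts = Map(
    ("capital", "Berlin") -> 120.0, ("germany", "Berlin") -> 200.0,
    ("single", "Berlin_(band)") -> 15.0, ("album", "Berlin_(band)") -> 20.0
  )
  val totalUriCount = uriCounts.values.sum

  def pUri(uri: String): Double =
    (uriCounts.getOrElse(uri, 0.0) + k) / (totalUriCount + k * uriCounts.size)

  // The vocabulary-size normalizers (2 and 4) are simplified for this tiny example.
  def pSfGivenUri(sf: String, uri: String): Double =
    (sfUriCounts.getOrElse((sf, uri), 0.0) + k) / (uriCounts.getOrElse(uri, 0.0) + k * 2)

  def pTokenGivenUri(token: String, uri: String): Double =
    (tokenUriCounts.getOrElse((token, uri), 0.0) + k) / (uriCounts.getOrElse(uri, 0.0) + k * 4)

  // Log-space score to avoid underflow on long contexts.
  def score(sf: String, uri: String, contextTokens: Seq[String]): Double =
    math.log(pUri(uri)) + math.log(pSfGivenUri(sf, uri)) +
      contextTokens.map(t => math.log(pTokenGivenUri(t, uri))).sum

  val context = Seq("capital", "germany")
  val best = uriCounts.keys.maxBy(uri => score("Berlin", uri, context))
  println(s"Best candidate for 'Berlin' in $context: $best")
}
```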
Chris works on complementary vector space models of words/URIs:
- setup:
- extract (on Pig) one TSV file per resource type (people, organizations, etc.)
- compute statistics of word-resource and resource-word vectors on Pig/Hadoop
- execution:
- explicit semantic analysis: input=text, output=vector of weighted DBpedia resources
- disambiguation: input=text+surface_form, output=uri (using ESA from above)
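A minimal sketch of the ESA idea, assuming word-to-resource weights are already available. The weight table below is made up (the real weights would come from the Pig-computed word-resource statistics), and disambiguation here simply picks the candidate with the largest weight in the text's resource vector rather than doing a full cosine comparison.

```scala
// Minimal sketch of explicit semantic analysis (ESA) over DBpedia resources:
// each word maps to a weighted vector of resources, a text is the sum of its
// word vectors, and a candidate URI is scored by its weight in that vector.
// The word->resource weights are made up for illustration.
object EsaSketch extends App {

  type ResourceVector = Map[String, Double] // uri -> weight

  val wordToResources: Map[String, ResourceVector] = Map(
    "capital" -> Map("Berlin" -> 0.7, "Paris" -> 0.6),
    "band"    -> Map("Berlin_(band)" -> 0.9),
    "germany" -> Map("Berlin" -> 0.8, "Germany" -> 0.9)
  )

  // ESA: input = text tokens, output = vector of weighted DBpedia resources.
  def esa(tokens: Seq[String]): ResourceVector =
    tokens.flatMap(t => wordToResources.getOrElse(t, Map.empty[String, Double]).toSeq)
      .groupBy(_._1)
      .map { case (uri, weights) => uri -> weights.map(_._2).sum }

  // Disambiguation: input = text + candidate URIs for a surface form, output = uri.
  def disambiguate(tokens: Seq[String], candidates: Seq[String]): String = {
    val vector = esa(tokens)
    candidates.maxBy(uri => vector.getOrElse(uri, 0.0))
  }

  println(disambiguate(Seq("capital", "germany"), Seq("Berlin", "Berlin_(band)")))
}
```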
Hector works on collective disambiguation:
- setup:
- basic entity-entity statistics computed from occs.tsv
- loads resource association statistics
- other association statistics may come from different strategies as discussed above.
- builds (weighted) graphs of interconnections between entities
- execution:
- reweights candidate scores based on other candidates in the same context
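A minimal sketch of the reweighting step, assuming an entity-entity relatedness table (e.g. derived from co-occurrence counts in occs.tsv) and a simple linear mixture of local score and graph coherence; the mixing weight and the example scores are illustrative.

```scala
// Minimal sketch of collective reweighting: a candidate's local score is combined
// with its average relatedness to the candidates of the other surface forms in the
// same context. The relatedness table and the mixing weight alpha are illustrative.
object CollectiveReweighting extends App {

  case class Candidate(surfaceForm: String, uri: String, localScore: Double)

  // Entity-entity association, e.g. derived from co-occurrence counts in occs.tsv.
  val relatedness: Map[(String, String), Double] = Map(
    ("Berlin", "Germany") -> 0.9, ("Germany", "Berlin") -> 0.9,
    ("Berlin_(band)", "Germany") -> 0.1, ("Germany", "Berlin_(band)") -> 0.1
  )

  val alpha = 0.6 // weight of the local (context-based) score vs. the graph evidence

  def reweight(candidates: Seq[Candidate]): Seq[Candidate] =
    candidates.map { c =>
      val others = candidates.filter(_.surfaceForm != c.surfaceForm)
      val coherence =
        if (others.isEmpty) 0.0
        else others.map(o => relatedness.getOrElse((c.uri, o.uri), 0.0)).sum / others.size
      c.copy(localScore = alpha * c.localScore + (1 - alpha) * coherence)
    }

  val inContext = Seq(
    Candidate("Berlin", "Berlin", 0.55),
    Candidate("Berlin", "Berlin_(band)", 0.45),
    Candidate("Germany", "Germany", 0.95)
  )

  reweight(inContext).foreach(c => println(f"${c.surfaceForm} -> ${c.uri} : ${c.localScore}%.2f"))
}
```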