Workflow
This page describes the workflows for DBpedia Spotlight across two major stages: training/indexing time (building the service) and execution time (running the service online). It gives a rough description of how things work now and a roadmap for GSoC 2012.
Generic flow
- DBpedia Extraction: create resource URIs, extract properties from infoboxes, extract categories, extract redirects and disambiguations, and compute the transitive closure of redirects
- Context Extraction: extract occurrences (paragraphs with wikilinks), articles and definitions.
- CandidateMap Extraction: built from titles, redirects, disambiguation pages and anchor links
- Computing Statistics: counts of URIs, surface forms, tokens and topics, and their co-occurrences (a counting sketch follows this list)
- Storage: load statistics into the chosen data storage
- Training Spotter: based on the statistics above, train an algorithm that selects the substrings of incoming text that should be disambiguated
- Training Disambiguator: based on the statistics above, train an algorithm that chooses the most likely URI based on surface form, context and topic
- Training Linker: based on the statistics above, and a test run, train a linker to detect NILs and to adjust to different annotation styles
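To make the Computing Statistics step concrete, here is a minimal counting sketch in Scala. It is not project code: the occurrence fields, the example data and the maximum-likelihood estimate of P(uri | sf) are only meant to illustrate the kind of counts the later training steps consume.

```scala
// Minimal sketch: counting (surface form, URI) pairs from wikilink occurrences.
// Field names and example data are illustrative, not the project's actual types.
object CandidateCounts extends App {

  // A wikilink occurrence: anchor text (surface form), linked resource, paragraph.
  case class Occurrence(surfaceForm: String, uri: String, context: String)

  val occurrences = Seq(
    Occurrence("Berlin", "Berlin", "Berlin is the capital of Germany ..."),
    Occurrence("Berlin", "Berlin_(band)", "Berlin released the single ..."),
    Occurrence("Berlin", "Berlin", "The Berlin Wall fell in 1989 ...")
  )

  // c(sf, uri): how often each surface form links to each resource.
  val pairCounts: Map[(String, String), Int] =
    occurrences.groupBy(o => (o.surfaceForm, o.uri)).map { case (k, v) => k -> v.size }

  // c(sf): total annotated uses of each surface form.
  val sfCounts: Map[String, Int] =
    occurrences.groupBy(_.surfaceForm).map { case (k, v) => k -> v.size }

  pairCounts.foreach { case ((sf, uri), n) =>
    val pUriGivenSf = n.toDouble / sfCounts(sf) // maximum-likelihood P(uri | sf)
    println(f"c($sf, $uri) = $n, P(uri|sf) = $pUriGivenSf%.2f")
  }
}
```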
Current flow as of v0.5
- DBpedia Extraction: single machine, with the DBpedia Extraction Framework (DEF); batch and streaming.
- categories are not used yet.
- Context Extraction: single machine, with ExtractOccsFromWikipedia. Batch only.
- CandidateMap Extraction: single machine, with ExtractCandidateMap. Batch only.
- Computing Statistics, and...
- ...Storage: both in one go with Lucene, based on IndexMergedOccurrences. Statistics: TF(t, uri) and DF(t, uri). Batch only.
- Training Spotter: use trained OpenNLP components, create a dictionary for lexicon-based spotters (LingPipe, etc.)
- Training Disambiguator: no training (ranking approach, based on Lucene)
- Training Linker: currently based on thresholding (discarding low-scored disambiguations), trained together with EvalDisambiguationOnly; to be separated out for 0.6.
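As a rough illustration of the thresholding linker, here is a minimal sketch. The class names and the threshold value are placeholders, not the values actually tuned with EvalDisambiguationOnly.

```scala
// Minimal sketch of a threshold-based linker: keep disambiguations whose score
// passes a confidence threshold, treat the rest as NIL. Names and the threshold
// value are placeholders.
object ThresholdLinker extends App {

  case class Annotation(surfaceForm: String, uri: String, score: Double)

  // The real threshold would be tuned on held-out data; 0.15 is just an example.
  def link(annotations: Seq[Annotation], threshold: Double = 0.15): Seq[Annotation] =
    annotations.filter(_.score >= threshold) // below the threshold => discard (NIL)

  val annotated = Seq(
    Annotation("Victoria", "Queen_Victoria", 0.42),
    Annotation("Victoria", "Victoria_(Australia)", 0.08)
  )

  link(annotated).foreach(a => println(s"${a.surfaceForm} -> ${a.uri} (score ${a.score})"))
}
```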
New flow for GSoC 2012:
- DBpedia Extraction: same (being improved separately from our projects)
- Flattening category hierarchy (Dirk)
- Context Extraction: map-reduce via Pig/Hadoop (Max/Chris)
- CandidateMap Extraction: map-reduce via Pig/Hadoop (Max/Chris)
- Computing Statistics:
- co-occurrences of (sf,uri,context): map-reduce via Pig/Hadoop (Chris)
- co-occurrences of category x (sf,uri,context,cat): map-reduce via Pig/Hadoop (Dirk)
- co-occurrences of entities: map-reduce via Pig/Hadoop (Hector)
- Storage: JDBM3, Mem (Jo), Lucene (Pablo, Max), other?
- Training Spotter: Train new spotter based on stats computed here
- Training Disambiguator: train disambiguators based on probabilities computed above (Pablo and Jo)
- Training Linker: find good decision functions (Pablo)
- Run spotting
- Run topic classification
- Run candidate selection
- Run disambiguation
- Run linking
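The five run-time steps above could be wired together roughly as follows. The traits, method names and stub implementations are illustrative, not the actual Spotlight interfaces.

```scala
// Minimal sketch of the run-time pipeline: spotting -> topic classification ->
// candidate selection -> disambiguation -> linking. All names are illustrative.
object PipelineSketch {

  case class SurfaceFormOccurrence(surfaceForm: String, text: String, offset: Int)
  case class ResourceOccurrence(surfaceForm: String, uri: String, score: Double)

  trait Spotter           { def spot(text: String): Seq[SurfaceFormOccurrence] }
  trait TopicClassifier   { def classify(text: String): Seq[String] }
  trait CandidateSelector { def candidates(sf: SurfaceFormOccurrence): Seq[String] }
  trait Disambiguator {
    def best(sf: SurfaceFormOccurrence, candidates: Seq[String], topics: Seq[String]): ResourceOccurrence
  }
  trait Linker            { def link(occ: ResourceOccurrence): Option[ResourceOccurrence] }

  def annotate(text: String,
               spotter: Spotter,
               topicClassifier: TopicClassifier,
               selector: CandidateSelector,
               disambiguator: Disambiguator,
               linker: Linker): Seq[ResourceOccurrence] = {
    val topics = topicClassifier.classify(text)          // run topic classification once per text
    spotter.spot(text).flatMap { sf =>                   // run spotting
      val cands = selector.candidates(sf)                // run candidate selection
      val best  = disambiguator.best(sf, cands, topics)  // run disambiguation
      linker.link(best)                                  // run linking (None = NIL, dropped)
    }
  }

  // Tiny demo wiring with stub components (illustrative only).
  def main(args: Array[String]): Unit = {
    val spotter = new Spotter {
      def spot(text: String) = Seq(SurfaceFormOccurrence("Berlin", text, text.indexOf("Berlin")))
    }
    val topics = new TopicClassifier { def classify(text: String) = Seq("geography") }
    val selector = new CandidateSelector {
      def candidates(sf: SurfaceFormOccurrence) = Seq("Berlin", "Berlin_(band)")
    }
    val disambiguator = new Disambiguator {
      def best(sf: SurfaceFormOccurrence, candidates: Seq[String], topics: Seq[String]) =
        ResourceOccurrence(sf.surfaceForm, candidates.head, 0.9)
    }
    val linker = new Linker {
      def link(occ: ResourceOccurrence) = if (occ.score >= 0.1) Some(occ) else None
    }
    println(annotate("Berlin is the capital of Germany", spotter, topics, selector, disambiguator, linker))
  }
}
```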
Dirk works on topic classification:
- setup:
- flattens the Wikipedia category hierarchy to the top 20-30 categories (topics)
- associates every DBpedia URI with the corresponding topics
- extracts one TSV file (or one Lucene index) per topic
- topic classifier training (updateable/streaming)
- execution:
- topic classification: input=text, output=topics (categories)
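A minimal sketch of what an updateable (streaming) topic classifier could look like, assuming a multinomial Naive Bayes over token counts; the actual classifier, its features and its training data may differ.

```scala
// Minimal sketch of an updateable (streaming) topic classifier: a multinomial
// Naive Bayes whose counts are incremented as labelled paragraphs stream in.
// Class priors are omitted for brevity; all names and data are illustrative.
object StreamingTopicClassifier extends App {
  import scala.collection.mutable

  // topic -> token -> count, grown incrementally
  private val tokenCounts = mutable.Map.empty[String, mutable.Map[String, Int]]
  // topic -> total token count
  private val topicTotals = mutable.Map.empty[String, Int].withDefaultValue(0)

  private def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // "Updateable/streaming" training: just increment counts per labelled example.
  def update(topic: String, text: String): Unit = {
    val counts = tokenCounts.getOrElseUpdate(topic, mutable.Map.empty[String, Int].withDefaultValue(0))
    tokenize(text).foreach { t =>
      counts(t) += 1
      topicTotals(topic) += 1
    }
  }

  // Classification: input = text, output = topics ranked by add-one smoothed log-likelihood.
  def classify(text: String): Seq[(String, Double)] = {
    val vocabSize = tokenCounts.values.flatMap(_.keys).toSet.size.max(1)
    tokenCounts.keys.toSeq.map { topic =>
      val counts = tokenCounts(topic)
      val logLik = tokenize(text).map { t =>
        math.log((counts(t) + 1.0) / (topicTotals(topic) + vocabSize))
      }.sum
      topic -> logLik
    }.sortBy(-_._2)
  }

  update("sports", "the team won the football match")
  update("politics", "the parliament passed the new law")
  println(classify("a football team played the final"))
}
```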
Jo works on a DB-backed core supporting the Entity-Mention generative model:
- setup:
- run from a jar -> load needed resources from streams instead of files
- reorganizes configuration
- loading count statistics into the database (counts come in TSV files)
- smoothing counts (e.g. truncate count<5, add 5, smooth at query time)
- execution:
- computing probabilities from smoothed counts and using them for disambiguation
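As a rough sketch of how smoothed counts turn into disambiguation probabilities under an entity-mention generative model, something along these lines could be computed. The count tables, the smoothing constant and the simplified normalizers below are illustrative, not the values loaded from the real TSV files.

```scala
// Minimal sketch of a generative entity-mention score from smoothed counts:
// P(uri | sf, context) is scored as P(uri) * P(sf | uri) * prod_t P(t | uri).
// All counts and constants are illustrative, not real Spotlight data.
object GenerativeDisambiguator extends App {

  val k = 5.0 // pseudo-count for add-k smoothing, in the spirit of "truncate count<5, add 5"

  // c(uri), c(sf, uri), c(token, uri) -- in the real system these come from the DB.
  val uriCounts   = Map("Berlin" -> 1000.0, "Berlin_(band)" -> 50.0)
  val sfUriCounts = Map(("Berlin", "Berlin") -> 800.0, ("Berlin", "Berlin_(band)") -> 40.0)
  val tokenUriCounts = Map(
    ("capital", "Berlin") -> 120.0, ("germany", "Berlin") -> 200.0,
    ("single", "Berlin_(band)") -> 15.0, ("album", "Berlin_(band)") -> 20.0
  )
  val totalUriCount = uriCounts.values.sum

  def pUri(uri: String): Double =
    (uriCounts.getOrElse(uri, 0.0) + k) / (totalUriCount + k * uriCounts.size)

  // The vocabulary-size normalizers (2 and 4) are simplified for this tiny example.
  def pSfGivenUri(sf: String, uri: String): Double =
    (sfUriCounts.getOrElse((sf, uri), 0.0) + k) / (uriCounts.getOrElse(uri, 0.0) + k * 2)

  def pTokenGivenUri(token: String, uri: String): Double =
    (tokenUriCounts.getOrElse((token, uri), 0.0) + k) / (uriCounts.getOrElse(uri, 0.0) + k * 4)

  // Log-space score to avoid underflow on long contexts.
  def score(sf: String, uri: String, contextTokens: Seq[String]): Double =
    math.log(pUri(uri)) + math.log(pSfGivenUri(sf, uri)) +
      contextTokens.map(t => math.log(pTokenGivenUri(t, uri))).sum

  val context = Seq("capital", "germany")
  val best = uriCounts.keys.maxBy(uri => score("Berlin", uri, context))
  println(s"Best candidate for 'Berlin' in $context: $best")
}
```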
Chris works on complementary vector space models of words/URIs:
- setup:
- extract (on Pig) one TSV file per resource type (people, organizations, etc.)
- compute statistics of word-resource and resource-word vectors on Pig/Hadoop
- execution:
- explicit semantic analysis: input=text, output=vector of weighted DBpedia resources
- disambiguation: input=text+surface_form, output=uri (using ESA from above)
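A minimal sketch of the ESA idea, assuming word-to-resource weights are already available. The weight table below is made up (the real weights would come from the Pig-computed word-resource statistics), and disambiguation here simply picks the candidate with the largest weight in the text's resource vector rather than doing a full cosine comparison.

```scala
// Minimal sketch of explicit semantic analysis (ESA) over DBpedia resources:
// each word maps to a weighted vector of resources, a text is the sum of its
// word vectors, and a candidate URI is scored by its weight in that vector.
// The word->resource weights are made up for illustration.
object EsaSketch extends App {

  type ResourceVector = Map[String, Double] // uri -> weight

  val wordToResources: Map[String, ResourceVector] = Map(
    "capital" -> Map("Berlin" -> 0.7, "Paris" -> 0.6),
    "band"    -> Map("Berlin_(band)" -> 0.9),
    "germany" -> Map("Berlin" -> 0.8, "Germany" -> 0.9)
  )

  // ESA: input = text tokens, output = vector of weighted DBpedia resources.
  def esa(tokens: Seq[String]): ResourceVector =
    tokens.flatMap(t => wordToResources.getOrElse(t, Map.empty[String, Double]).toSeq)
      .groupBy(_._1)
      .map { case (uri, weights) => uri -> weights.map(_._2).sum }

  // Disambiguation: input = text + candidate URIs for a surface form, output = uri.
  def disambiguate(tokens: Seq[String], candidates: Seq[String]): String = {
    val vector = esa(tokens)
    candidates.maxBy(uri => vector.getOrElse(uri, 0.0))
  }

  println(disambiguate(Seq("capital", "germany"), Seq("Berlin", "Berlin_(band)")))
}
```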
Hector works on collective disambiguation:
- setup:
- basic entity-entity statistics computed from occs.tsv
- loads resource association statistics
- other association statistics may come from different strategies as discussed above.
- builds (weighted) graphs of interconnections between entities
- execution:
- reweights candidate scores based on other candidates in the same context
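A minimal sketch of the reweighting step, assuming an entity-entity relatedness table (e.g. derived from co-occurrence counts in occs.tsv) and a simple linear mixture of local score and graph coherence; the mixing weight and the example scores are illustrative.

```scala
// Minimal sketch of collective reweighting: a candidate's local score is combined
// with its average relatedness to the candidates of the other surface forms in the
// same context. The relatedness table and the mixing weight alpha are illustrative.
object CollectiveReweighting extends App {

  case class Candidate(surfaceForm: String, uri: String, localScore: Double)

  // Entity-entity association, e.g. derived from co-occurrence counts in occs.tsv.
  val relatedness: Map[(String, String), Double] = Map(
    ("Berlin", "Germany") -> 0.9, ("Germany", "Berlin") -> 0.9,
    ("Berlin_(band)", "Germany") -> 0.1, ("Germany", "Berlin_(band)") -> 0.1
  )

  val alpha = 0.6 // weight of the local (context-based) score vs. the graph evidence

  def reweight(candidates: Seq[Candidate]): Seq[Candidate] =
    candidates.map { c =>
      val others = candidates.filter(_.surfaceForm != c.surfaceForm)
      val coherence =
        if (others.isEmpty) 0.0
        else others.map(o => relatedness.getOrElse((c.uri, o.uri), 0.0)).sum / others.size
      c.copy(localScore = alpha * c.localScore + (1 - alpha) * coherence)
    }

  val inContext = Seq(
    Candidate("Berlin", "Berlin", 0.55),
    Candidate("Berlin", "Berlin_(band)", 0.45),
    Candidate("Germany", "Germany", 0.95)
  )

  reweight(inContext).foreach(c => println(f"${c.surfaceForm} -> ${c.uri} : ${c.localScore}%.2f"))
}
```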