Data generation manual

Summary of the data generation guidelines

  • Download a few files from the latest DBpedia release in your language. For English we used the following datasets: labels_en.nt.bz2, redirects_en.nt.bz2, disambiguations_en.nt.bz2, instance_types_en.nt.bz2. If they are not available, you will need to look into how to create them. For higher quality extraction, consider adding infobox mappings and DBpedia Ontology labels for your language at http://mappings.dbpedia.org. The results from mapping-based extraction are available in the file mappingbased_properties_en.nt.bz2.
  • Get a Wikipedia XML dump in your language (only pages-articles.xml.bz2).
  • Get Lucene tokenizers, stemmers, stopwords, etc. for your language. For many languages SnowballAnalyzer may be enough. For others, this will require some language-specific knowledge that we cannot cover here; see for example http://hunspell.sourceforge.net/
  • Extract name variations (lexicalizations) for DBpedia Resources via ExtractCandidateMap
  • Extract DBpedia resource occurrences via ExtractOccsFromWikipedia
  • Sort the extracted TSV (tab-separated values) file by URI
  • Run DBpedia Spotlight occurrence indexing via IndexMergedOccurrences
  • Add name variations and types to the index via AddSurfaceFormsToIndex and AddTypesToIndex
  • Optionally compress the index
  • Run evaluation and detect similarity threshold scores
  • If you'd like to try indexing using Hadoop and Apache Pig, check this page
For more details on how to run each step, see the latest version of this file: https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/bin/index.sh. A condensed sketch of the commands is shown below.
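
The sketch below condenses the steps above into shell commands. The download URLs, the DBpedia release number, the occurrence file's column layout, and the way the JVM main classes are invoked are assumptions for illustration only; bin/index.sh (linked above) remains the authoritative reference.

```bash
#!/bin/bash
# Sketch only: URLs, release number, file names and class names below are
# assumptions -- copy the exact invocations and arguments from bin/index.sh.
LANGCODE=en
DBPEDIA=http://downloads.dbpedia.org/3.8/$LANGCODE          # adjust the release number
WIKIDUMPS=https://dumps.wikimedia.org/${LANGCODE}wiki/latest

# 1. DBpedia datasets
for f in labels redirects disambiguations instance_types; do
  wget "$DBPEDIA/${f}_${LANGCODE}.nt.bz2"
done

# 2. Wikipedia XML dump (pages-articles only)
wget "$WIKIDUMPS/${LANGCODE}wiki-latest-pages-articles.xml.bz2"

# 3. Extraction steps run as JVM main classes; the invocation pattern is an
#    assumption here -- take the real commands from bin/index.sh, e.g.:
# mvn scala:run -DmainClass=...ExtractCandidateMap      <args>
# mvn scala:run -DmainClass=...ExtractOccsFromWikipedia <args>

# 4. Sort the occurrences TSV by URI (assuming the URI is the second column)
sort -t $'\t' -k2,2 occs.tsv > occs.uriSorted.tsv

# 5. Indexing (again, class names and arguments come from bin/index.sh)
# mvn scala:run -DmainClass=...IndexMergedOccurrences   <args>
# mvn scala:run -DmainClass=...AddSurfaceFormsToIndex   <args>
# mvn scala:run -DmainClass=...AddTypesToIndex          <args>
```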

We are also creating a page to catalog Internationalization issues.

Finally, if you plan to change the source code, please consider committing it back to our repository so that other people can also benefit from it - and you can be acknowledged as an awesome contributor!

Checking the Generated Data

You should make sure that each step is generating the data correctly.

Helpful tools:

  • Use grep/cut, etc. to inspect the text files
  • Use Luke to inspect the generated indexes.
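
For example, a quick look at the occurrences file from the shell; the column layout assumed here (id, URI, surface form, context) may differ from what your extraction produced, so check the file first:

```bash
# Peek at the first few rows (column layout is an assumption: id, URI, surface form, context)
head -n 5 occs.uriSorted.tsv | cut -f 1-3

# Rough count of occurrences mentioning one resource (may over-count matches in the context text)
grep -c Berlin occs.uriSorted.tsv
```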

Candidate Map

Check conceptURIs to see if undesirable URIs were kept (e.g. disambiguations, redirects, lists, or whatever else does not match your use case).

Check surface forms to see if they look like legitimate names of things in your language.
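
Both checks can be done from the shell. The file names conceptURIs.list and surfaceForms.tsv below are assumptions; substitute whatever ExtractCandidateMap actually wrote out:

```bash
# Look for disambiguation and list pages that should have been filtered out
grep -iE 'disambiguation|List_of' conceptURIs.list | head

# Eyeball a random sample of surface forms (assuming the surface form is the first column)
shuf -n 20 surfaceForms.tsv | cut -f1
```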

Extracted Occurrences

See how many were extracted. How many URIs have occurrences? Do the contexts have a decent length? (This is subjective and depends on your use case.)
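
Assuming the URI is in the second column and the context text in the fourth (verify this against your own file), the following gives rough numbers:

```bash
# Total number of extracted occurrences
wc -l < occs.uriSorted.tsv

# Number of distinct URIs that have at least one occurrence
cut -f2 occs.uriSorted.tsv | sort -u | wc -l

# Average context length in characters
awk -F'\t' '{ total += length($4) } END { if (NR > 0) print total / NR }' occs.uriSorted.tsv
```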

Context Index

Use Luke to open the index. How many URIs are there in the index? Were the entity names stored in the field SURFACE_FORM? Were the counts stored in the field URI_COUNT? Were types stored in the field TYPE?

Look in the context field:

  • Were words stemmed?
  • Were stopwords removed?

Look in the surface form field:

  • Do the surface forms have the correct morphology? (e.g. were they all lowercased, if that's what you wanted)
  • Do they look "clean", or are there very generic references (e.g. "here" pointing to "Berlin")?