Data generation manual

Summary of the data generation guidelines

  • Download a few files from the latest DBpedia release in your language. For English we used the following datasets: labels_en.nt.bz2, redirects_en.nt.bz2, disambiguations_en.nt.bz2, instance_types_en.nt.bz2. If they are not available, you will need to look into how to create them. For higher quality extraction, consider adding infobox mappings and DBpedia Ontology labels for your language at http://mappings.dbpedia.org. The results from mapping-based extraction are available in the file mappingbased_properties_en.nt.bz2.
  • Get a Wikipedia XML dump in your language (only pages-articles.xml.bz2).
  • Get Lucene tokenizers, stemmers, stopwords, etc. for your language. For many languages SnowballAnalyzer may be enough. For others, this will require some language-specific knowledge that we cannot cover here; see for example http://hunspell.sourceforge.net/
  • Extract name variations (lexicalizations) for DBpedia Resources via ExtractCandidateMap
  • Extract DBpedia resource occurrences via ExtractOccsFromWikipedia
  • Sort the extracted TSV (tab-separated values) file by URI
  • Run DBpedia Spotlight occurrence indexing via IndexMergedOccurrences
  • Add name variations and types to the index via AddSurfaceFormsToIndex and AddTypesToIndex
  • Optionally compress the index
  • Run evaluation and detect similarity threshold scores
  • If you'd like to try indexing using Hadoop and Apache Pig, check this page
For more details on how to run each step, see the latest version of this file: https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/bin/index.sh. A condensed sketch of the commands is shown below.
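
The sketch below condenses the steps above into shell commands. The download URLs, the DBpedia release number, the occurrence file's column layout, and the way the JVM main classes are invoked are assumptions for illustration only; bin/index.sh (linked above) remains the authoritative reference.

```bash
#!/bin/bash
# Sketch only: URLs, release number, file names and class names below are
# assumptions -- copy the exact invocations and arguments from bin/index.sh.
LANGCODE=en
DBPEDIA=http://downloads.dbpedia.org/3.8/$LANGCODE          # adjust the release number
WIKIDUMPS=https://dumps.wikimedia.org/${LANGCODE}wiki/latest

# 1. DBpedia datasets
for f in labels redirects disambiguations instance_types; do
  wget "$DBPEDIA/${f}_${LANGCODE}.nt.bz2"
done

# 2. Wikipedia XML dump (pages-articles only)
wget "$WIKIDUMPS/${LANGCODE}wiki-latest-pages-articles.xml.bz2"

# 3. Extraction steps run as JVM main classes; the invocation pattern is an
#    assumption here -- take the real commands from bin/index.sh, e.g.:
# mvn scala:run -DmainClass=...ExtractCandidateMap      <args>
# mvn scala:run -DmainClass=...ExtractOccsFromWikipedia <args>

# 4. Sort the occurrences TSV by URI (assuming the URI is the second column)
sort -t $'\t' -k2,2 occs.tsv > occs.uriSorted.tsv

# 5. Indexing (again, class names and arguments come from bin/index.sh)
# mvn scala:run -DmainClass=...IndexMergedOccurrences   <args>
# mvn scala:run -DmainClass=...AddSurfaceFormsToIndex   <args>
# mvn scala:run -DmainClass=...AddTypesToIndex          <args>
```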

We are also creating a page to catalog Internationalization issues.

Finally, if you plan to change the source code, please consider committing it back to our repository so that other people can also benefit from it - and you can be acknowledged as an awesome contributor!

Checking the Generated Data

You should make sure that each step is generating the data correctly.

Helpful tools:

  • Use grep/cut, etc. to inspect the text files
  • Use Luke to inspect the generated indexes.
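
For example, a quick look at the occurrences file from the shell; the column layout assumed here (id, URI, surface form, context) may differ from what your extraction produced, so check the file first:

```bash
# Peek at the first few rows (column layout is an assumption: id, URI, surface form, context)
head -n 5 occs.uriSorted.tsv | cut -f 1-3

# Rough count of occurrences mentioning one resource (may over-count matches in the context text)
grep -c Berlin occs.uriSorted.tsv
```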

Candidate Map

Check conceptURIs to see if undesirable URIs were kept (e.g. disambiguations, redirects, lists, or whatever else does not match your use case).

Check surface forms to see if they look like legitimate names of things in your language.
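
Both checks can be done from the shell. The file names conceptURIs.list and surfaceForms.tsv below are assumptions; substitute whatever ExtractCandidateMap actually wrote out:

```bash
# Look for disambiguation and list pages that should have been filtered out
grep -iE 'disambiguation|List_of' conceptURIs.list | head

# Eyeball a random sample of surface forms (assuming the surface form is the first column)
shuf -n 20 surfaceForms.tsv | cut -f1
```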

Extracted Occurrences

See how many were extracted. How many URIs have occurrences? Do the contexts have a decent length? (This is subjective and depends on your use case.)
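
Assuming the URI is in the second column and the context text in the fourth (verify this against your own file), the following gives rough numbers:

```bash
# Total number of extracted occurrences
wc -l < occs.uriSorted.tsv

# Number of distinct URIs that have at least one occurrence
cut -f2 occs.uriSorted.tsv | sort -u | wc -l

# Average context length in characters
awk -F'\t' '{ total += length($4) } END { if (NR > 0) print total / NR }' occs.uriSorted.tsv
```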

Context Index

Use Luke to open the index. How many URIs are there in the index? Were the entity names stored in the field SURFACE_FORM? Were the counts stored in the field URI_COUNT? Were types stored in the field TYPE?

Look in the context field:

  • Were words stemmed?
  • Were stopwords removed?

Look in the surface form field:

  • Do the surface forms have the correct morphology? (e.g. were they all lowercased, if that's what you wanted)
  • Do they look "clean", or are there very generic references (e.g. "here" pointing to "Berlin")?