Internationalization (DB backed core)
Important note: This page explains how to create a Spotlight model on your own server. It is a detailed tutorial explaining each step; a fully automated script for these steps can be found here.
For this part, you need Apache Hadoop and Apache Pig. If you don't have them installed, we recommend the official tutorials for setting up Hadoop and Apache Pig. The indexing can also be run on a single machine; in that case it is enough to download Apache Pig and run it in local mode (add `-x local` after every `pig` command to run locally without Hadoop). For more details on Hadoop-based indexing, see Indexing with Pignlproc and Hadoop, which also lists all required versions.
In the following sections, we take Dutch as our example language. If you want to run the indexing for another language, just replace `nl` (the language code for Dutch) with the corresponding language code. We also assume that the default working directory in HDFS is `/user/hadoop`.
This section provides a quick way of creating the Spotlight model by executing the indexing script. Before that, you need to prepare some models as shown in the following two points. In addition, the following programs must be available on your server: `hadoop`, `pig`, `mvn`, `git`, `curl`.
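As a quick sanity check, you can verify that all of them are on your `PATH` (a minimal POSIX-shell sketch):

```
$ for cmd in hadoop pig mvn git curl; do command -v "$cmd" >/dev/null || echo "missing: $cmd"; done
```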
- Create a working directory and download the OpenNLP models for your language:

  ```
  $ mkdir -p /data/spotlight/nl
  $ ls /data/spotlight/nl/opennlp
  nl-chunker.bin  nl-pos-maxent.bin  nl-sent.bin  nl-token.bin
  ```

  Note: the working directory is given as an absolute path. The NLP models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
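  For example, assuming the Dutch models follow the usual `nl-*.bin` naming on that page, they could be fetched like this (verify the exact file names on the download page first):

  ```
  $ mkdir -p /data/spotlight/nl/opennlp
  $ cd /data/spotlight/nl/opennlp
  $ wget http://opennlp.sourceforge.net/models-1.5/nl-token.bin
  $ wget http://opennlp.sourceforge.net/models-1.5/nl-sent.bin
  $ wget http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin
  $ wget http://opennlp.sourceforge.net/models-1.5/nl-chunker.bin
  ```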
- Create a list of stopwords:

  ```
  $ head -n5 /data/spotlight/nl/stopwords.nl.list
  de
  en
  van
  ik
  te
  ```
- Run the indexing script, which creates the model in `/data/spotlight/nl/model_nl`:

  ```
  $ cd /data/spotlight/nl
  $ wget https://raw.github.com/jodaiber/dbpedia-spotlight/master/bin/index_db.sh
  $ chmod +x index_db.sh
  $ ./index_db.sh -o /data/spotlight/nl/opennlp /data/spotlight/nl nl_NL /data/spotlight/nl/stopwords.nl.list Dutch /data/spotlight/nl/model_nl
  ```

  Note: start the Hadoop workers before you run the above commands; the paths in this command are absolute. You can change `nl` to another language code, but remember to also change `Dutch` to the name of the corresponding language-specific Lucene analyzer, e.g. `English` for the `EnglishAnalyzer`, as in the sketch below.
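  For instance, an analogous invocation for English would look like this (a sketch, assuming you have prepared the English stopword list and OpenNLP models in the corresponding directories):

  ```
  $ ./index_db.sh -o /data/spotlight/en/opennlp /data/spotlight/en en_US /data/spotlight/en/stopwords.en.list English /data/spotlight/en/model_en
  ```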
This section describes the detailed steps for creating the Spotlight model. These steps are all performed by `index_db.sh`.
- Prepare a stopword list file under `/data/spotlight/nl` and the NLP models under `/data/spotlight/nl/opennlp`, as in the quick way above.
- Download the DBpedia data (see here):

  ```
  $ mkdir -p /data/spotlight/nl/processed/
  $ cd /data/spotlight/nl/processed/
  $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-redirects.ttl.gz | gzcat > redirects.nt
  $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-disambiguations.ttl.gz | gzcat > disambiguations.nt
  $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-instance-types.ttl.gz | gzcat > instance_types.nt
  ```

  Note: if `gzcat` is not available, replace it with `gunzip -c`, as in the loop sketch below.
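  Equivalently, the three dumps can be fetched in a single loop using the `gunzip -c` variant (a minimal bash sketch):

  ```
  $ cd /data/spotlight/nl/processed/
  $ base=http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003
  $ for f in redirects disambiguations instance-types; do
  >   curl "$base-$f.ttl.gz" | gunzip -c > "$(echo "$f" | tr '-' '_').nt"
  > done
  ```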
- Download the Wikipedia dump:

  ```
  $ cd /data/spotlight/nl
  $ wget http://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2
  ```
- Check out and build our version of pignlproc:

  ```
  $ mkdir pig
  $ cd pig
  $ git clone git://github.com/dbpedia-spotlight/pignlproc.git
  $ cd pignlproc
  $ mvn assembly:assembly -Dmaven.test.skip=true
  ```

  Note: there are redirect definitions for most languages that have a local Wikipedia. If you are unsure whether your language is among them, check that the language is supported in the method `getRedirectPatterns` in `AnnotatingMarkupParser`.
  Note: if the build fails because `core-0.6.jar` of `org.dbpedia.spotlight` is not available from the `info-bliki-repository`, you need to prepare that jar yourself by downloading the DBpedia Spotlight code and running `mvn install` on it manually, as sketched below.
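  A sketch of that manual build, assuming the standard module layout of the dbpedia-spotlight repository:

  ```
  $ git clone git://github.com/dbpedia-spotlight/dbpedia-spotlight.git
  $ cd dbpedia-spotlight/core
  $ mvn install -Dmaven.test.skip=true
  ```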
- Split the corpus into training, tuning, and test sets and move the training part into HDFS:

  ```
  $ cd /data/spotlight/nl
  $ bzcat nlwiki-latest-pages-articles.xml.bz2 | python pig/pignlproc/utilities/split_train_test.py 12000 /data/spotlight/nl/processed/test.txt | hadoop fs -put - nlwiki-latest-pages-articles.xml
  ```

  Then move the stopwords and the tokenizer model into HDFS:

  ```
  $ hadoop fs -put /data/spotlight/nl/stopwords.nl.list stopwords.nl.list
  $ hadoop fs -put /data/spotlight/nl/opennlp/nl-token.bin nl.tokenizer_model
  ```
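  To confirm that everything landed in the HDFS working directory (the default `/user/hadoop` in our example), list it:

  ```
  $ hadoop fs -ls /user/hadoop
  ```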
- Adapt `examples/indexing/token_counts.pig.params` and `examples/indexing/names_and_entities.pig.params` to your language. See the two linked files for the Dutch example, and the hypothetical sketch below.

  Note: due to line 87 of `RestrictedNGramGenerator.java`, the path of the tokenizer model is fixed to `./nl.tokenizer_model`. You therefore have to make your working directory the default working directory `./` in HDFS, i.e. `/user/hadoop` in our example. Otherwise the job will fail with an error about the missing `nl.tokenizer_model` file.
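  The `.pig.params` files are plain `KEY=value` files passed to Pig via `-m`. As a purely illustrative, hypothetical sketch (the actual keys are defined by the two files in the repository; keep those names and change only the values):

  ```
  # hypothetical token_counts.pig.params for Dutch --
  # the real keys are whatever the file in the repository defines
  LANG=nl
  ANALYZER_NAME=DutchAnalyzer
  INPUT=nlwiki-latest-pages-articles.xml
  STOPLIST_PATH=stopwords.nl.list
  ```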
- Run Apache Pig:

  ```
  $ cd /data/spotlight/nl/pig/pignlproc
  $ pig -m examples/indexing/token_counts.pig.params examples/indexing/token_counts.pig
  $ pig -m examples/indexing/names_and_entities.pig.params examples/indexing/names_and_entities.pig
  ```

  Note: if you get a `java.lang.OutOfMemoryError`, try to enlarge the heap space as follows: 1) add the line `SET mapred.child.java.opts '-Xmx2048m';` to the script and 2) comment out the line `set io.sort.mb 1024` (see the sketch below).
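  In the Pig script, those two changes might look like this (a sketch; `--` starts a comment in Pig Latin):

  ```
  SET mapred.child.java.opts '-Xmx2048m';
  -- set io.sort.mb 1024
  ```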
- Move the results of both jobs into the local processed directory:

  ```
  $ cd /data/spotlight/nl/processed/
  $ hadoop fs -cat tokenCounts/tokenCounts/part* > tokenCounts
  $ hadoop fs -cat names_and_entities/pairCounts/part* > pairCounts
  $ hadoop fs -cat names_and_entities/uriCounts/part* > uriCounts
  $ hadoop fs -cat names_and_entities/sfAndTotalCounts/part* > sfAndTotalCounts
  ```

  You should then have the following files:

  ```
  $ ls /data/spotlight/nl/processed/
  pairCounts  sfAndTotalCounts  tokenCounts  uriCounts
  ```
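  A quick sanity check that the count files are non-empty:

  ```
  $ wc -l /data/spotlight/nl/processed/pairCounts /data/spotlight/nl/processed/tokenCounts
  $ head -n3 /data/spotlight/nl/processed/uriCounts
  ```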
- Create the Spotlight model `/data/spotlight/nl/model_nl` with:

  ```
  $ java -cp dbpedia-spotlight.jar org.dbpedia.spotlight.db.CreateSpotlightModel nl_NL /data/spotlight/nl/processed/ /data/spotlight/nl/model_nl /data/spotlight/nl/opennlp /data/spotlight/nl/stopwords.nl.list None
  ```

  (Note that `CreateSpotlightModel` is run via `-cp`; `-jar` would start the jar's default main class instead.) This will create the following Spotlight model folder:

  ```
  $ tree /data/spotlight/nl/model_nl/
  /data/spotlight/nl/model_nl/
  ├── model
  │   ├── candmap.mem
  │   ├── context.mem
  │   ├── res.mem
  │   ├── sf.mem
  │   └── tokens.mem
  ├── model.properties
  ├── opennlp
  │   ├── chunker.bin
  │   ├── pos-maxent.bin
  │   ├── sent.bin
  │   └── token.bin
  ├── opennlp_chunker_thresholds.txt
  └── stopwords.list
  ```
That's it! You can now run the server with your newly created model:

```
$ java -jar dbpedia-spotlight.jar /data/spotlight/nl/model_nl http://localhost:2222/rest
```
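Once the server is up, you can test it with curl against the standard `/rest/annotate` endpoint (the text and confidence value here are just examples):

```
$ curl http://localhost:2222/rest/annotate \
    -H "Accept: application/json" \
    --data-urlencode "text=Antwerpen is een stad in Vlaanderen." \
    --data "confidence=0.2"
```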
Note: if you only want to quickly run the statistical backend without creating a model yourself, pre-built models are available from the download page.