Indexing with Pignlproc and Hadoop
This page explains the steps required to generate several types of indexes from an XML dump of Wikipedia using Apache Pig and Hadoop. Some of this code is still in the testing phase, but each of the steps described here has been tested successfully using clients running Ubuntu 10.04 and 12.04 and a Hadoop cluster of five machines running Ubuntu 10.04, as well as in local and pseudo-distributed modes on a single machine.
Apache Pig version >= 0.8.1 (version 0.10.0 preferred)
Hadoop >= 0.20.x (tested with 0.20.2) - if you only want to test the scripts in local mode, you won't need to install Hadoop, as Pig comes with a bundled Hadoop installation for local mode.
Optional: a stopword list with one word per line, such as [the one used by DBpedia-Spotlight](http://spotlight.dbpedia.org/download/release-0.4/stopwords.en.list)
- Clone this fork of pignlproc
git clone https://github.com/dbpedia-spotlight/pignlproc.git /your/local/dir
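Then change into your checkout (using the example path from the clone command above):
cd /your/local/dir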
- From the top dir of pignlproc, build the project by running:
mvn package
This will compile the project and run the tests. If you want to skip the tests:
mvn package -DskipTests=true
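If the build succeeds, the JAR containing the pignlproc UDFs should appear under target/; a quick way to check (the version suffix may differ on your checkout):
ls target/pignlproc-*.jar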
- Set JAVA_HOME to the location of your JDK, e.g.:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
- Add Apache Pig to your PATH, e.g.:
export PATH=/home/<user>/programs/pig-0.10.0/bin:$PATH
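As a quick sanity check (output will vary with your installation), confirm that both are picked up:
echo $JAVA_HOME
pig -version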
- You're now ready to test in local mode. Modify the parameters in indexer-local.pig.params to control the indexing scripts; an example parameter file is sketched after the table below. Note that not all params are used in all scripts - you can look at the top of each script to find the params that it uses.
Param | Function |
---|---|
Input/Output | |
$INPUT | Specifies the path to the xml wikidump |
$OUTPUT_DIR | Specifies the directory for output |
Configuration | |
$MAX_SPAN_LENGTH | The maximum span in chars for individual (not aggregate) contexts |
$MIN_COUNT | The minimum count for a token to be included in a resource's token index |
$MIN_CONTEXTS | The minimum number of contexts a resource must have to be included |
$NUM_DOCS | The number of resources in the wikidump (used to approximate idf) |
$N | A cutoff param to keep only the top N tokens for a resource |
$URI_LIST | For filtering scripts, the location of the uri whitelist |
Language-Specific | |
$LANG | The lowercase language code ('en', 'fr', 'de', etc...) |
$ANALYZER_NAME* | The name of the Lucene Analyzer to use (EnglishAnalyzer, FrenchAnalyzer, etc...) |
$STOPLIST_NAME | The filename of the stopwords file to use (e.g. 'stopwords.en.list') |
$STOPLIST_PATH | The path to the dir containing the stoplist (e.g. /user/hadoop/stoplists) |
$PIGNLPROC_JAR | The local path to the JAR containing the Pignlproc UDFs |
Hadoop Configuration | |
$DEFAULT_PARALLEL | The default number of reducers to use (may be overridden by your cluster config) |
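For reference, a minimal indexer-local.pig.params for an English local-mode run might look like the sketch below. All values are illustrative placeholders - adjust the paths, thresholds and language settings to your own setup, and set NUM_DOCS to roughly the number of articles in your dump:
# local paths
INPUT=/data/wikipedia/enwiki-latest-pages-articles.xml
OUTPUT_DIR=/data/wikipedia/output
STOPLIST_PATH=/data/wikipedia/stoplists
STOPLIST_NAME=stopwords.en.list
PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar
# URI_LIST=/data/wikipedia/uri_whitelist.txt  (only needed by the filtering scripts)
# thresholds and cutoffs
MAX_SPAN_LENGTH=500
MIN_COUNT=1
MIN_CONTEXTS=1
NUM_DOCS=4000000
N=100
# language
LANG=en
ANALYZER_NAME=EnglishAnalyzer
# hadoop
DEFAULT_PARALLEL=1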
To test in local mode, modify indexer-local.pig.params for your local setup, then run examples/indexing/indexer_small_cluster.pig from the top dir of pignlproc:
pig -x local -m indexer-local.pig.params examples/indexing/indexer_small_cluster.pig
Note: change $OUTPUT_DIR if necessary to point to an output dir that works for you, and $STOPLIST_PATH to point to the directory containing your stoplist. When the script finishes, check $OUTPUT_DIR to confirm the output.
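When the script finishes, the results can be inspected directly on the local filesystem, e.g. (the subdirectory names depend on the STORE statements in the script you ran):
ls -R /path/to/your/output/dir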
##Running on a Hadoop Cluster##
If you want to run indexing on an actual Hadoop cluster, you'll first need to put your wikidump and stoplist into the Hadoop File System (HDFS).
hadoop fs -put /location/of/enwiki-latest-pages-articles.xml /location/on/hdfs/enwiki-latest-pages-articles.xml
hadoop fs -put /location/of/stopwords.en.list /location/on/hdfs/stopwords.en.list
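You can confirm that both files landed in HDFS with (substitute your own HDFS locations):
hadoop fs -ls /location/on/hdfs/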
Note: Although Pig supports automatic loading of files with .bz2 compression, this feature is not currently implemented in the custom loader in pignlproc. Thus, the extracted version of the XML dump is currently required. This inconvenience will be resolved in the near future.
You should also define your output dir ($OUTPUT_DIR) as a directory in HDFS. Note also that the parameters containing paths must now be paths in HDFS.
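As a sketch (the exact locations are up to you), the path-related entries of indexer.pig.params for a cluster run might then look like:
INPUT=/location/on/hdfs/enwiki-latest-pages-articles.xml
OUTPUT_DIR=/user/hadoop/output
STOPLIST_PATH=/location/on/hdfs
STOPLIST_NAME=stopwords.en.list
# the pignlproc JAR stays a local path on the client machine
PIGNLPROC_JAR=/local/path/to/pignlproc-0.1.0-SNAPSHOT.jar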
There are currently six possibilities for indexing:
Script | Function |
---|---|
indexer_small_cluster.pig | create an index {URI, {(term, count),...}} - count in UDF |
indexer_lucene.pig | create an index of {URI, {(term, count),...}} - count in MapRed |
tfidf.pig | create a tfidf index: {URI, {(term, weight),...}} |
uri_to_context_indexer.pig | create an index {URI, aggregated context} (one long chararray) |
uri_to_context_indexer_filter.pig | same as above, except filter by a user-provided list of URIs |
sf_group.pig | create an index {SurfaceForm, {(URI),...}, count} |
Note: testing has shown that indexer_small_cluster.pig is much more efficient than indexer_lucene.pig on small/mid-sized clusters (up to 35 mappers and 15 reducers), but these scripts have not yet been tested on a very large cluster.
Make sure that $HADOOP_HOME is set to the location of your local Hadoop installation.
echo $HADOOP_HOME
/local/hadoop-0.20.2
From your client machine you can now run the script using something like the following:
pig -m indexer.pig.params indexer_small_cluster.pig
Substitute 'tfidf.pig' for 'indexer_small_cluster.pig' to try the script that creates a tfidf index.
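For example, to build the tfidf index with the same parameter file:
pig -m indexer.pig.params tfidf.pig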
##Output##
Support for bz2-compressed JSON output has been added via pignlproc.storage.JsonCompressedStorage. If you want to change the output format, modify the last line of the scripts, e.g.:
STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.TSV.bz2' USING PigStorage();
or
DEFINE JsonCompressedStorage pignlproc.storage.JsonCompressedStorage();
STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.json.bz2' USING JsonCompressedStorage();
Notes:
1. The speed of indexing obviously depends on the size of your cluster. Constraints such as the size of hadoop.tmp.dir may also affect performance. This code has been tested on the full English Wikipedia with (relatively) good performance on a five-node cluster.
2. You only need the pignlproc JAR and the example scripts to use this code with Hadoop. If you built the project on a different machine, you can simply copy examples/indexing/indexer_small_cluster.pig, examples/indexing/indexer_lucene.pig and target/pignlproc-0.1.0-SNAPSHOT.jar to a client machine configured for your cluster (you'll still need Apache Pig on that machine).
3. Once indexing has finished, to get the files back onto your local machine, do:
hadoop fs -get /path/to/hadoop/output /path/to/local
If you want one big file instead of the part-* files, do:
hadoop fs -getmerge /path/to/hadoop/output /path/to/local
Note: if you used JsonCompressedStorage(), you'll need to delete the automatically generated header and schema files before using -getmerge; otherwise the file will not be recognized as bz2.
Example:
hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_header
hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_schema
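With those two files removed, -getmerge produces a single valid bz2 file (paths follow the example above):
hadoop fs -getmerge /user/hadoop/output/tfidf_token_weights.json.bz2 /path/to/local/tfidf_token_weights.json.bz2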