Indexing with Pignlproc and Hadoop
This page explains the steps required to generate several types of indexes from an XML dump of Wikipedia using Apache Pig and Hadoop. Some of this code is still in the testing phase, but each of the steps described here has been tested successfully using clients running Ubuntu 10.04 and 12.04 and a Hadoop cluster of five machines running Ubuntu 10.04, as well as in local and pseudo-distributed modes on a single machine.
Apache Pig version >= 0.8.1 (version 0.10.0 preferred)
Hadoop >= 0.20.x (tested with 0.20.2) - if you only want to test the scripts in local mode, you won't need to install Hadoop, as Pig comes with a bundled Hadoop installation for local mode.
Optional: a stopword list with one word per line, such as [the one used by DBpedia-Spotlight](http://spotlight.dbpedia.org/download/release-0.4/stopwords.en.list)
- Clone this fork of pignlproc
git clone https://github.com/dbpedia-spotlight/pignlproc.git /your/local/dir
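Then change into your checkout (using the example path from the clone command above):
cd /your/local/dir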
- From the top dir of pignlproc, build the project by running:
mvn package
This will compile the project and run the tests. If you want to skip the tests:
mvn package -DskipTests=true
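If the build succeeds, the JAR containing the pignlproc UDFs should appear under target/; a quick way to check (the version suffix may differ on your checkout):
ls target/pignlproc-*.jar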
- Set JAVA_HOME to the location of your JDK, e.g.:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
- Add Apache Pig to your PATH, e.g.:
export PATH=/home/<user>/programs/pig-0.10.0/bin:$PATH
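As a quick sanity check (output will vary with your installation), confirm that both are picked up:
echo $JAVA_HOME
pig -version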
- You're now ready to test in local mode. Modify the parameters in indexer-local.pig.params to control the indexing scripts; an example parameter file is sketched after the table below. Note that not all params are used in all scripts - you can look at the top of each script to find the params that it uses.
Param | Function |
---|---|
Input/Output | |
$INPUT | Specifies the path to the xml wikidump |
$OUTPUT_DIR | Specifies the directory for output |
Configuration | |
$MAX_SPAN_LENGTH | The maximum span in chars for individual (not aggregate) contexts |
$MIN_COUNT | The minimum count for a token to be included in a resource's token index |
$MIN_CONTEXTS | The minimum number of contexts a resource must have to be included |
$NUM_DOCS | The number of resources in the wikidump (used to approximate idf) |
$N | A cutoff param to keep only the top N tokens for a resource |
$URI_LIST | For filtering scripts, the location of the uri whitelist |
Language-Specific | |
$LANG | The lowercase language code ('en', 'fr', 'de', etc...) |
$ANALYZER_NAME* | The name of the Lucene Analyzer to use (EnglishAnalyzer, FrenchAnalyzer, etc...) |
$STOPLIST_NAME | The filename of the stopwords file to use (e.g. 'stopwords.en.list') |
$STOPLIST_PATH | The path to the dir containing the stoplist (e.g. /user/hadoop/stoplists) |
$PIGNLPROC_JAR | The local path to the JAR containing the Pignlproc UDFs |
Hadoop Configuration | |
$DEFAULT_PARALLEL | The default number of reducers to use (may be overridden by your cluster config) |
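For reference, a minimal indexer-local.pig.params for an English local-mode run might look like the sketch below. All values are illustrative placeholders - adjust the paths, thresholds and language settings to your own setup, and set NUM_DOCS to roughly the number of articles in your dump:
# local paths
INPUT=/data/wikipedia/enwiki-latest-pages-articles.xml
OUTPUT_DIR=/data/wikipedia/output
STOPLIST_PATH=/data/wikipedia/stoplists
STOPLIST_NAME=stopwords.en.list
PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar
# URI_LIST=/data/wikipedia/uri_whitelist.txt  (only needed by the filtering scripts)
# thresholds and cutoffs
MAX_SPAN_LENGTH=500
MIN_COUNT=1
MIN_CONTEXTS=1
NUM_DOCS=4000000
N=100
# language
LANG=en
ANALYZER_NAME=EnglishAnalyzer
# hadoop
DEFAULT_PARALLEL=1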
To test in local mode, modify indexer-local.pig.params for your local setup, then run examples/indexing/indexer_small_cluster.pig from the top dir of pignlproc:
pig -x local -m indexer-local.pig.params examples/indexing/indexer_small_cluster.pig
Note: change $OUTPUT_DIR if necessary to point to an output dir that works for you, and $STOPLIST_PATH to point to the directory containing your stoplist. When the script finishes, check $OUTPUT_DIR to confirm the output.
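When the script finishes, the results can be inspected directly on the local filesystem, e.g. (the subdirectory names depend on the STORE statements in the script you ran):
ls -R /path/to/your/output/dir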
##Running on a Hadoop Cluster##
If you want to run indexing on an actual Hadoop cluster, you'll first need to put your wikidump and stoplist into the Hadoop File System (HDFS).
hadoop fs -put /location/of/enwiki-latest-pages-articles.xml /location/on/hdfs/enwiki-latest-pages-articles.xml
hadoop fs -put /location/of/stopwords.en.list /location/on/hdfs/stopwords.en.list
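You can confirm that both files landed in HDFS with (substitute your own HDFS locations):
hadoop fs -ls /location/on/hdfs/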
Note: Although Pig supports automatic loading of files with .bz2 compression, this feature is not currently implemented in the custom loader in pignlproc. Thus, the extracted version of the XML dump is currently required. This inconvenience will be resolved in the near future.
You should also define your output dir ($OUTPUT_DIR) as a directory in HDFS. Note also that the parameters containing paths must now be paths in HDFS.
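As a sketch (the exact locations are up to you), the path-related entries of indexer.pig.params for a cluster run might then look like:
INPUT=/location/on/hdfs/enwiki-latest-pages-articles.xml
OUTPUT_DIR=/user/hadoop/output
STOPLIST_PATH=/location/on/hdfs
STOPLIST_NAME=stopwords.en.list
# the pignlproc JAR stays a local path on the client machine
PIGNLPROC_JAR=/local/path/to/pignlproc-0.1.0-SNAPSHOT.jar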
There are currently six possibilities for indexing:
Script | Function |
---|---|
indexer_small_cluster.pig | create an index {URI, {(term, count),...}} - count in UDF |
indexer_lucene.pig | create an index of {URI, {(term, count),...}} - count in MapRed |
tfidf.pig | create a tfidf index: {URI, {(term, weight),...}} |
uri_to_context_indexer.pig | create an index {URI, aggregated context} (one long chararray) |
uri_to_context_indexer_filter.pig | same as above, except filter by a user-provided list of URIs |
sf_group.pig | create an index {SurfaceForm, {(URI),...}, count} |
Note: testing has shown that indexer_small_cluster.pig is much more efficient than indexer_lucene.pig on small/mid-sized clusters (up to 35 mappers and 15 reducers), but these scripts have not yet been tested on a very large cluster.
Make sure that $HADOOP_HOME is set to the location of your local Hadoop installation.
echo $HADOOP_HOME
/local/hadoop-0.20.2
From your client machine you can now run the script using something like the following:
pig -m indexer.pig.params indexer_small_cluster.pig
Substitute 'tfidf.pig' for 'indexer_small_cluster.pig' to try the script that creates a tfidf index.
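For example, to build the tfidf index with the same parameter file:
pig -m indexer.pig.params tfidf.pig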
##Output##
Support for bz2-compressed JSON output has been added via pignlproc.storage.JsonCompressedStorage. If you want to change the output format, modify the last line of the scripts, e.g.:
STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.TSV.bz2' USING PigStorage();
or
DEFINE JsonCompressedStorage pignlproc.storage.JsonCompressedStorage();
STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.json.bz2' USING JsonCompressedStorage();
Notes:
1. The speed of indexing obviously depends on the size of your cluster. Constraints such as the size of hadoop.tmp.dir may also affect performance. This code has been tested on the full English Wikipedia with (relatively) good performance on a five-node cluster.
2. You only need the pignlproc JAR and the example scripts to use this code with Hadoop. If you built the project on a different machine, you can simply copy examples/indexing/indexer_small_cluster.pig, examples/indexing/indexer_lucene.pig and target/pignlproc-0.1.0-SNAPSHOT.jar to a client machine configured for your cluster (you'll still need Apache Pig on that machine).
3. Once indexing has finished, to get the files back onto your local machine, do:
hadoop fs -get /path/to/hadoop/output /path/to/local
If you want one big file instead of the part-* files, do:
hadoop fs -getmerge /path/to/hadoop/output /path/to/local
Note: if you used JsonCompressedStorage(), you'll need to delete the automatically generated header and schema files before using -getmerge; otherwise the file will not be recognized as bz2.
Example:
hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_header
hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_schema
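With those two files removed, -getmerge produces a single valid bz2 file (paths follow the example above):
hadoop fs -getmerge /user/hadoop/output/tfidf_token_weights.json.bz2 /path/to/local/tfidf_token_weights.json.bz2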