pignlproc indexing TODO and issues
This page was created to track the TODO list for Hadoop indexing using Apache Pig, and to document known issues.
We might want to fork pignlproc inside the dbpedia-spotlight account. Then, when someone wants to work on one of the things below, a GitHub issue can be opened and assigned. Or we could move all of these items into GitHub issues anyway.
- Output compressed JSON - Update: fixed using JsonCompressedStorage.
- Refactor scripts to eliminate duplicate code (making use of macros & imports where possible), and eliminate any non-essential scripts from the repo. Update: non-essential scripts have been eliminated, but we still need to add macros for the Wikipedia .xml dump processing part of the scripts and deal with some duplicate code (a macro sketch is given after this list).
- Create a default config file to simplify indexing for new users - Update: see indexer-local.pig.params and indexer.pig.params (a parameter-file sketch is given after this list).
- Allow a user-provided language argument to specify the correct analyzer for the given language, optionally including a user-provided stoplist.
*Note: for consistency, the parameter should probably be the exact Analyzer name (e.g. 'EnglishAnalyzer') as used in org.apache.lucene.analysis - otherwise we'll need to implement rather messy heuristics to figure out which analyzer the user wants.
Update: implemented for pignlproc.index.{LuceneTokenizer, GetCountsLucene}. The user must specify a language code and an analyzer name corresponding to the namespace structure of org.apache.lucene.analysis in Lucene 3.6.
- Resolve redirects - Update: redirect resolution has been implemented by Pablo in dbpedia-spotlight, using RedirectResolveFilter, but this still requires the user to first run ExtractCandidateMap. It would be better to do this directly during Hadoop indexing.
- Extract the paragraph on a page X as context for concept X
- Compute total (Wikipedia-wide) token counts
- Attempt to make context window size flexible
- Parse bullet points as context on disambiguation pages
- Modify [ParsingWikipediaLoader](https://github.com/chrishokamp/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java) to allow bzipped input - some attempts at this have already been made, but enabling compressed input for custom loaders is not as straightforward as it might seem.
- Write quality tests using PigUnit for new classes in pignlproc - GetCountsLucene and ParagraphsWithLink especially.
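As a rough sketch of the macro refactoring mentioned above, the Wikipedia .xml dump loading step could live in a shared macro file that every indexing script imports. The file name and the loader's constructor argument are assumptions for illustration, not the current pignlproc code:

```
-- wikipedia-macros.pig (hypothetical file name)
-- Shared loading step for the Wikipedia .xml dump; assumes ParsingWikipediaLoader
-- takes a language code and declares its own schema (title, text, links, ...).
DEFINE load_wikipedia(dump_path, lang) RETURNS articles {
    $articles = LOAD '$dump_path'
        USING pignlproc.storage.ParsingWikipediaLoader('$lang');
};
```

Each indexing script would then begin with something like `IMPORT 'wikipedia-macros.pig'; articles = load_wikipedia('$INPUT', '$LANG');` instead of repeating the LOAD statement.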
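For the default config file and the language/analyzer argument, a parameter file passed with `pig -param_file` keeps per-run settings out of the scripts. The keys below are only illustrative (not the actual contents of indexer.pig.params), and the UDF constructor arguments are an assumption about how LuceneTokenizer might be parameterized:

```
# illustrative parameter file, e.g. my-indexer.pig.params (keys are assumptions)
INPUT=/user/hadoop/wikipedia/enwiki-latest-pages-articles.xml
OUTPUT=/user/hadoop/spotlight/index
LANG=en
ANALYZER_NAME=EnglishAnalyzer
STOPLIST_PATH=/user/hadoop/spotlight/stopwords.en.list
```

The script would be run with `pig -param_file my-indexer.pig.params indexer.pig`, and inside it the analyzer choice could be wired up roughly as:

```
-- constructor arguments are an assumption about LuceneTokenizer's signature
DEFINE textTokens pignlproc.index.LuceneTokenizer('$STOPLIST_PATH', '$LANG', '$ANALYZER_NAME');
```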
- The calculation of 'keyphraseness' in nerd-stats.pig is currently rather inefficient, because generating the n-grams requires parsing through each article multiple times to generate all possible surface forms up to MAX_NGRAM_LENGTH. A better approach could be to store each surface form observed as anchor text, then count the total number of times each string occurs in a second pass through the data (a sketch is given after this list). This approach could cause memory issues, however, because the set of surface forms is very large (currently 7,154,437 in the index). See NGramGenerator.java as well.
- There is a bug somewhere in the tokenization process (it may also be in the gwtwiki Bliki engine) that outputs the token {(...),(2,3,7,4),(...)} for the URI http://en.wikipedia.org/wiki/0, where the schema of the bag is {(token1, count1), (token2, count2), ...}. This would appear to be the token "2,3,7" with count=4, but we have checked all contexts (via the original pages) for this token, and it does not exist. Tokens containing commas also point to another issue that can make parsing the output of scripts such as indexer_small_cluster.pig more difficult, because the atoms in the tuples are comma-delimited. When JSON output is used this issue is eliminated, but JSON makes the index considerably larger, because field names are included with every item.
- The page titles that pignlproc produces do not match the DBpedia URI encoding scheme, so comparing URIs from DBpedia with Wikipedia page titles will fail. We will have to encode the latter (see WikiUtil.scala). This applies to all of the current indexing scripts in pignlproc.
- Disambiguation pages are included in both the sf → {(URI)} and the URI → {(contextToken, count)} datasets (a possible filter is sketched below).
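A rough Pig sketch of the two-pass keyphraseness idea above, assuming an articles relation whose links bag has (surfaceForm, uri) fields and an ngrams relation of per-article n-grams generated once up to MAX_NGRAM_LENGTH (all aliases and field names here are assumptions, not the actual nerd-stats.pig code):

```
-- pass 1: every surface form that ever occurs as anchor text
anchors  = FOREACH articles GENERATE FLATTEN(links.surfaceForm) AS sf;
sfSet    = DISTINCT anchors;

-- pass 2: count how often each of those strings occurs anywhere in the text by
-- joining the per-article n-grams against the surface-form set; with ~7M
-- surface forms the replicated join may not fit in memory, which is exactly
-- the concern raised above
ngramSf  = JOIN ngrams BY ngram, sfSet BY sf USING 'replicated';
grouped  = GROUP ngramSf BY sf;
sfCounts = FOREACH grouped GENERATE group AS sf, COUNT(ngramSf) AS total;
```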
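One crude, English-only workaround for the disambiguation-page issue, assuming a relation with a title field; a more robust fix would detect the disambiguation templates themselves:

```
-- drop pages whose title marks them as disambiguation pages before building
-- the sf -> {(URI)} and URI -> {(contextToken, count)} datasets; this heuristic
-- misses untagged titles and other languages
articlesNoDab = FILTER articles BY NOT (title MATCHES '.*\\(disambiguation\\)');
```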
Notes on compression:
- Using compression for intermediary files: the following options generally work and may substantially decrease the space taken up by the intermediate output of mappers, but they may also significantly slow down clusters where gzip decompression is a bottleneck:
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;
For disambiguation methods taking into account the position of the surrounding words, it would be good to:
- include the average distance of the context word to the surface form (for each entity)
- just save the complete sentence (in the way that ExtractOccsFromWikipedia does).