pignlproc indexing TODO and issues
This page was created to track the TODO list for Hadoop indexing using Apache Pig, and to document known issues.
We might want to fork pignlproc inside the dbpedia-spotlight account. Then, when someone wants to work on one of the things below, a GitHub issue can be opened and assigned. Or we could move all of these items into GitHub issues anyway.
- Output compressed JSON - Update: fixed using JsonCompressedStorage.
- Refactor scripts to eliminate duplicate code (making use of macros & imports where possible), and eliminate any non-essential scripts from the repo. Update: non-essential scripts have been eliminated, but we still need to add macros for the Wikipedia .xml dump processing part of the scripts and deal with some duplicate code (a macro sketch is given after this list).
- Create a default config file to simplify indexing for new users - Update: see indexer-local.pig.params and indexer.pig.params (a parameter-file sketch is given after this list).
- Allow a user-provided language argument to specify the correct analyzer for the given language, optionally including a user-provided stoplist.
*Note: for consistency, the parameter should probably be the exact Analyzer name (e.g. 'EnglishAnalyzer') as used in org.apache.lucene.analysis - otherwise we'll need to implement rather messy heuristics to figure out which analyzer the user wants.
Update: implemented for pignlproc.index.{LuceneTokenizer, GetCountsLucene}. The user must specify a language code and an analyzer name corresponding to the namespace structure of org.apache.lucene.analysis in Lucene 3.6.
- Resolve redirects - Update: redirect resolution has been implemented by Pablo in dbpedia-spotlight, using RedirectResolveFilter, but this still requires the user to first run ExtractCandidateMap. It would be better to do this directly during Hadoop indexing.
- Extract the paragraph on a page X as context for concept X
- Compute total (Wikipedia-wide) token counts
- Attempt to make context window size flexible
- Parse bullet points as context on disambiguation pages
- Modify [ParsingWikipediaLoader](https://github.com/chrishokamp/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java) to allow bzipped input - some attempts at this have already been made, but enabling compressed input for custom loaders is not as straightforward as it might seem.
- Write quality tests using PigUnit for new classes in pignlproc - GetCountsLucene and ParagraphsWithLink especially.
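As a rough sketch of the macro refactoring mentioned above, the Wikipedia .xml dump loading step could live in a shared macro file that every indexing script imports. The file name and the loader's constructor argument are assumptions for illustration, not the current pignlproc code:

```
-- wikipedia-macros.pig (hypothetical file name)
-- Shared loading step for the Wikipedia .xml dump; assumes ParsingWikipediaLoader
-- takes a language code and declares its own schema (title, text, links, ...).
DEFINE load_wikipedia(dump_path, lang) RETURNS articles {
    $articles = LOAD '$dump_path'
        USING pignlproc.storage.ParsingWikipediaLoader('$lang');
};
```

Each indexing script would then begin with something like `IMPORT 'wikipedia-macros.pig'; articles = load_wikipedia('$INPUT', '$LANG');` instead of repeating the LOAD statement.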
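For the default config file and the language/analyzer argument, a parameter file passed with `pig -param_file` keeps per-run settings out of the scripts. The keys below are only illustrative (not the actual contents of indexer.pig.params), and the UDF constructor arguments are an assumption about how LuceneTokenizer might be parameterized:

```
# illustrative parameter file, e.g. my-indexer.pig.params (keys are assumptions)
INPUT=/user/hadoop/wikipedia/enwiki-latest-pages-articles.xml
OUTPUT=/user/hadoop/spotlight/index
LANG=en
ANALYZER_NAME=EnglishAnalyzer
STOPLIST_PATH=/user/hadoop/spotlight/stopwords.en.list
```

The script would be run with `pig -param_file my-indexer.pig.params indexer.pig`, and inside it the analyzer choice could be wired up roughly as:

```
-- constructor arguments are an assumption about LuceneTokenizer's signature
DEFINE textTokens pignlproc.index.LuceneTokenizer('$STOPLIST_PATH', '$LANG', '$ANALYZER_NAME');
```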
- The calculation of 'keyphraseness' in nerd-stats.pig is currently rather inefficient, because generating the n-grams requires parsing through each article multiple times to generate all possible surface forms up to MAX_NGRAM_LENGTH. A better approach could be to store each surface form observed as anchor text, then count the total number of times each string occurs in a second pass through the data (a sketch is given after this list). This approach could cause memory issues, however, because the set of surface forms is very large (currently 7,154,437 in the index). See NGramGenerator.java as well.
- There is a bug somewhere in the tokenization process (it may also be in the gwtwiki Bliki engine) that outputs the token {(...),(2,3,7,4),(...)} for the URI http://en.wikipedia.org/wiki/0, where the schema of the bag is {(token1, count1), (token2, count2), ...}. This would appear to be the token "2,3,7" with count=4, but we have checked all contexts (via the original pages) for this token, and it does not exist. Tokens containing commas also point to another issue that can make parsing the output of scripts such as indexer_small_cluster.pig more difficult, because the atoms in the tuples are comma-delimited. When JSON output is used this issue is eliminated, but JSON makes the index considerably larger, because field names are included with every item.
- The page titles that pignlproc produces do not match the DBpedia URI encoding scheme, so comparing URIs from DBpedia with Wikipedia page titles will fail. We will have to encode the latter (see WikiUtil.scala). This applies to all of the current indexing scripts in pignlproc.
- Disambiguation pages are included in both the sf → {(URI)} and the URI → {(contextToken, count)} datasets (a possible filter is sketched below).
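A rough Pig sketch of the two-pass keyphraseness idea above, assuming an articles relation whose links bag has (surfaceForm, uri) fields and an ngrams relation of per-article n-grams generated once up to MAX_NGRAM_LENGTH (all aliases and field names here are assumptions, not the actual nerd-stats.pig code):

```
-- pass 1: every surface form that ever occurs as anchor text
anchors  = FOREACH articles GENERATE FLATTEN(links.surfaceForm) AS sf;
sfSet    = DISTINCT anchors;

-- pass 2: count how often each of those strings occurs anywhere in the text by
-- joining the per-article n-grams against the surface-form set; with ~7M
-- surface forms the replicated join may not fit in memory, which is exactly
-- the concern raised above
ngramSf  = JOIN ngrams BY ngram, sfSet BY sf USING 'replicated';
grouped  = GROUP ngramSf BY sf;
sfCounts = FOREACH grouped GENERATE group AS sf, COUNT(ngramSf) AS total;
```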
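One crude, English-only workaround for the disambiguation-page issue, assuming a relation with a title field; a more robust fix would detect the disambiguation templates themselves:

```
-- drop pages whose title marks them as disambiguation pages before building
-- the sf -> {(URI)} and URI -> {(contextToken, count)} datasets; this heuristic
-- misses untagged titles and other languages
articlesNoDab = FILTER articles BY NOT (title MATCHES '.*\\(disambiguation\\)');
```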
Notes on compression:
- Using compression for intermediary files: the following options generally work and may substantially decrease the space taken up by the intermediate output of mappers, but they may also significantly slow down clusters where gzip decompression is a bottleneck:
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;
For disambiguation methods taking into account the position of the surrounding words, it would be good to:
- include the average distance of the context word to the surface form (for each entity)
- just save the complete sentence (in the way that ExtractOccsFromWikipedia does).