GSoC2013 Progress (Zhiwei)
## About
Github Links:
- spotlight github (master branch and collective branch)
- pignlproc github (collective branch)

Proposal Link: gsoc
### Project Short Description
My project is a combination of two parts: *Generalize input formats and add support for the Google mention corpus* and *Efficient graph-based disambiguation and general performance improvements*. The first part aims to support more input formats for training the model, which may improve Spotlight's performance and widen its range of use. The second part is about integrating and optimizing graph-based disambiguation. Graph-based disambiguation was implemented last year but has not yet been integrated into the Spotlight master branch, so we have not taken advantage of it. In this part, I will integrate graph-based disambiguation with the DB-backed back-end.
### Steps to run my code:
Build Spotlight Model from Google Corpus
-
Get my code here.
-
Compile and get a jar.
mvn assembly:assembly -Dmaven.test.skip=true
-
Get part of dataSet-context-only or dataSet-full
Put them into HDFS and edit wikilink.pig.params to point to the folder containing the .gz files.
cd example/index
If using the context-only data set:
pig -m wikilink.pig.params wikilink_token_counts.pig
pig -m wikilink.pig.params wikilink_context_NRs.pig
If using the full data set:
pig -m wikilink.pig.params wikilink_token_counts.pig
pig -m wikilink.pig.params wikilink_full_NRs.pig
Note: with Hadoop, only Pig version 0.10 works; otherwise errors like "invalid stream header" may occur (I don't know the reason yet).
-
Merge the results into a local folder and combine them into four files: "pairCounts", "sfAndTotalCounts", "tokenCounts", "uriCounts".
-
Create a model from it.
java -jar dbpedia-spotlight.jar org.dbpedia.spotlight.db.CreateSpotlightModel en_US /data/spotlight/nl/processed/ /data/spotlight/nl/model_nl /data/spotlight/nl/opennlp /data/spotlight/nl/stopwords.nl.list EnglishStemmer
A built model can be found here, TSV counts.
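The merge step can be sketched locally with a small script. This is a hedged sketch: it assumes each pig job leaves Hadoop-style `part-*` files in one output directory per count type, and the directory and output names here are illustrative.

```python
import glob
import os

def merge_parts(part_dir, out_file):
    """Concatenate Hadoop-style part-* files from one output
    directory into a single local file, in sorted order."""
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_file, "wb") as out:
        for p in parts:
            with open(p, "rb") as f:
                out.write(f.read())

# Illustrative usage: one merged file per count type.
# for name in ["pairCounts", "sfAndTotalCounts", "tokenCounts", "uriCounts"]:
#     merge_parts(name, name + ".merged")
```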
I have updated the steps to build the graph and run graphBaseDisambiguator in steps for graph disambiguation.
## Progress
### Applying Period (April 10 - May 3)
1. Set up the development environment for Spotlight and pignlproc with IntelliJ.
2. Created a Spotlight model using both pignlproc and Lucene. Ran Spotlight to annotate some text. Figured out the code structure of pignlproc and Spotlight.
3. Ran the graph-based disambiguation model in Hector's repository. Figured out the code structure of graph-based disambiguation.
4. Extended DBpediaResourceSource and CandidateMapSource to get uriCount and candidateMap from Wiki-link data.
### Community Bonding Period (May 27 - June 16)
1. Read documents about Hadoop.
2. Set up a cluster of 3 small computers with Hadoop. (June 5th)
3. Read documents about Pig Latin and UDFs. (June 11th)
4. Read some documents about RDF.
### Coding Period (June 17 - Sept 22)
Branch: https://github.com/caizhiwei/pignlproc/tree/wiki-link
June 17-18
- Figure out the structure of LoadFunc and FileInputFormat.
- Implement a skeleton of WikilinkLoader and WikilinkInputFormat.
June 19-20
- The APIs provided by UMass may not work in Hadoop, since they only support iterative reading from a local file. Try to read through the wiki-link code and find a way to read from the Hadoop FS.
- Solve some compatibility issues between the Scala code (provided by UMass) and Java.
June 21
Implement a rough version of LoadFunc and InputFormat for the wiki-link dataset.
June 24-27
Preparing for final exams this week. I will try to keep up progress over the weekend.
June 28
- Extract context in sentence format.
- Write testing code for WikiLinkInputFormat and WikiLinkLoader.
June 29
Simple processing code for wiki-link in Pig Latin.
https://github.com/caizhiwei/pignlproc/blob/wiki-link/examples/indexing/wikilink.pig
idAndMentions contains two fields (docId, mentions), and mentions is a DataBag containing tuples with three fields (anchorText, wikiUrl, context).
July 1-5
- Add position of mention
- Deal with redirects
- Transform the URIs into DBpedia URI format
- Try to find a way to normalize the intermediate data
July 8-12
- Tried to contact Sameer, who maintains the wiki-link dataset, and at my request he distributed the first part of the full dataset. The first part is about 100G and the total size is about 180G.
- Looked into the full dataset to see if there is a way to cut down its size; discussed this with Sameer.
July 21
I had an online discussion with Chris. With his help, I understood some details better. We came to the conclusion that the context-only dataset is not enough to build a good model. I realized that I should not rely on the UMass code and had spent too much time on details of their dataset. We will try to release a better version of the database from his dataset, or implement a version that can extract information directly from the network by URLs. Although it's not ideal, I will try to have a runnable version first on the code I have so far.
July 22
- Extend wikilink.pig to get token counts.
- Extend wikilink.pig to get pairCounts, uriCounts, sfAndTotalCounts.
- Build a model from the counts.
- Evaluate the result.
I use all contexts to calculate the total counts of surface forms, and since the contexts overlap, the annotation rate is low. For convenience, I only used the Wikipedia dump to evaluate, so the result is bad.
Evaluation Result for wikilink
********************
Corpus: AnnotatedTextSource
Number of occs: 2926 (original), 2926 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 687 / 2926 = 0.235
Accuracy = 1925 / 2926 = 0.658
Global MRR: 0.3389267075121255
Elapsed time: 26 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 796 | 11784
expected N | 976 | N/A
precision: 0.449210 recall: 0.063275
--------------------------------
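The precision and recall figures reported by the evaluator follow directly from the confusion matrix. As a quick sanity check, a small sketch using the cells from the table above:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from confusion-matrix cells."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Cells from the wikilink spotter table above:
#   true positives  (expected Y, actual Y) = 796
#   false positives (expected N, actual Y) = 976
#   false negatives (expected Y, actual N) = 11784
p, r = precision_recall(796, 976, 11784)
print("precision: %.6f recall: %.6f" % (p, r))  # precision: 0.449210 recall: 0.063275
```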
For comparison, I also ran the evaluation for another model built from Wikipedia dumps, using the same test set.
evaluation result for wikipedia dump
********************
Corpus: AnnotatedTextSource
Number of occs: 3799 (original), 3799 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 535 / 3799 = 0.141
Accuracy = 3090 / 3799 = 0.813
Global MRR: 0.46658653655054527
Elapsed time: 26 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 7920 | 4660
expected N | 10694 | N/A
precision: 0.425486 recall: 0.629571
--------------------------------
July 23-24
- Looked into different HTML extractors; couldn't find a perfect solution for extracting article text.
- boilerpipe works fine for extracting article text from HTML, but it doesn't have a good API to control the output format.
- Used boilerpipe's HTMLHighlighter to generate highlighted HTML and tried to process the generated HTML to get the data we need.
Here is an example of boilerpipe HTMLHighlighter input and output.
input: http://www.lsdimension.com/tag/electronic-music/
July 25-26
- Extract articleText from the full-context dataset
- Get NR counts from the articleText
- Build a model and evaluate the result
It took a lot of time and memory when using DiskIntensiveNgram, but it got better after switching to MemoryIntensiveNgram (about half an hour to process 1% of the dataset on one machine). The result is still bad (recall is still too low). I think the reason is that the training set is too small and most surface forms in the test set cannot be found in the model.
********************
Corpus: AnnotatedTextSource
Number of occs: 1523 (original), 1523 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 494 / 1523 = 0.324
Accuracy = 914 / 1523 = 0.600
Global MRR: 0.19239949179876756
Elapsed time: 16 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 458 | 12122
expected N | 615 | N/A
precision: 0.426841 recall: 0.036407
--------------------------------
July 27- August 2
- Split a test set from the wikilink dataset
- Rewrite the wikilink.pig code, removing duplicate code
- Prepare for the mid-term evaluation
- Train a better model using more training data
- Evaluate my model on the wikilink test set
The result improved but is still not good enough. I will try again later.
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 1135 | 10987
expected N | 1567 | N/A
precision: 0.420059 recall: 0.093631
--------------------------------
Now that the mid-term evaluation has passed, I will start work on task 2 and try to improve task 1 in my spare time.
Aug 5-9
- Rewrite index_pig_udf in Java
- Move the pig code for generating the graph into pignlproc and reformat it
- Run it and build the graph for a small dataset
Aug 12-16
- build graph from occurrence and co-occurrence count.
- run graph disambiguation
Aug 17-20
- rewrite a pig script to count occurrence in pignlproc
- rewrite pig script for co-occurrence count using macro
Aug 22-23
- rewrite a pig script to count occurrence in pignlproc
- rewrite pig script for co-occurrence count using macro
- store occurrence counts into json format
Aug 25-27
- Unify occurrence and co-occurrence graph generation
- Use preprocessed results from pignlproc
- Clean the code and remove unnecessary steps
The total time for counting and graph making was reduced by about 30% after this.
Aug 28-29
- Prepare code to cut down the size of the Google corpus
Sept 2-3
- clean and commit the code
Sept 4-5
- Resolve pom conflicts with the latest repository
- Move the job for generating the uriMap and surface form mapping into pignlproc
- Prepare the code to run on the cluster
- Use DBTwoStepDisambiguator to get the prior instead of ICF
Sept 7-8
- Integrate graph-based disambiguation with the DB-backed core.
Sept 9-11
- Deal with incorrect Wikipedia links
- Organize HTMLHighlighter output in a better format
Sept 13-17
- clean code
- resolve conflict
- add comment
- build a model with a large amount of data using the cluster
Sept 17-22
- clean code
- merge code
- commit code
- build model and run final evaluation
Final Evaluation Results: The Wikipedia Spotlight model and graph are built from an English Wikipedia page dump (about 100M), and the test set is extracted from it. I only use a subset (about 10%) of disambiguates.nt, instances.nt and redirects.nt (building the full resource map is slow and causes out-of-memory errors on both my computer and Spotlight). So the result is not good, but we can still compare the disambiguators: the efficiency of GraphBasedDisambiguator is clearly improved, while its accuracy is slightly lower.
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3270 (processed)
Disambiguator: Common sense baseline disambiguator.
Correct URI not found = 569 / 3270 = 0.174
Accuracy = 2043 / 3270 = 0.625
Global MRR: 0.35801568828089403
Elapsed time: 3 sec
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3270 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 712 / 3270 = 0.218
Accuracy = 2070 / 3270 = 0.633
Global MRR: 0.35426925003682236
Elapsed time: 27 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3138 (processed)
Disambiguator: OldGraphBasedDisambiguator
Correct URI not found = 681 / 3138 = 0.217
Accuracy = 2091 / 3138 = 0.666
Global MRR: 0.36198845017358477
Elapsed time: 214 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3138 (processed)
Disambiguator: GraphBasedDisambiguator
Correct URI not found = 681 / 3138 = 0.217
Accuracy = 2032 / 3138 = 0.647
Global MRR: 0.34631832697274715
Elapsed time: 84 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 5070 | 7760
expected N | 11525 | N/A
precision: 0.305514 recall: 0.395168
---------------------------------
The Google Corpus model has two versions: the full-text version uses 10% (about 20G) of the UMass data set and the context version uses 50% of it. Unlike Wikipedia, free HTML has a low annotation rate and most surface forms are useless (I get about 2.5 million surface forms), so I only keep surface forms with annotated_count >= 2 and annotated_count/total_count >= 0.1, which leaves 0.8 million surface forms. I also remove URIs that cannot be decoded correctly or have a wrong Wikipedia link format. A built model can be found here, TSV counts.
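The pruning rule above can be written down as a simple predicate. This is an illustrative sketch; the actual filtering happens inside the processing pipeline:

```python
def keep_surface_form(annotated_count, total_count):
    """Pruning rule from the text: keep a surface form only if it was
    annotated at least twice and in at least 10% of its occurrences."""
    if total_count <= 0:
        return False
    return annotated_count >= 2 and annotated_count / total_count >= 0.1
```

With roughly 2.5 million raw surface forms, this rule is what reduced the set to about 0.8 million (numbers from the paragraph above).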
Evaluation Results:
********************
Corpus: AnnotatedTextSource
Number of occs: 5798 (original), 5798 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 713 / 5798 = 0.123
Accuracy = 4447 / 5798 = 0.767
Global MRR: 0.5875105053663279
Elapsed time: 40 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 5798 (original), 5798 (processed)
Disambiguator: Common sense baseline disambiguator.
Correct URI not found = 664 / 5798 = 0.115
Accuracy = 4399 / 5798 = 0.759
Global MRR: 0.5871985373792586
Elapsed time: 3 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 4307 | 8523
expected N | 5054 | N/A
precision: 0.460100 recall: 0.335698
--------------------------------
TODOs:
-
Some Wikipedia link formats are incorrect, such as http://en.wikipedia.org/CDE and http://en.wikipedia.org/wik/Haganah. Currently I use a python script to remove wrong links. This could be done inside pignlproc using a UDF, or handled when building the model.
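The kind of check such a cleanup script performs can be sketched as follows. This is an illustrative reimplementation, since the actual python script is not shown on this page:

```python
import re

# A well-formed English Wikipedia article link: http://en.wikipedia.org/wiki/<Title>
WIKI_LINK = re.compile(r"^https?://en\.wikipedia\.org/wiki/[^/\s]+$")

def is_valid_wiki_link(url):
    """Return True only for links that use the /wiki/<Title> path."""
    return WIKI_LINK.match(url) is not None
```

Both malformed examples above fail this check, while http://en.wikipedia.org/wiki/Haganah passes.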
-
Improve the efficiency of getting ngrams from the full-page UMass data set. The page text is long, so it generates many ngrams but only very few of them are useful. The current version takes several hours to process 10% of the data and only works with DiskIntensiveNgrams.
-
The current version of GraphBasedDisambiguator uses DBTwoStepsDisambiguator to get the prior, and uses the occurrence count of a surface form to get its initial importance (similar to ICF: a rare surface form may be more important to the semantic structure of the paragraph). I tried different ways to get the initial evidence and they did not seem to affect the result much, but I think more experiments should be done here.
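The ICF-like intuition (rarer surface forms carry more initial evidence) can be sketched as an inverse-frequency formula. This is my own illustrative version, not the exact code used:

```python
import math

def initial_importance(sf_count, total_occurrences):
    """ICF-style initial evidence: importance grows as the surface
    form gets rarer, via log(N / count)."""
    return math.log(total_occurrences / sf_count)
```

A surface form seen once in a corpus of 100 occurrences scores higher than one seen ten times, matching the intuition stated above.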
-
Due to a bug in the it.unimi.dsi.webgraph package, a "file not found" error may happen when making and loading graphs. I read their source code and found that the underlyinggraph file pointed to by *Graph.properties is misjudged as an absolute path. A simple solution is to edit occsGraph.properties, occsTransposeGraph.properties, cooccsGraph.properties and semanticGraph.properties, removing the "/" before the underlyinggraph property. We can either do this in code or ask them to fix the bug.
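A scripted version of that workaround might look like this. It is a sketch: it assumes the property key is literally `underlyinggraph` and that the spurious leading "/" sits at the start of its value:

```python
def fix_underlying_graph_path(properties_text):
    """Strip the leading '/' from the underlyinggraph property so webgraph
    does not treat the value as an absolute path (workaround above).
    Assumes the key name is exactly 'underlyinggraph'."""
    fixed = []
    for line in properties_text.splitlines():
        if line.startswith("underlyinggraph=/"):
            key, value = line.split("=", 1)
            line = key + "=" + value.lstrip("/")
        fixed.append(line)
    return "\n".join(fixed)
```

Applying this to each of the four *.properties files listed above would implement the manual fix in code.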