GSoC2013 Progress (Zhiwei)
## About
Github Links:
- spotlight github (master branch and collective branch)
- pignlproc github (collective branch)

Proposal Link: gsoc
### Project Short Description
My project is a combination of two parts: *Generalize input formats and add support for the Google mention corpus* and *Efficient graph-based disambiguation and general performance improvements*. The first part aims to support more input formats for training the model, which may improve Spotlight's performance and widen its range of use. The second part is about integrating and optimizing graph-based disambiguation. Graph-based disambiguation was implemented last year but has not yet been integrated into the Spotlight master branch, so we have not taken advantage of it. In this part, I will integrate graph-based disambiguation with the DB-backed back-end.
### Steps to run my code:
Build Spotlight Model from Google Corpus
-
Get my code here.
-
Compile and get a jar.
mvn assembly:assembly -Dmaven.test.skip=true
-
Get part of dataSet-context-only or dataSet-full
Put them into HDFS and edit wikilink.pig.params to point to the folder containing the .gz files.
cd example/index
If using the context-only data set:
pig -m wikilink.pig.params wikilink_token_counts.pig
pig -m wikilink.pig.params wikilink_context_NRs.pig
If using the full data set:
pig -m wikilink.pig.params wikilink_token_counts.pig
pig -m wikilink.pig.params wikilink_full_NRs.pig
Note: with Hadoop, only Pig version 0.10 works; otherwise errors like "invalid stream header" may occur (I don't know the reason yet).
-
Merge the results into a local folder and combine them into four files: "pairCounts", "sfAndTotalCounts", "tokenCounts", "uriCounts".
-
Create a model from it.
java -jar dbpedia-spotlight.jar org.dbpedia.spotlight.db.CreateSpotlightModel en_US /data/spotlight/nl/processed/ /data/spotlight/nl/model_nl /data/spotlight/nl/opennlp /data/spotlight/nl/stopwords.nl.list EnglishStemmer
A built model can be found here, TSV counts.
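The merge step can be sketched locally with a small script. This is a hedged sketch: it assumes each pig job leaves Hadoop-style `part-*` files in one output directory per count type, and the directory and output names here are illustrative.

```python
import glob
import os

def merge_parts(part_dir, out_file):
    """Concatenate Hadoop-style part-* files from one output
    directory into a single local file, in sorted order."""
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_file, "wb") as out:
        for p in parts:
            with open(p, "rb") as f:
                out.write(f.read())

# Illustrative usage: one merged file per count type.
# for name in ["pairCounts", "sfAndTotalCounts", "tokenCounts", "uriCounts"]:
#     merge_parts(name, name + ".merged")
```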
I have updated the steps to build the graph and run graphBaseDisambiguator in steps for graph disambiguation.
## Progress
### Applying Period (April 10 - May 3)
1. Set up the development environment for Spotlight and pignlproc with IntelliJ.
2. Created a Spotlight model using both pignlproc and Lucene. Ran Spotlight to annotate some text. Figured out the code structure of pignlproc and Spotlight.
3. Ran the graph-based disambiguation model in Hector's repository. Figured out the code structure of graph-based disambiguation.
4. Extended DBpediaResourceSource and CandidateMapSource to get uriCount and candidateMap from Wiki-link data.
### Community Bonding Period (May 27 - June 16)
1. Read documents about Hadoop.
2. Set up a cluster of 3 small computers with Hadoop. (June 5th)
3. Read documents about Pig Latin and UDFs. (June 11th)
4. Read some documents about RDF.
### Coding Period (June 17 - Sept 22)
Branch: https://github.com/caizhiwei/pignlproc/tree/wiki-link
June 17-18
- Figure out the structure of LoadFunc and FileInputFormat.
- Implement a skeleton of WikilinkLoader and WikilinkInputFormat.
June 19-20
- The APIs provided by UMass may not work in Hadoop, since they only support iterative reading from a local file. Try to read through the wiki-link code and find a way to read from the Hadoop FS.
- Solve some compatibility issues between the Scala code (provided by UMass) and Java.
June 21
Implement a rough version of LoadFunc and InputFormat for the wiki-link dataset.
June 24-27
Preparing for final exams this week. I will try to keep up progress over the weekend.
June 28
- Extract context in sentence format.
- Write testing code for WikiLinkInputFormat and WikiLinkLoader.
June 29
Simple processing code for wiki-link in Pig Latin.
https://github.com/caizhiwei/pignlproc/blob/wiki-link/examples/indexing/wikilink.pig
idAndMentions contains two fields (docId, mentions), and mentions is a DataBag containing tuples with three fields (anchorText, wikiUrl, context).
July 1-5
- Add position of mention
- Deal with redirects
- Transform the URIs into DBpedia URI format
- Try to find a way to normalize the intermediate data
July 8-12
- Tried to contact Sameer, who maintains the wiki-link dataset, and at my request he distributed the first part of the full dataset. The first part is about 100G and the total size is about 180G.
- Looked into the full dataset to see if there is a way to cut down its size; discussed this with Sameer.
July 21
I had an online discussion with Chris. With his help, I understood some details better. We came to the conclusion that the context-only dataset is not enough to build a good model. I realized that I should not rely on the UMass code and had spent too much time on details of their dataset. We will try to release a better version of the database from his dataset, or implement a version that can extract information directly from the network by URLs. Although it's not ideal, I will try to have a runnable version first on the code I have so far.
July 22
- Extend wikilink.pig to get token counts.
- Extend wikilink.pig to get pairCounts, uriCounts, sfAndTotalCounts.
- Build a model from the counts.
- Evaluate the result.
I use all contexts to calculate the total counts of surface forms, and since the contexts overlap, the annotation rate is low. For convenience, I only used the Wikipedia dump to evaluate, so the result is bad.
Evaluation Result for wikilink
********************
Corpus: AnnotatedTextSource
Number of occs: 2926 (original), 2926 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 687 / 2926 = 0.235
Accuracy = 1925 / 2926 = 0.658
Global MRR: 0.3389267075121255
Elapsed time: 26 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 796 | 11784
expected N | 976 | N/A
precision: 0.449210 recall: 0.063275
--------------------------------
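The precision and recall figures reported by the evaluator follow directly from the confusion matrix. As a quick sanity check, a small sketch using the cells from the table above:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from confusion-matrix cells."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Cells from the wikilink spotter table above:
#   true positives  (expected Y, actual Y) = 796
#   false positives (expected N, actual Y) = 976
#   false negatives (expected Y, actual N) = 11784
p, r = precision_recall(796, 976, 11784)
print("precision: %.6f recall: %.6f" % (p, r))  # precision: 0.449210 recall: 0.063275
```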
For comparison, I also ran the evaluation for another model built from Wikipedia dumps, using the same test set.
evaluation result for wikipedia dump
********************
Corpus: AnnotatedTextSource
Number of occs: 3799 (original), 3799 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 535 / 3799 = 0.141
Accuracy = 3090 / 3799 = 0.813
Global MRR: 0.46658653655054527
Elapsed time: 26 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 7920 | 4660
expected N | 10694 | N/A
precision: 0.425486 recall: 0.629571
--------------------------------
July 23-24
- Looked into different HTML extractors; couldn't find a perfect solution for extracting article text.
- boilerpipe works fine for extracting article text from HTML, but it doesn't have a good API to control the output format.
- Used boilerpipe's HTMLHighlighter to generate highlighted HTML and tried to process the generated HTML to get the data we need.
Here is an example of boilerpipe HTMLHighlighter input and output.
input: http://www.lsdimension.com/tag/electronic-music/
July 25-26
- Extract articleText from the full-context dataset
- Get NR counts from the articleText
- Build a model and evaluate the result
It took a lot of time and memory when using DiskIntensiveNgram, but it got better after switching to MemoryIntensiveNgram (about half an hour to process 1% of the dataset on one machine). The result is still bad (recall is still too low). I think the reason is that the training set is too small and most surface forms in the test set cannot be found in the model.
********************
Corpus: AnnotatedTextSource
Number of occs: 1523 (original), 1523 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity,UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 494 / 1523 = 0.324
Accuracy = 914 / 1523 = 0.600
Global MRR: 0.19239949179876756
Elapsed time: 16 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 458 | 12122
expected N | 615 | N/A
precision: 0.426841 recall: 0.036407
--------------------------------
July 27- August 2
- Split a test set from the wikilink dataset
- Rewrite the wikilink.pig code, removing duplicate code
- Prepare for the mid-term evaluation
- Train a better model using more training data
- Evaluate my model on the wikilink test set
The result improved but is still not good enough. I will try again later.
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 1135 | 10987
expected N | 1567 | N/A
precision: 0.420059 recall: 0.093631
--------------------------------
Now that the mid-term evaluation has passed, I will start work on task 2 and try to improve task 1 in my spare time.
Aug 5-9
- Rewrite index_pig_udf in Java
- Move the pig code for generating the graph into pignlproc and reformat it
- Run it and build the graph for a small dataset
Aug 12-16
- build graph from occurrence and co-occurrence count.
- run graph disambiguation
Aug 17-20
- rewrite a pig script to count occurrence in pignlproc
- rewrite pig script for co-occurrence count using macro
Aug 22-23
- rewrite a pig script to count occurrence in pignlproc
- rewrite pig script for co-occurrence count using macro
- store occurrence counts into json format
Aug 25-27
- Unify occurrence and co-occurrence graph generation
- Use preprocessed results from pignlproc
- Clean the code and remove unnecessary steps
The total time for counting and graph making was reduced by about 30% after this.
Aug 28-29
- Prepare code to cut down the size of the Google corpus
Sept 2-3
- clean and commit the code
Sept 4-5
- Resolve pom conflicts with the latest repository
- Move the job for generating the uriMap and surface form mapping into pignlproc
- Prepare the code to run on the cluster
- Use DBTwoStepDisambiguator to get the prior instead of ICF
Sept 7-8
- Integrate graph-based disambiguation with the DB-backed core.
Sept 9-11
- Deal with incorrect Wikipedia links
- Organize HTMLHighlighter output in a better format
Sept 13-17
- clean code
- resolve conflict
- add comment
- build a model with a large amount of data using the cluster
Sept 17-22
- clean code
- merge code
- commit code
- build model and run final evaluation
Final Evaluation Results: The Wikipedia Spotlight model and graph are built from an English Wikipedia page dump (about 100M), and the test set is extracted from it. I only use a subset (about 10%) of disambiguates.nt, instances.nt and redirects.nt (building the full resource map is slow and causes out-of-memory errors on both my computer and Spotlight). So the result is not good, but we can still compare the disambiguators: the efficiency of GraphBasedDisambiguator is clearly improved, while its accuracy is slightly lower.
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3270 (processed)
Disambiguator: Common sense baseline disambiguator.
Correct URI not found = 569 / 3270 = 0.174
Accuracy = 2043 / 3270 = 0.625
Global MRR: 0.35801568828089403
Elapsed time: 3 sec
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3270 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 712 / 3270 = 0.218
Accuracy = 2070 / 3270 = 0.633
Global MRR: 0.35426925003682236
Elapsed time: 27 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3138 (processed)
Disambiguator: OldGraphBasedDisambiguator
Correct URI not found = 681 / 3138 = 0.217
Accuracy = 2091 / 3138 = 0.666
Global MRR: 0.36198845017358477
Elapsed time: 214 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 3270 (original), 3138 (processed)
Disambiguator: GraphBasedDisambiguator
Correct URI not found = 681 / 3138 = 0.217
Accuracy = 2032 / 3138 = 0.647
Global MRR: 0.34631832697274715
Elapsed time: 84 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 5070 | 7760
expected N | 11525 | N/A
precision: 0.305514 recall: 0.395168
---------------------------------
The Google Corpus model has two versions: the full-text version uses 10% (about 20G) of the UMass data set and the context version uses 50% of it. Unlike Wikipedia, free HTML has a low annotation rate and most surface forms are useless (I get about 2.5 million surface forms), so I only keep surface forms with annotated_count >= 2 and annotated_count/total_count >= 0.1, which leaves 0.8 million surface forms. I also remove URIs that cannot be decoded correctly or have a wrong Wikipedia link format. A built model can be found here, TSV counts.
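The pruning rule above can be written down as a simple predicate. This is an illustrative sketch; the actual filtering happens inside the processing pipeline:

```python
def keep_surface_form(annotated_count, total_count):
    """Pruning rule from the text: keep a surface form only if it was
    annotated at least twice and in at least 10% of its occurrences."""
    if total_count <= 0:
        return False
    return annotated_count >= 2 and annotated_count / total_count >= 0.1
```

With roughly 2.5 million raw surface forms, this rule is what reduced the set to about 0.8 million (numbers from the paragraph above).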
Evaluation Results:
********************
Corpus: AnnotatedTextSource
Number of occs: 5798 (original), 5798 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture[P(e),P(c|e),P(s|e)])
Correct URI not found = 713 / 5798 = 0.123
Accuracy = 4447 / 5798 = 0.767
Global MRR: 0.5875105053663279
Elapsed time: 40 sec
********************
********************
Corpus: AnnotatedTextSource
Number of occs: 5798 (original), 5798 (processed)
Disambiguator: Common sense baseline disambiguator.
Correct URI not found = 664 / 5798 = 0.115
Accuracy = 4399 / 5798 = 0.759
Global MRR: 0.5871985373792586
Elapsed time: 3 sec
********************
SpotterWrapper[OpenNLPSpotter] and corpus AnnotatedTextSource
| actual Y | actual N
expected Y | 4307 | 8523
expected N | 5054 | N/A
precision: 0.460100 recall: 0.335698
--------------------------------
TODOs:
-
Some Wikipedia link formats are incorrect, such as http://en.wikipedia.org/CDE and http://en.wikipedia.org/wik/Haganah. Currently I use a python script to remove wrong links. This could be done inside pignlproc using a UDF, or handled when building the model.
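The kind of check such a cleanup script performs can be sketched as follows. This is an illustrative reimplementation, since the actual python script is not shown on this page:

```python
import re

# A well-formed English Wikipedia article link: http://en.wikipedia.org/wiki/<Title>
WIKI_LINK = re.compile(r"^https?://en\.wikipedia\.org/wiki/[^/\s]+$")

def is_valid_wiki_link(url):
    """Return True only for links that use the /wiki/<Title> path."""
    return WIKI_LINK.match(url) is not None
```

Both malformed examples above fail this check, while http://en.wikipedia.org/wiki/Haganah passes.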
-
Improve the efficiency of getting ngrams from the full-page UMass data set. The page text is long, so it generates many ngrams but only very few of them are useful. The current version takes several hours to process 10% of the data and only works with DiskIntensiveNgrams.
-
The current version of GraphBasedDisambiguator uses DBTwoStepsDisambiguator to get the prior, and uses the occurrence count of a surface form to get its initial importance (similar to ICF: a rare surface form may be more important to the semantic structure of the paragraph). I tried different ways to get the initial evidence and they did not seem to affect the result much, but I think more experiments should be done here.
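The ICF-like intuition (rarer surface forms carry more initial evidence) can be sketched as an inverse-frequency formula. This is my own illustrative version, not the exact code used:

```python
import math

def initial_importance(sf_count, total_occurrences):
    """ICF-style initial evidence: importance grows as the surface
    form gets rarer, via log(N / count)."""
    return math.log(total_occurrences / sf_count)
```

A surface form seen once in a corpus of 100 occurrences scores higher than one seen ten times, matching the intuition stated above.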
-
Due to a bug in the it.unimi.dsi.webgraph package, a "file not found" error may happen when making and loading graphs. I read their source code and found that the underlyinggraph file pointed to by *Graph.properties is misjudged as an absolute path. A simple solution is to edit occsGraph.properties, occsTransposeGraph.properties, cooccsGraph.properties and semanticGraph.properties, removing the "/" before the underlyinggraph property. We can either do this in code or ask them to fix the bug.
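A scripted version of that workaround might look like this. It is a sketch: it assumes the property key is literally `underlyinggraph` and that the spurious leading "/" sits at the start of its value:

```python
def fix_underlying_graph_path(properties_text):
    """Strip the leading '/' from the underlyinggraph property so webgraph
    does not treat the value as an absolute path (workaround above).
    Assumes the key name is exactly 'underlyinggraph'."""
    fixed = []
    for line in properties_text.splitlines():
        if line.startswith("underlyinggraph=/"):
            key, value = line.split("=", 1)
            line = key + "=" + value.lstrip("/")
        fixed.append(line)
    return "\n".join(fixed)
```

Applying this to each of the four *.properties files listed above would implement the manual fix in code.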