Splitting occurrences into topics

One of the interesting things you can do with a built index (in this case the Lucene-based core) is to split the extracted occurrences file into topics. If you don't have an index yet, check the information here.

This can be achieved in a few simple steps. First, you have to define the topics. This is done in the topic description file (an example here), which is structured as follows. Note that you currently do not have to assign mediatopics.

<topics>
  <!-- Arts and entertainment -->

  <topic name="tvseries_animation_cartoon">  
    <iptc mediatopic="20000003"/>
    <iptc mediatopic="20000004"/>
    <categories>Animation,Cartooning</categories>
  </topic>

  <topic name="cinema">
    <iptc mediatopic="20000005"/>
    <categories>Film</categories>
  </topic>


  <topic name="literature">
    <iptc mediatopic="20000013"/>
    <categories>Literature</categories>
  </topic>
...
</topics>

The second step is to choose existing Wikipedia categories for each topic. This should be done with care: every topic needs some representative categories. They are used to create an initial split, i.e. each occurrence of a resource that is a member of one of the assigned categories is initially assigned to the respective topic.
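The initial split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by the "topical" module: the function names, the in-memory data layout and the example resources are assumptions, and a resource that matches several topics is simply assigned to all of them here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the topic description example from this page.
TOPIC_XML = """
<topics>
  <topic name="tvseries_animation_cartoon">
    <iptc mediatopic="20000003"/>
    <iptc mediatopic="20000004"/>
    <categories>Animation,Cartooning</categories>
  </topic>
  <topic name="cinema">
    <iptc mediatopic="20000005"/>
    <categories>Film</categories>
  </topic>
</topics>
"""


def load_category_to_topic(xml_text):
    """Map every category listed in the description file to its topic name."""
    mapping = {}
    for topic in ET.fromstring(xml_text.strip()).iter("topic"):
        categories = topic.findtext("categories", default="")
        for category in categories.split(","):
            if category.strip():
                mapping[category.strip()] = topic.get("name")
    return mapping


def initial_split(occurrences, resource_categories, category_to_topic):
    """Assign each (resource, text) occurrence by the categories of its resource.

    resource_categories is a hypothetical dict {resource: [category, ...]};
    occurrences whose resource matches no topic category stay unassigned.
    """
    assigned, unassigned = {}, []
    for resource, text in occurrences:
        topics = {
            category_to_topic[c]
            for c in resource_categories.get(resource, ())
            if c in category_to_topic
        }
        if not topics:
            unassigned.append((resource, text))
        for topic in sorted(topics):
            assigned.setdefault(topic, []).append((resource, text))
    return assigned, unassigned


if __name__ == "__main__":
    mapping = load_category_to_topic(TOPIC_XML)
    # Made-up resources and category memberships, for illustration only.
    memberships = {"Resource_A": ["Animation"], "Resource_B": ["Film"]}
    occs = [("Resource_A", "text 1"), ("Resource_B", "text 2"), ("Resource_C", "text 3")]
    print(initial_split(occs, memberships, mapping))
```

Occurrences that stay unassigned after this step are the ones the classifier later tries to place.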

After the topic description file has been created, some additional settings have to be added to conf/indexing.properties, as the following snippet illustrates.

 org.dbpedia.spotlight.data.sortedArticlesCategories=/pathTo/sorted.article_categories_en.nt
 # only NaiveBayesTopicalClassifier is available so far
 org.dbpedia.spotlight.topic.classifier.type=NaiveBayesTopicalClassifier
 org.dbpedia.spotlight.topic.description=conf/topic_descriptions.xml

Once this is done, splitting can be started with the following command in the "topical" module.

mvn scala:run -Xmx4g -DmainClass=org.dbpedia.spotlight.topical.index.SplitOccsSemiSupervised "-DaddArgs=conf/indexing.properties /path/to/sortedOccsFile /path/to/some/tmp_dir assigning_threshold iterations outputDir(same_partition_as_tmp)"

The important parameters here are the assigning threshold and the number of iterations. The threshold is the minimal classifier confidence at which a prediction is considered reliable. Tests on the initial split have shown that 0.3 is a good threshold. The following example thresholds are listed with the fraction of examples scored above each threshold and the accuracy on those examples:

threshold   fraction of examples   accuracy on those examples  
0.01        1.0                    0.66  
0.1         0.97                   0.67  
0.2         0.73                   0.74  
0.3         0.52                   0.8  
0.4         0.38                   0.85  
0.5         0.29                   0.89
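To see how such a table can be produced for your own topic description file, here is a minimal sketch. It assumes you have a list of classifier predictions, each with a confidence and a flag telling whether the prediction was correct; the function name and the sample data are made up, and treating a confidence equal to the threshold as accepted is an assumption as well.

```python
def coverage_and_accuracy(predictions, threshold):
    """Return (fraction of predictions kept, accuracy on the kept ones).

    predictions is a list of (confidence, is_correct) pairs. A prediction is
    kept when its confidence reaches the threshold (>= is an assumption).
    Accuracy is None when nothing is kept.
    """
    kept = [is_correct for confidence, is_correct in predictions
            if confidence >= threshold]
    coverage = len(kept) / len(predictions) if predictions else 0.0
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    sample = [(0.05, False), (0.15, True), (0.25, False), (0.35, True),
              (0.35, False), (0.45, True), (0.55, True), (0.60, True)]
    for threshold in (0.1, 0.3, 0.5):
        print(threshold, coverage_and_accuracy(sample, threshold))
```

A higher threshold keeps fewer examples but the kept ones are more often correct, which is the trade-off shown in the table.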

The number of iterations can be low, e.g. 1 or 2. With 1 iteration, only the classifier trained on the initial split is used for splitting. With 2 iterations, the classifier obtained in the first iteration is used in the second iteration. The latter is the better option, since the initial split is biased.

After the first iteration, splitting is done incrementally: the algorithm only tries to assign the still unassigned occurrences with the newly trained classifier and keeps the previously assigned occurrences.
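The interplay of threshold and iterations can be sketched as a small self-training loop. This is one possible reading of the description above, not the actual SplitOccsSemiSupervised implementation: the tiny word-count Naive Bayes, the function names and the in-memory lists are assumptions made to keep the example self-contained.

```python
import math
from collections import Counter


def train_naive_bayes(assigned):
    """Train a toy multinomial Naive Bayes on {topic: [text, ...]}.

    Returns a function text -> (best_topic, posterior_of_best_topic).
    """
    topics = [t for t, texts in assigned.items() if texts]
    n_docs = sum(len(assigned[t]) for t in topics)
    priors = {t: len(assigned[t]) / n_docs for t in topics}
    counts = {t: Counter(w for text in assigned[t] for w in text.split())
              for t in topics}
    totals = {t: sum(counts[t].values()) for t in topics}
    vocab_size = len({w for t in topics for w in counts[t]})

    def classify(text):
        log_scores = {}
        for t in topics:
            score = math.log(priors[t])
            for w in text.split():
                # Add-one smoothing with one extra slot for unseen words.
                score += math.log((counts[t][w] + 1) / (totals[t] + vocab_size + 1))
            log_scores[t] = score
        top = max(log_scores.values())
        weights = {t: math.exp(s - top) for t, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

    return classify


def split_semi_supervised(initial, unassigned, threshold, iterations):
    """Assign unassigned texts over several iterations, keeping earlier ones."""
    assigned = {t: list(texts) for t, texts in initial.items()}
    remaining = list(unassigned)
    for _ in range(iterations):
        classify = train_naive_bayes(assigned)  # retrain on the current split
        still_unassigned = []
        for text in remaining:
            topic, confidence = classify(text)
            if confidence >= threshold:
                assigned[topic].append(text)
            else:
                still_unassigned.append(text)
        remaining = still_unassigned
    return assigned, remaining


if __name__ == "__main__":
    # Made-up occurrence contexts, for illustration only.
    initial = {"cinema": ["film director actor", "film movie scene"],
               "literature": ["novel author poem", "book novel chapter"]}
    print(split_semi_supervised(initial, ["movie actor", "poem book", "xyz"], 0.6, 2))
```

In each iteration the classifier is retrained on everything assigned so far, and only the texts that are still unassigned are classified again.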

After the occurrences have been split into the output directory, one or more of the resulting splits (merged, if you use several) can be used to build a Spotlight model. This model will only annotate topically related terms. The process takes about 9 hours for 2 iterations with threshold 0.3 and the example topic description file on a single i7 core. Unfortunately, this algorithm has not been parallelized yet. The process only works for English at the moment. If you wish to test it with another language, you will need to modify the code a little and change the respective files accordingly.
