GSoC2012 Progress (Jo)

Development branch

GSoC 2012 Progress Report

Progress

May 21 - May 27, 2012

looked into databases (mostly NoSQL, embedded), see Database comparison (Google Doc, not entirely complete yet); current front-runner is Apache Cassandra, although BerkeleyDB makes a very good impression (unfortunately its license is incompatible with Apache v2)

Jun 1

added 2 additional DBs to the evaluation
first draft of interfaces
initial testing and implementation of JDBM
my DB fork: https://github.com/jodaiber/dbpedia-spotlight-db

Jun 6

implemented Pablo's indexing interfaces for JDBM and started with in-memory indexing
began writing Sources for indexing from TSV
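
To illustrate what a Source for indexing from TSV could look like, here is a minimal Java sketch. It is not the code in the fork: the class name `TsvSurfaceFormSource` and the two-column layout (surface form, then count) are assumptions made for this example, and the lines are passed in as a list rather than read from a file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns "surfaceForm<TAB>count" lines into a count map.
// Class name and column layout are assumptions, not the project's actual format.
class TsvSurfaceFormSource {

    // Returns surface form -> summed count; malformed lines are skipped.
    static Map<String, Integer> read(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // wrong number of columns
            }
            try {
                int count = Integer.parseInt(cols[1].trim());
                counts.merge(cols[0], count, Integer::sum);
            } catch (NumberFormatException e) {
                // count column is not a number: skip the line
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Berlin\t120",
                "Berlin\t5",
                "line without a tab",
                "Paris\t80");
        System.out.println(read(lines));
    }
}
```

In a real indexer the lines would come from the TSV file on disk, and the resulting map would be handed to the store indexer.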

Jun 7

started MemoryResourceStore, MemoryTokenStore, extended ResourceStore interface to be able to query by URL, extended MemoryStoreIndexer
added DBpediaResourceSource
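
The idea of an in-memory resource store that can be queried both by internal ID and by URL can be sketched as follows. This is a simplified illustration, not the actual `MemoryResourceStore`: the class name `MemoryResourceStoreSketch`, the `Resource` fields and the method names are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an in-memory resource store with two lookup paths.
class MemoryResourceStoreSketch {

    // Minimal resource record; the fields are assumptions for this example.
    static final class Resource {
        final int id;
        final String uri;
        final int support;

        Resource(int id, String uri, int support) {
            this.id = id;
            this.uri = uri;
            this.support = support;
        }
    }

    private final List<Resource> byId = new ArrayList<>();        // index = ID
    private final Map<String, Integer> idByUri = new HashMap<>(); // URI -> ID

    // Adds a resource and returns its ID; an already known URI keeps its old ID.
    int add(String uri, int support) {
        Integer existing = idByUri.get(uri);
        if (existing != null) {
            return existing;
        }
        int id = byId.size();
        byId.add(new Resource(id, uri, support));
        idByUri.put(uri, id);
        return id;
    }

    // Lookup by internal ID; returns null for unknown IDs.
    Resource getById(int id) {
        return (id >= 0 && id < byId.size()) ? byId.get(id) : null;
    }

    // Lookup by URI; returns null for unknown URIs.
    Resource getByUri(String uri) {
        Integer id = idByUri.get(uri);
        return id == null ? null : byId.get(id);
    }

    public static void main(String[] args) {
        MemoryResourceStoreSketch store = new MemoryResourceStoreSketch();
        int id = store.add("http://dbpedia.org/resource/Berlin", 42);
        System.out.println(store.getById(id).uri);
        System.out.println(store.getByUri("http://dbpedia.org/resource/Berlin").support);
    }
}
```

Keeping the resources in a list indexed by ID and adding a separate URI-to-ID map means that other stores only need to hold integer IDs, while lookups by URL remain a single hash access.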

Jun 8 - Jun 10

finished working versions of FileSources and Storage (in-memory and disk-based) for SurfaceForm, DBpediaResource, Candidate Map
created OntologyTypeStore (part of ResourceStore)
started ContextStore
efficient (de-)serialization with Kryo

Some statistics on the in-memory versions on my MacBook (8GB RAM):

Used heap space with all 3 stores in memory: ~1.6-1.7GB

Store	Startup	Disk space (not compressed)	Memory usage (no GC)
MemorySurfaceFormStore (thresh10-TRD)	9587ms	139MB	782MB
MemoryResourceStore (DBpedia, Freebase, Schema types)	18006ms	165MB	508MB
MemoryCandidateMapStore (no threshold)	5831ms	123MB	427MB

Jun 11 - Jun 12

added TokenSource, working version of TokenStore+TSV indexing, added Tokenizer
added TokenOccurrenceSource+TSV indexing
extended calculations in DBTwoStepDisambiguator
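
As a rough illustration of how a tokenizer and a token store can work together, the Java sketch below splits text into lowercase tokens and maps each distinct token to an integer ID. The class name `TokenStoreSketch`, the splitting rule and the ID assignment are assumptions for this example and may differ from the actual Tokenizer and TokenStore.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a simple tokenizer plus a token -> ID store.
class TokenStoreSketch {

    private final Map<String, Integer> idByToken = new HashMap<>();

    // Lowercases the text and splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the ID of a token, assigning the next free ID to unseen tokens.
    int idFor(String token) {
        Integer id = idByToken.get(token);
        if (id == null) {
            id = idByToken.size();
            idByToken.put(token, id);
        }
        return id;
    }

    // Tokenizes the text and replaces every token with its ID.
    List<Integer> encode(String text) {
        List<Integer> ids = new ArrayList<>();
        for (String token : tokenize(text)) {
            ids.add(idFor(token));
        }
        return ids;
    }

    public static void main(String[] args) {
        TokenStoreSketch store = new TokenStoreSketch();
        System.out.println(tokenize("Berlin is the capital of Germany."));
        System.out.println(store.encode("Berlin is large; Berlin is old."));
    }
}
```

Storing token occurrences as integer IDs instead of strings is one way to keep context counts compact in memory.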

Original plan

Apr 24 – May 21, 2012

branching/forking main repository
discussion of core architecture changes with Pablo N. Mendes and Max Jakob
getting to know and coordinate with other GSoC students

May 21 - May 27, 2012

evaluation and comparison of databases

May 27 - June 10, 2012

implementation and testing of database-backed storage using the best evaluated system
implementation of all probability calculations and smoothing

June 10 - June 24, 2012

initial tests and evaluation of the problems resulting from the changes
performance evaluations and improvements

June 24 - July 1, 2012

evaluation of features used for the entity mention model
computation of any required additional counts

July 9, 2012

MIDTERM EVALUATION

July 1 - July 13, 2012

extension of the new database structure for the LM
(for this block, I budgeted more time than I expect to need, so that possible complications with finishing the first part for the MIDTERM EVALUATION can be compensated for)

July 17 - July 25, 2012

Machine Learning Summer School in Lisbon

July 25 - August 13, 2012

smoothing, optimization and evaluation of the new language model
