GSoC2012 Progress (Jo)
sandroacoelho edited this page Jul 26, 2013 · 1 revision
- looked into databases (mostly NoSQL, embedded), see Database comparison (Google Doc, not entirely complete at the moment); current front-runner is Apache Cassandra, although BerkeleyDB makes a very good impression (unfortunately its license is incompatible with Apache v2)
- added 2 additional DBs to the evaluation
- first draft of interfaces
- initial testing and implementation of JDBM; my DB fork: https://github.com/jodaiber/dbpedia-spotlight-db
- implemented Pablo's indexing interfaces for JDBM and started with in-memory indexing
- began writing Sources for indexing from TSV
- started MemoryResourceStore, MemoryTokenStore, extended ResourceStore interface to be able to query by URL, extended MemoryStoreIndexer
- added DBpediaResourceSource
- finished working versions of FileSources and Storage (in-memory and disk-based) for SurfaceForm, DBpediaResource, Candidate Map
- created OntologyTypeStore (part of ResourceStore)
- started ContextStore
- efficient (de-)serialization with Kryo
- Some statistics on the in-memory versions on my MacBook (8GB RAM). Used heap space with all 3 stores in memory: ~1.6-1.7GB

  | Store | Startup | Disk space (not compressed) | Memory usage (no GC) |
  | --- | --- | --- | --- |
  | MemorySurfaceFormStore (thresh10-TRD) | 9587ms | 139MB | 782MB |
  | MemoryResourceStore (DBpedia, Freebase, Schema types) | 18006ms | 165MB | 508MB |
  | MemoryCandidateMapStore (no threshold) | 5831ms | 123MB | 427MB |
- added TokenSource, working version of TokenStore+TSV indexing, added Tokenizer
- added TokenOccurrenceSource+TSV indexing
- extended calculations in DBTwoStepDisambiguator
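
The combination of TSV sources and in-memory stores with lookup by ID and by URL, as described in the items above, could look roughly like the following. This is a minimal sketch and not the code from the fork: the class `TsvStoreSketch`, its nested types, the method names, and the assumed two-column layout (`URI<TAB>count`) are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class, method, and column names are hypothetical
// and are not taken from the dbpedia-spotlight-db code.
final class TsvStoreSketch {

    /** One indexed resource: dense integer ID, URI, and an annotation count. */
    static final class Resource {
        final int id;
        final String uri;
        final int support;

        Resource(int id, String uri, int support) {
            this.id = id;
            this.uri = uri;
            this.support = support;
        }
    }

    /** In-memory store: a list for lookup by ID, a hash map for lookup by URI. */
    static final class MemoryStore {
        private final List<Resource> byId = new ArrayList<>();
        private final Map<String, Integer> idByUri = new HashMap<>();

        /** Adds a resource, or sums the counts if the URI is already present. */
        void add(String uri, int support) {
            Integer existing = idByUri.get(uri);
            if (existing != null) {
                int id = existing;
                Resource old = byId.get(id);
                byId.set(id, new Resource(id, uri, old.support + support));
                return;
            }
            int id = byId.size();
            byId.add(new Resource(id, uri, support));
            idByUri.put(uri, id);
        }

        /** Returns the resource with this ID, or null if the ID is unknown. */
        Resource getById(int id) {
            return (id >= 0 && id < byId.size()) ? byId.get(id) : null;
        }

        /** Returns the resource with this URI, or null if the URI is unknown. */
        Resource getByUri(String uri) {
            Integer id = idByUri.get(uri);
            return id == null ? null : byId.get(id);
        }

        int size() {
            return byId.size();
        }
    }

    /** Indexes "URI<TAB>count" lines; rows that do not parse are skipped. */
    static MemoryStore index(Iterable<String> lines) {
        MemoryStore store = new MemoryStore();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // wrong number of columns
            }
            try {
                store.add(cols[0], Integer.parseInt(cols[1].trim()));
            } catch (NumberFormatException e) {
                // count column is not an integer: skip the row
            }
        }
        return store;
    }

    public static void main(String[] args) {
        MemoryStore store = index(Arrays.asList("Berlin\t120", "Paris\t95"));
        Resource paris = store.getByUri("Paris");
        System.out.println(paris.id + " " + paris.uri + " " + paris.support);
    }
}
```

Keeping a dense integer ID per resource lets other stores (for example a candidate map) refer to resources by ID instead of repeating URI strings, which is one plausible way to keep heap usage down; whether the actual stores do this is not stated here.
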
- branching/forking main repository
- discussion of core architecture changes with Pablo N. Mendes and Max Jakob
- getting to know and coordinate with other GSoC students
- evaluation and comparison of databases
- implementation and testing of database-backed storage using the best evaluated system
- implementation of all probability calculations and smoothing
- initial tests and evaluation of the problems resulting from the changes
- performance evaluations and improvements
- evaluation of features used for the entity mention model
- computation of any required additional counts
- MIDTERM EVALUATION
- extension of the new database structure for the LM
- (for this block, I budgeted more time than I expect to need, so that any complications in finishing the first part before the MIDTERM EVALUATION can be absorbed)
- Machine Learning Summer School in Lisbon
- smoothing, optimization and evaluation of the new language model
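
The plan items on probability calculations and smoothing do not name a smoothing scheme. Purely as an illustration of what such a step involves, here is a minimal sketch of additive (Lidstone) smoothing over token counts. The choice of additive smoothing, the class name `SmoothingSketch`, and the parameters `vocabularySize` and `alpha` are assumptions made for this example, not the method used in the project.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: additive smoothing is an assumed technique here,
// and all names are hypothetical.
final class SmoothingSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int vocabularySize; // V: number of distinct tokens assumed possible
    private final double alpha;       // pseudo-count added to every token
    private long total = 0;           // N: sum of all observed counts

    SmoothingSketch(int vocabularySize, double alpha) {
        if (vocabularySize <= 0 || alpha <= 0) {
            throw new IllegalArgumentException("vocabularySize and alpha must be positive");
        }
        this.vocabularySize = vocabularySize;
        this.alpha = alpha;
    }

    /** Records that a token was seen the given number of times. */
    void observe(String token, int count) {
        counts.merge(token, count, Integer::sum);
        total += count;
    }

    /** Smoothed estimate: (c(token) + alpha) / (N + alpha * V). */
    double probability(String token) {
        int c = counts.getOrDefault(token, 0);
        return (c + alpha) / (total + alpha * vocabularySize);
    }

    public static void main(String[] args) {
        SmoothingSketch model = new SmoothingSketch(4, 1.0);
        model.observe("berlin", 3);
        model.observe("city", 1);
        System.out.println(model.probability("berlin"));
        System.out.println(model.probability("unseen"));
    }
}
```

With smoothing, a token that never occurred in the counts still receives a small non-zero probability, so a single unseen context word does not drive a product of probabilities to zero.
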