Raw data
Jo Daiber edited this page Mar 2, 2014 · 4 revisions
We provide the raw data that is used to create entity extraction models in various languages. This data is the result of running pignlproc on the latest Wikipedia dumps.
If you use this data in your research, please cite the following paper:
@inproceedings{isem2013daiber,
title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
year = {2013},
booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
- URIs are encoded in DBpedia format, but redirects are not yet resolved. To resolve them, you can use the class WikipediaToDBpediaClosure, as used in CreateSpotlightModel.scala:
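import java.io.{File, FileInputStream}

// namespace and rawDataFolder must already be defined by the caller.
val wikipediaToDBpediaClosure = new WikipediaToDBpediaClosure(
  namespace,
  new FileInputStream(new File(rawDataFolder, "redirects.nt")),
  new FileInputStream(new File(rawDataFolder, "disambiguations.nt"))
)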
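To illustrate what resolving redirects means here, the following is a minimal sketch that follows a chain of redirects until it reaches a URI with no further redirect. It is not the implementation of WikipediaToDBpediaClosure; the object name `RedirectSketch`, the `resolve` function, and the in-memory `Map` of redirects are all hypothetical and stand in for data you would load from redirects.nt yourself.

```scala
object RedirectSketch {
  // Follows redirects (source URI -> target URI) until no further redirect
  // exists. The `seen` set guards against redirect cycles: if a URI is
  // reached a second time, resolution stops and that URI is returned.
  @scala.annotation.tailrec
  def resolve(uri: String,
              redirects: Map[String, String],
              seen: Set[String] = Set.empty): String =
    redirects.get(uri) match {
      case Some(target) if !seen(uri) => resolve(target, redirects, seen + uri)
      case _                          => uri
    }
}
```

For example, with hypothetical redirects `Map("A" -> "B", "B" -> "C")`, calling `RedirectSketch.resolve("A", ...)` follows both hops and yields the final target.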
DBpedia URI                         Count
--------------------------------------------------------------
http://en.dbpedia.org/resource/1    21
http://en.dbpedia.org/resource/7    7
http://en.dbpedia.org/resource/C    20
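The URI count lines above could be read with a small parser such as the sketch below. It assumes that the two columns are separated by a single tab character, which is an assumption about the file layout and should be checked against the actual data. The names `UriCounts`, `UriCount`, and `parse` are hypothetical.

```scala
object UriCounts {
  final case class UriCount(uri: String, count: Int)

  // Parses one line of the form "<DBpedia URI><TAB><count>".
  // Returns None for lines that do not have exactly two columns
  // or whose count is not an integer.
  def parse(line: String): Option[UriCount] =
    line.split("\t") match {
      case Array(uri, count) =>
        scala.util.Try(UriCount(uri, count.trim.toInt)).toOption
      case _ => None
    }
}
```

Returning an `Option` lets callers skip malformed lines (for instance the header and separator rows shown above) without throwing.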
Surface form    Count annotated    Count total
--------------------------------------------------------------
Berlin          49068              105915
Berloz          2                  6
9z              -1                 1
- If no total string occurrence count is available, the third column may be empty or -1.
- If the annotated count is -1, the surface form has not been observed with any DBpedia resource. Such lines exist only to report the total count of the lowercase representation of a surface form.
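The two conventions above (an empty or -1 total, and an annotated count of -1) can be made explicit when loading the data by mapping them to `None`. The following is a minimal sketch under the assumption that columns are tab-separated; the names `SurfaceFormCounts`, `SfCount`, and `parse` are hypothetical.

```scala
object SurfaceFormCounts {
  // annotated = None: never observed with a DBpedia resource (-1 in the data).
  // total     = None: no total occurrence count available (empty or -1).
  final case class SfCount(surfaceForm: String,
                           annotated: Option[Int],
                           total: Option[Int])

  // Treats empty, non-numeric, or negative fields as "not available".
  private def count(field: String): Option[Int] =
    scala.util.Try(field.trim.toInt).toOption.filter(_ >= 0)

  def parse(line: String): Option[SfCount] = {
    // A limit of -1 keeps a trailing empty third column.
    val cols = line.split("\t", -1)
    if (cols.length < 2 || cols(0).isEmpty) None
    else Some(SfCount(cols(0), count(cols(1)), cols.lift(2).flatMap(count)))
  }
}
```

With this representation, the sample line for `9z` would carry `annotated = None` and `total = Some(1)`, so code that only needs annotated surface forms can filter on `annotated.isDefined`.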
Surface form    DBpedia URI                                         Count
----------------------------------------------------------------------------
Berlin          http://en.dbpedia.org/resource/Brent_Berlin         2
Berlin          http://en.dbpedia.org/resource/Trams_in_Berlin      9
Berlin          http://en.dbpedia.org/resource/Berlin_(Seedorf)     1
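One way to work with the pair counts above is to group them by surface form, so that each surface form maps to its candidate URIs ordered by count. The sketch below assumes tab-separated columns; `PairCounts`, `Pair`, `parse`, and `candidates` are hypothetical names, not part of DBpedia Spotlight.

```scala
object PairCounts {
  final case class Pair(surfaceForm: String, uri: String, count: Int)

  // Parses "<surface form><TAB><DBpedia URI><TAB><count>".
  def parse(line: String): Option[Pair] =
    line.split("\t") match {
      case Array(sf, uri, n) =>
        scala.util.Try(Pair(sf, uri, n.trim.toInt)).toOption
      case _ => None
    }

  // Groups the parsed pairs by surface form; candidates are sorted by
  // descending count. Lines that fail to parse are skipped.
  def candidates(lines: Iterator[String]): Map[String, List[(String, Int)]] =
    lines.flatMap(l => parse(l).toList).toList
      .groupBy(_.surfaceForm)
      .map { case (sf, ps) => sf -> ps.map(p => (p.uri, p.count)).sortBy(-_._2) }
}
```

This loads everything into memory, which is only practical for small excerpts; for a full dump a streaming or on-disk approach would be more appropriate.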
Wikipedia URI                     Stemmed token counts
----------------------------------------------------------------------------
http://en.wikipedia.org/wiki/!    {(renam,76),(intel,14),...,(plai,2),(auf,2)}
- All tokens are stemmed
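The token count column shown above can be turned into a map from stemmed token to count with a regular expression. This is a sketch under two assumptions: the URI and the token list are separated by a tab, and tokens contain neither commas nor parentheses. The names `TokenCounts` and `parse` are hypothetical.

```scala
object TokenCounts {
  // Matches one "(token,count)" entry inside the braces.
  private val Entry = """\(([^,()]+),(\d+)\)""".r

  // Parses "<Wikipedia URI><TAB>{(token,count),...}" into the URI and a
  // map of stemmed token -> count. Returns None if there is no tab.
  def parse(line: String): Option[(String, Map[String, Int])] =
    line.split("\t", 2) match {
      case Array(uri, bag) =>
        val counts = Entry.findAllMatchIn(bag)
          .map(m => m.group(1) -> m.group(2).toInt)
          .toMap
        Some(uri -> counts)
      case _ => None
    }
}
```

Anything in the second column that does not match the `(token,count)` pattern is ignored rather than reported as an error.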