jukujala/wiki_markup_to_text
* What? Parses Wikipedia dumps to "text" format. The output has one article per line, in the format "title tab string-escaped article content". String-escaped means that control characters such as newline are written as \n.

* Usage:

    cat wiki.xml | python parse_wiki_markup.py > corpus.txt

* How to read the output in Python (Python 2; the "string-escape" codec does not exist in Python 3):

    title, content = line.split("\t")
    content = content.decode("string-escape")

* Example of output:

    Anarchism \n\n\nAnarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes ...

* Input for this output was:

    <page>
      <title>Anarchism</title>
      <id>12</id>
      <revision>
        <id>442817224</id>
        <timestamp>2011-08-03T09:10:07Z</timestamp>
        <contributor>
          <username>Eduen</username>
          <id>7527773</id>
        </contributor>
        <comment>Emma Goldman identifying anarchy as more than no state</comment>
        <text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
    {{Redirect|Anarchists}}
    {{Anarchism sidebar}}
    '''Anarchism''' is a [[political philosophy]] which considers the [[state (polity)|state]] undesirable, unnecessary, and harmful, and instead promotes a [[stateless society]], or [[anarchy]].<ref name="definition"> ...
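
The read snippet above depends on the "string-escape" codec, which exists only in Python 2. For readers on Python 3, the following is a minimal sketch of one possible equivalent, not part of this repository. It assumes the corpus file is UTF-8 encoded and that only the content field is escaped. The helper name `parse_line` is hypothetical, and `codecs.escape_decode` is an undocumented CPython function, so treat it as an assumption that it is available in your interpreter.

```python
import codecs


def parse_line(line):
    """Split one corpus line into (title, content) and undo the string escaping.

    Hypothetical helper: assumes the "title<TAB>escaped content" layout
    described above and UTF-8 text.
    """
    # Drop the trailing newline, then split on the first tab only, so any
    # later tab characters stay inside the content field.
    title, escaped = line.rstrip("\n").split("\t", 1)
    # codecs.escape_decode (undocumented, CPython) turns sequences such as
    # \n or \xc3 back into raw bytes and returns (bytes, consumed_length).
    raw, _ = codecs.escape_decode(escaped.encode("utf-8"))
    return title, raw.decode("utf-8")


sample = "Anarchism\t\\n\\n\\nAnarchism is a political philosophy\n"
title, content = parse_line(sample)
```

To process a whole corpus, open the file with `encoding="utf-8"` and call `parse_line` on each line; splitting on the first tab only keeps the parse stable even if a content field were to contain a literal tab.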
About
Parses Wikimedia XML dumps to text format and removes metadata and markup.