Parses wikimedia xml dumps to text format, removes metadata etc.

jukujala/wiki_markup_to_text

* What?
Parses Wikipedia XML dumps into a plain-text format.
The output has one article per line, in the format: title, a tab character, then the string-escaped article content.
String-escaped means that control characters such as the newline are written as escape sequences (\n).

* Usage: cat wiki.xml | python parse_wiki_markup.py > corpus.txt

* How to read output in Python 2 (the "string-escape" codec does not exist in Python 3):
  title, content = line.rstrip("\n").split("\t", 1)
  content = content.decode("string-escape")
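
* How to read output in Python 3 (a sketch, not part of this repository):
The snippet above relies on a Python 2 codec. A rough Python 3 equivalent is sketched below. It assumes the corpus was written with Python 2 "string-escape" semantics (non-ASCII bytes appear as \xNN escapes of UTF-8 text) and that titles are stored unescaped. The names parse_line and read_corpus are hypothetical, and codecs.escape_decode is an undocumented CPython helper, so treat this as a starting point rather than a guaranteed API.

```python
import codecs
import sys


def parse_line(raw_line):
    """Split one corpus line (bytes) into (title, content) as str.

    Assumes the format "title<TAB>string-escaped content" described above.
    """
    # Drop the line terminator, then split on the first tab only:
    # tabs inside the article are expected to be escaped as \t.
    title, escaped = raw_line.rstrip(b"\r\n").split(b"\t", 1)
    # escape_decode turns sequences such as \n or \xc3 back into raw bytes.
    content_bytes, _ = codecs.escape_decode(escaped)
    # Assumption: the underlying text is UTF-8.
    return title.decode("utf-8"), content_bytes.decode("utf-8")


def read_corpus(path):
    """Yield (title, content) pairs from a corpus file, skipping blank lines."""
    with open(path, "rb") as f:
        for raw_line in f:
            if raw_line.strip():
                yield parse_line(raw_line)


if __name__ == "__main__" and len(sys.argv) > 1:
    for title, content in read_corpus(sys.argv[1]):
        print(title, len(content))
```

Reading the file in binary mode keeps the unescaping step on bytes, which mirrors what the Python 2 snippet does and avoids mangling non-ASCII characters.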

* Example of output:
Anarchism       \n\n\nAnarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes ...

* Input for this output was:
  <page>
    <title>Anarchism</title>
    <id>12</id>
    <revision>
      <id>442817224</id>
      <timestamp>2011-08-03T09:10:07Z</timestamp>
      <contributor>
        <username>Eduen</username>
        <id>7527773</id>
      </contributor>
      <comment>Emma Goldman identifying anarchy as more than no state</comment>
      <text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Anarchism sidebar}}

'''Anarchism''' is a [[political philosophy]] which considers the [[state (polity)|state]] undesirable, unnecessary, and harmful, and instead promotes a [[stateless society]], or [[anarchy]].&lt;ref name=&quot;definition&quot;&gt;
...
