* What?
Parses Wikipedia XML dumps into a plain-text format.
The output has one article per line, in the format "title tab string-escaped article content".
String-escaped means that control characters are written as escape sequences, e.g. a newline is written as \n.
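To make the line format concrete, here is a minimal Python 3 sketch of how one output line could be built. It is not taken from parse_wiki_markup.py; the function name format_line is hypothetical, and it assumes the content is UTF-8 text whose non-ASCII bytes are written as \xNN escapes, as the Python 2 "string-escape" codec does. One known difference: "string-escape" also escapes single quotes, which this sketch does not.

```python
def format_line(title, content):
    """Build one output line: title, a tab, then the escaped content.

    Hypothetical helper for illustration, not the script's own code.
    """
    # Go through UTF-8 bytes so that non-ASCII characters come out as
    # \xNN byte escapes; "unicode_escape" also turns newline, tab and
    # backslash into \n, \t and \\.
    raw_bytes = content.encode("utf-8")
    escaped = raw_bytes.decode("latin-1").encode("unicode_escape")
    return title + "\t" + escaped.decode("ascii")
```

Because newlines and tabs inside the article are escaped, the only raw tab on a line is the separator and the only raw newline is the line terminator.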
* Usage: cat wiki.xml | python parse_wiki_markup.py > corpus.txt
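* How to read the output in Python (Python 2):
title, content = line.rstrip("\n").split("\t", 1)  # split on the first tab only
content = content.decode("string-escape")  # turn \n etc. back into real characters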
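The "string-escape" codec exists only in Python 2. Below is a rough Python 3 equivalent, offered as a sketch rather than as part of this project. The names parse_line and read_corpus are hypothetical, and the code assumes that titles are stored as raw UTF-8 and that non-ASCII bytes in the content were written as \xNN escapes.

```python
def parse_line(raw_line):
    """Split one output line (bytes) into (title, content) as str."""
    title, escaped = raw_line.rstrip(b"\n").split(b"\t", 1)
    # "unicode_escape" resolves \n, \t, \\ and \xNN escapes; the latin-1
    # round trip turns the resulting code points back into raw bytes so
    # that they can be decoded as UTF-8.
    content = escaped.decode("unicode_escape").encode("latin-1").decode("utf-8")
    return title.decode("utf-8"), content


def read_corpus(path):
    """Yield (title, content) pairs from a file such as corpus.txt."""
    # Binary mode: the escapes must be resolved before UTF-8 decoding.
    with open(path, "rb") as corpus:
        for raw_line in corpus:
            if raw_line.strip():
                yield parse_line(raw_line)
```

Usage would look like: for title, content in read_corpus("corpus.txt"): ...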
* Example of output:
Anarchism \n\n\nAnarchism is a political philosophy which considers the state undesirable, unnecessary, and harmful, and instead promotes ...
* Input for this output was:
<page>
<title>Anarchism</title>
<id>12</id>
<revision>
<id>442817224</id>
<timestamp>2011-08-03T09:10:07Z</timestamp>
<contributor>
<username>Eduen</username>
<id>7527773</id>
</contributor>
<comment>Emma Goldman identifying anarchy as more than no state</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] which considers the [[state (polity)|state]] undesirable, unnecessary, and harmful, and instead promotes a [[stateless society]], or [[anarchy]].<ref name="definition">
...