Commit 1ded838

Initial commit
0 parents  commit 1ded838

7 files changed, +201 -0 lines changed

.gitignore

+6
@@ -0,0 +1,6 @@
commonscats-in-commons.txt
commonscats-in-osm.xml
*.gz
*.geojson
*.tsv
*.pbf

LICENSE

+19
@@ -0,0 +1,19 @@
Copyright (c) 2023 Lennard Hofmann

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Makefile

+34
@@ -0,0 +1,34 @@
.PHONY: geojson
geojson: out.geojson out-maproulette.geojson

.PHONY: tsv
tsv: out.tsv

planet.pbf:
	@echo "Missing planet.pbf: See https://wiki.openstreetmap.org/wiki/Planet.osm#Downloading for how to download an Osmium-compatible planet file." >&2; exit 1
# You could add a command here to download planet.pbf with a BitTorrent client

commonswiki-all-titles.gz: FORCE
	curl -z commonswiki-all-titles.gz https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz -o commonswiki-all-titles.gz

commonscats-in-commons.txt: commonswiki-all-titles.gz
# namespace 14 is "Category:"
	zcat commonswiki-all-titles.gz | grep ^14 | cut -f2 > commonscats-in-commons.txt

commonscats-in-osm.xml: planet.pbf
	osmium tags-filter -R planet.pbf 'nwr/wikimedia_commons=Category:*' -o commonscats-in-osm.xml --overwrite

planet-filtered.geojson: planet.pbf
	osmium tags-filter -t planet.pbf 'nwr/wikimedia_commons=Category:*' -o planet-filtered.pbf --overwrite
	osmium export planet-filtered.pbf -c config.json -o planet-filtered.geojson -f jsonseq --overwrite

out.tsv: commonscats-in-osm.xml commonscats-in-commons.txt
	./main.py out.tsv

out.geojson: planet-filtered.geojson commonscats-in-commons.txt
	./main.py out.geojson

out-maproulette.geojson: out.geojson
	./to_maproulette.py out.geojson > out-maproulette.geojson

FORCE: ;
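
The `commonscats-in-commons.txt` rule keeps every dump line that starts with `14`. `grep ^14` would also match a namespace number such as `140`, if the dump contained one; whether the Commons dump does is not checked here. A minimal Python sketch of the same extraction with an exact match on the namespace column could look like this. `category_titles` and the sample lines are hypothetical and not part of this commit.

```python
def category_titles(lines):
    """Yield titles from "<namespace>\\t<title>" lines whose namespace is exactly 14."""
    for line in lines:
        # Split on the first tab only, so titles are passed through untouched.
        ns, _, title = line.rstrip("\n").partition("\t")
        if ns == "14":  # "Category:" namespace
            yield title


# Hypothetical sample lines, for illustration only.
sample = ["0\tMain_Page\n", "14\tExample_category\n", "140\tSomething_else\n"]
print(list(category_titles(sample)))  # → ['Example_category']
```

The same generator could be fed from `gzip.open("commonswiki-all-titles.gz", "rt", encoding="utf-8")` instead of a list.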

README.md

+53
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# CommonsCatOSM
2+
3+
Finds OpenStreetMap elements tagged with `wikimedia_commons=Category:*` where the category does not exist on Wikimedia Commons.
4+
5+
## How to use
6+
7+
To minimize false positives, wait until Wikimedia releases a new data dump (usually at the third day of every month). You can get notified by subscribing to [the RSS feed](https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz-rss.xml).
8+
9+
1. Install [Osmium Tool](https://osmcode.org/osmium-tool/)
10+
2. Clone this repository
11+
3. Download required files into the same directory:
12+
* `planet.pbf`: an Osmium-compatible [OpenStreetMap planet](https://wiki.openstreetmap.org/wiki/Planet.osm)
13+
* [commonswiki-latest-all-titles.gz](https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz) (1.1 GB as of 2023)
14+
4. Choose an output format (see below). If you use a Unix-like operating system, you can run `make tsv` or `make geojson`. Otherwise, run the commands given below.
15+
16+
## Output formats
17+
18+
### Tab-separated values (TSV)
19+
20+
Output contains invalid category names and the OpenStreetMap ID of the node, way, or relation with that category. This is the fastest format to produce.
21+
22+
```console
23+
# Make sure you have 10min to spare
24+
$ osmium tags-filter -R planet.pbf 'nwr/wikimedia_commons=Category:*' -o commonscats-in-osm.xml
25+
$ python main.py out.tsv
26+
$ cat out.tsv
27+
n/1573735855 Conservatoire_National_de_V%C3%A9hicules_Historiques
28+
n/286133524 ref:sprockhoff No. 465
29+
n/3022117073 Wildwiesenwarte;Category:Views from the Wildwiesenwarte
30+
n/306593910 Prince George pub, Brighton Good pictures Advanced...
31+
n/6426478285 Dorfkirche_Mechow_(Kyritz)?uselang=de
32+
w/297069904 https://commons.wikimedia.org/wiki/Category:Gr%C3%BCner_Graben_14_(G%C3%B6rlitz)
33+
w/320276921 Ballyellen Upper Lock
34+
w/474166824 Nages-et-Solorgues#/media/File:Fontaine_Ranquet.jpg
35+
r/12931220 Brandenburger Straße 36;Riedelsberger Weg 2 (Bayreuth)
36+
```
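
The check behind this output can be sketched as follows. This is an illustration of the comparison, not the exact code in `main.py`; `is_missing` and the sample titles are hypothetical. It assumes Python 3.9 or newer for `str.removeprefix`.

```python
def is_missing(tag_value: str, valid_cats: set[str]) -> bool:
    """Return True if the tagged category is not among the known titles."""
    # The title list has no "Category:" prefix and uses underscores
    # instead of spaces, so the tag value is normalised the same way.
    title = tag_value.removeprefix("Category:").replace(" ", "_")
    return title not in valid_cats


# Hypothetical sample data, for illustration only.
valid = {"Example_category"}
print(is_missing("Category:Example category", valid))             # → False
print(is_missing("Category:Example_category?uselang=de", valid))  # → True
```

Because the comparison is literal, values with URL parameters, percent-encoding, or several categories joined by `;` are reported as missing, as in the sample output above.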

### Line-by-Line GeoJSON

[This format](https://learn.maproulette.org/documentation/line-by-line-geojson/) can be used to create a challenge on maproulette.org. The output may lack a few categories that appear in the TSV output.

```console
# Make sure you have 3 GB of RAM and 25 minutes to spare
$ osmium tags-filter -t planet.pbf 'nwr/wikimedia_commons=Category:*' -o planet-filtered.pbf
$ osmium export planet-filtered.pbf -c config.json -o planet-filtered.geojson -f jsonseq
$ python main.py out.geojson
$ cat out.geojson
{"type":"Feature","geometry":{"type":"Point","coordinates":[-2.0835284,53.3600557]},"properties":{"@type":"node","@id":29947059,"wikimedia_commons":"Category:Help Category:Middlewood railway station"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[10.1212064,54.3247979],[10.120334100000001,54.3242942],[10.1192733,54.3236981],[10.1199922,54.3233703],[10.1204628,54.3231298],[10.1209243,54.3228965],[10.1222211,54.3236181],[10.1212064,54.3247979]]},"properties":{"@type":"way","@id":9408975,"wikimedia_commons":"Category:Wilhelmplatz (Kiel)"}}
$ python to_maproulette.py out.geojson
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Point","coordinates":[-2.0835284,53.3600557]},"properties":{"@type":"node","@id":29947059,"wikimedia_commons":"Category:Help Category:Middlewood railway station"}}]}
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"LineString","coordinates":[[10.1212064,54.3247979],[10.120334100000001,54.3242942],[10.1192733,54.3236981],[10.1199922,54.3233703],[10.1204628,54.3231298],[10.1209243,54.3228965],[10.1222211,54.3236181],[10.1212064,54.3247979]]},"properties":{"@type":"way","@id":9408975,"wikimedia_commons":"Category:Wilhelmplatz (Kiel)"}}]}
```
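
If you want to post-process these files yourself, note that each line written by `osmium export -f jsonseq` starts with an ASCII record separator byte (`0x1e`), which is invisible in the console output above. A minimal sketch of reading such lines and wrapping each feature in its own collection might look like this; `read_features`, `to_collection`, and the sample feature are hypothetical, and Python 3.9 or newer is assumed for `bytes.removeprefix`.

```python
import json

RS = b"\x1e"  # record separator that prefixes each line


def read_features(lines):
    """Yield one GeoJSON Feature dict per line of bytes."""
    for line in lines:
        # Drop the leading separator if present, then parse the rest as JSON.
        yield json.loads(line.removeprefix(RS))


def to_collection(feature: dict) -> dict:
    """Wrap a single Feature in a FeatureCollection."""
    return {"type": "FeatureCollection", "features": [feature]}


# Hypothetical sample line, for illustration only.
sample = [
    b'\x1e{"type":"Feature","geometry":{"type":"Point","coordinates":[0.0,0.0]},'
    b'"properties":{"@type":"node","@id":1,"wikimedia_commons":"Category:Example"}}\n'
]
for feature in read_features(sample):
    print(json.dumps(to_collection(feature), separators=(",", ":")))
```

In practice the lines would come from a file opened in binary mode, for example `open("out.geojson", "rb")`.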

config.json

+18
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
{
2+
"attributes": {
3+
"type": true,
4+
"id": true,
5+
"version": false,
6+
"changeset": false,
7+
"timestamp": false,
8+
"uid": false,
9+
"user": false,
10+
"way_nodes": false
11+
},
12+
"format_options": {
13+
},
14+
"linear_tags": true,
15+
"area_tags": true,
16+
"exclude_tags": [],
17+
"include_tags": ["wikimedia_commons=Category:*"]
18+
}

main.py

+63
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
#!/usr/bin/env python3
2+
import xml.etree.ElementTree as ET
3+
import gzip
4+
import sys
5+
import json
6+
7+
VALID_CATS = "commonscats-in-commons.txt"
8+
9+
def load_valid_categories():
10+
cats = set()
11+
try:
12+
with open(VALID_CATS) as f:
13+
for line in f:
14+
cats.add(line.rstrip('\n'))
15+
except FileNotFoundError:
16+
with gzip.open("commonswiki-latest-all-titles.gz", "rt") as f_in, open(VALID_CATS, "w") as f_out:
17+
for line in f_in:
18+
ns, category_name = line.split('\t', 1)
19+
if ns == '14': # "Category:" namespace
20+
cats.add(category_name.rstrip('\n'))
21+
f_out.write(category_name)
22+
return cats
23+
24+
25+
def tsv(cats, outfile):
26+
with open(outfile, "w") as f:
27+
tree = ET.parse("commonscats-in-osm.xml")
28+
for elem in tree.findall(".//*tag[@k='wikimedia_commons']/.."):
29+
wikimedia_commons = elem.find("tag[@k='wikimedia_commons']").get('v').removeprefix("Category:")
30+
if wikimedia_commons.replace(' ', '_') not in cats:
31+
f.write(f"{elem.tag[:1]}/{elem.get('id')}\t{wikimedia_commons}\n")
32+
33+
34+
def geojson(cats, outfile):
35+
with open(outfile, "wb") as f:
36+
for line in open("planet-filtered.geojson", "rb"):
37+
wikimedia_commons = json.loads(line[1:])["properties"]["wikimedia_commons"]
38+
if wikimedia_commons.removeprefix("Category:").replace(' ', '_') not in cats:
39+
f.write(line)
40+
41+
42+
def usage():
43+
print("""Usage: ./main.py outfile.geojson
44+
./main.py outfile.tsv""",
45+
file=sys.stderr)
46+
exit(1)
47+
48+
49+
def main():
50+
if len(sys.argv) != 2:
51+
usage()
52+
53+
outfile = sys.argv[1]
54+
if outfile.endswith(".tsv"):
55+
tsv(load_valid_categories(), outfile)
56+
elif outfile.endswith(".geojson"):
57+
geojson(load_valid_categories(), outfile)
58+
else:
59+
usage()
60+
61+
62+
if __name__ == "__main__":
63+
main()
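
`tsv()` parses the whole filtered XML file into memory with `ET.parse`. If that ever becomes a concern, a streaming variant along the following lines might help. It is a sketch under the assumption that the file has the usual OSM XML layout (`node`, `way`, and `relation` elements with `tag` children); `iter_commons_tags` and `OSM_TYPES` are hypothetical names that are not part of this commit.

```python
import io
import xml.etree.ElementTree as ET

OSM_TYPES = {"node", "way", "relation"}


def iter_commons_tags(source):
    """Yield (osm_id, tag_value) pairs such as ("n/123", "Category:Foo")."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in OSM_TYPES:
            continue
        tag = elem.find("tag[@k='wikimedia_commons']")
        if tag is not None:
            yield f"{elem.tag[:1]}/{elem.get('id')}", tag.get("v")
        elem.clear()  # drop the element's children and attributes after use


# Hypothetical sample document, for illustration only.
sample = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='wikimedia_commons' v='Category:A'/></node>"
    b"<way id='2'><nd ref='1'/><tag k='name' v='x'/></way>"
    b"<relation id='3'><tag k='wikimedia_commons' v='Category:B C'/></relation>"
    b"</osm>"
)
for osm_id, value in iter_commons_tags(sample):
    print(osm_id, value)
```

`ET.iterparse` also accepts a file name, so `iter_commons_tags("commonscats-in-osm.xml")` would read the real file.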

to_maproulette.py

+8
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
#!/usr/bin/env python3
2+
import sys
3+
4+
infile = sys.argv[1]
5+
fmtstr = b'\x1e{"type":"FeatureCollection","features":[%s]}\n'
6+
7+
for line in open(infile, "rb"):
8+
sys.stdout.buffer.write(fmtstr % line[1:].rstrip(b'\n'))
