|
1 |
| -# CommonsCatOSM |
| 1 | +# CommonsChecker4OSM |
2 | 2 |
|
3 |
| -Finds OpenStreetMap elements tagged with `wikimedia_commons=Category:*` where the category does not exist on Wikimedia Commons. |
| 3 | +Finds OpenStreetMap elements tagged with `wikimedia_commons=Category:*` (or `wikimedia_commons=File:*`) where the category (or file) does not exist on Wikimedia Commons. |
| 4 | + |
| 5 | +## How it works |
| 6 | + |
| 7 | +CommonsChecker4OSM filters a list of OSM elements tagged with `wikimedia_commons=*` against a list of valid Commons categories or files. It is essentially an `fgrep -f` implementation that processes a 5-gigabyte pattern file efficiently using the [BBHash](https://github.com/rizkg/BBHash) minimal perfect hashing library. A Makefile is provided to download and prepare the input files. A Python script, [main.py](./main.py), is included and may help you understand the equivalent C++ code, but it loads the entire pattern file into RAM. |
4 | 8 |
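The filtering idea can be illustrated with a minimal sketch. It is not the code from `main.py` or the C++ implementation: it assumes the title list fits into an in-memory Python `set` and that values are compared by exact string match. The function name `find_invalid` and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def find_invalid(
    elements: Iterable[Tuple[str, str]], valid_titles: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (osm_id, value) pairs whose value is not a known Commons title."""
    # Assumption: the whole title list fits in RAM as a set. The C++ tool
    # avoids this cost by using a minimal perfect hash instead.
    known = set(valid_titles)
    for osm_id, value in elements:
        if value not in known:
            yield osm_id, value


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    titles = ["Category:Wilhelmplatz (Kiel)"]
    elements = [
        ("w/9408975", "Category:Wilhelmplatz (Kiel)"),
        ("w/320276921", "Category: Ballyellen Upper Lock"),
    ]
    for osm_id, value in find_invalid(elements, titles):
        print(f"{osm_id}\t{value}")
```

A set lookup is simple but stores every title string in memory, which is the trade-off the minimal perfect hash is meant to avoid.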
|
5 | 9 | ## How to use
|
6 | 10 |
|
7 |
| -To minimize false positives, wait until Wikimedia releases a new data dump (usually at the third day of every month). You can get notified by subscribing to [the RSS feed](https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz-rss.xml). |
| 11 | +If you want to find invalid categories, wait until Wikimedia releases a new data dump (usually on the third day of every month) to minimize false positives. You can get notified by subscribing to [the RSS feed](https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz-rss.xml). If you want to find invalid files, you do not need to wait, because that data dump is released every day. |
| 12 | + |
| 13 | +CommonsChecker4OSM has only been tested on Linux. It should work on Windows, but preparing the input files will be annoying. Step-by-step guide: |
8 | 14 |
|
9 | 15 | 1. Install [Osmium Tool](https://osmcode.org/osmium-tool/)
|
10 | 16 | 2. Clone this repository
|
11 |
| -3. Download required files into the same directory: |
| 17 | +3. Inside the cloned repo, create `data` and `out` folders |
| 18 | +4. Download required files into the `data` directory: |
12 | 19 | * `planet.pbf`: an Osmium-compatible [OpenStreetMap planet](https://wiki.openstreetmap.org/wiki/Planet.osm)
|
13 |
| - * [commonswiki-latest-all-titles.gz](https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-all-titles.gz) (1.1 GB as of 2023) |
14 |
| -4. Choose an output format (see below). If you use a Unix-like operating system, you can run `make tsv` or `make geojson`. Otherwise, run the commands given below. |
| 20 | +5. Choose an output format (see below). Run one of `make out/cats.tsv`, `make out/cats.geojson`, `make out/files.tsv`, or `make out/files.geojson` (the output filename must be one of these hardcoded strings). |
15 | 21 |
|
16 | 22 | ## Output formats
|
17 | 23 |
|
18 | 24 | ### Tab-separated values (TSV)
|
19 | 25 |
|
20 |
| -Output contains invalid category names and the OpenStreetMap ID of the node, way, or relation with that category. This is the fastest format to produce. |
| 26 | +Each output line contains the OSM identifier (node, way, or relation ID) and the `wikimedia_commons` value, separated by a tab. This is the fastest format to produce. |
21 | 27 |
|
22 | 28 | ```console
|
23 |
| -# Make sure you have 10min to spare |
24 |
| -$ osmium tags-filter -R planet.pbf 'nwr/wikimedia_commons=Category:*' -o commonscats-in-osm.xml |
25 |
| -$ python main.py out.tsv |
26 |
| -$ cat out.tsv |
27 |
| -n/1573735855 Conservatoire_National_de_V%C3%A9hicules_Historiques |
28 |
| -n/286133524 ref:sprockhoff No. 465 |
29 |
| -n/3022117073 Wildwiesenwarte;Category:Views from the Wildwiesenwarte |
30 |
| -n/306593910 Prince George pub, Brighton Good pictures Advanced... |
31 |
| -n/6426478285 Dorfkirche_Mechow_(Kyritz)?uselang=de |
32 |
| -w/297069904 https://commons.wikimedia.org/wiki/Category:Gr%C3%BCner_Graben_14_(G%C3%B6rlitz) |
33 |
| -w/320276921 Ballyellen Upper Lock |
34 |
| -w/474166824 Nages-et-Solorgues#/media/File:Fontaine_Ranquet.jpg |
35 |
| -r/12931220 Brandenburger Straße 36;Riedelsberger Weg 2 (Bayreuth) |
| 29 | +$ make out/cats.tsv |
| 30 | +# osmium takes 10 min for the entire planet |
| 31 | +$ cat out/cats.tsv |
| 32 | +n/1573735855 Category:Conservatoire_National_de_V%C3%A9hicules_Historiques |
| 33 | +n/286133524 Category:ref:sprockhoff No. 465 |
| 34 | +n/3022117073 Category:Wildwiesenwarte;Category:Views from the Wildwiesenwarte |
| 35 | +n/306593910 Category:Prince George pub, Brighton Good pictures Advanced... |
| 36 | +n/6426478285 Category:Dorfkirche_Mechow_(Kyritz)?uselang=de |
| 37 | +w/297069904 Category:https://commons.wikimedia.org/wiki/Category:Gr%C3%BCner_Graben_14_(G%C3%B6rlitz) |
| 38 | +w/320276921 Category: Ballyellen Upper Lock |
| 39 | +w/474166824 Category:Nages-et-Solorgues#/media/File:Fontaine_Ranquet.jpg |
| 40 | +r/12931220 Category:Brandenburger Straße 36;Riedelsberger Weg 2 (Bayreuth) |
36 | 41 | ```
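To review the results, each TSV line can be turned into a link to the element on openstreetmap.org. The sketch below assumes the format described above (identifier, a tab, then the tag value) and that the prefixes `n`, `w`, and `r` stand for node, way, and relation. The helper name `parse_tsv_line` is hypothetical and not part of the repository.

```python
from typing import Tuple

# Assumed mapping from the one-letter prefix to the OSM element type.
OSM_TYPES = {"n": "node", "w": "way", "r": "relation"}


def parse_tsv_line(line: str) -> Tuple[str, str]:
    """Split one output line into (OpenStreetMap URL, wikimedia_commons value)."""
    # Split only on the first tab so that tabs inside the value survive.
    osm_id, value = line.rstrip("\n").split("\t", 1)
    kind, number = osm_id.split("/", 1)
    url = f"https://www.openstreetmap.org/{OSM_TYPES[kind]}/{number}"
    return url, value


if __name__ == "__main__":
    with open("out/cats.tsv", encoding="utf-8") as tsv:
        for line in tsv:
            url, value = parse_tsv_line(line)
            print(url, value, sep="\t")
```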
|
37 | 42 |
|
38 | 43 | ### Line-by-Line GeoJSON
|
39 | 44 |
|
40 | 45 | [This format](https://learn.maproulette.org/documentation/line-by-line-geojson/) can be used to create a challenge on maproulette.org. It might lack a few categories that are present in the TSV format.
|
41 | 46 |
|
42 | 47 | ```console
|
43 |
| -# Make sure you have 3GB RAM and 25min to spare |
44 |
| -$ osmium tags-filter -t planet.pbf 'nwr/wikimedia_commons=Category:*' -o planet-filtered.pbf |
45 |
| -$ osmium export planet-filtered.pbf -c config.json -o planet-filtered.geojson -f jsonseq |
46 |
| -$ python main.py out.geojson |
47 |
| -$ cat out.geojson |
48 |
| -{"type":"Feature","geometry":{"type":"Point","coordinates":[-2.0835284,53.3600557]},"properties":{"@type":"node","@id":29947059,"wikimedia_commons":"Category:Help Category:Middlewood railway station"}} |
49 |
| -{"type":"Feature","geometry":{"type":"LineString","coordinates":[[10.1212064,54.3247979],[10.120334100000001,54.3242942],[10.1192733,54.3236981],[10.1199922,54.3233703],[10.1204628,54.3231298],[10.1209243,54.3228965],[10.1222211,54.3236181],[10.1212064,54.3247979]]},"properties":{"@type":"way","@id":9408975,"wikimedia_commons":"Category:Wilhelmplatz (Kiel)"}} |
50 |
| -$ python to_maproulette.py out.geojson |
| 48 | +$ make out/cats.geojson |
| 49 | +# osmium takes 22 min for the entire planet |
| 50 | +$ cat out/cats.geojson |
51 | 51 | {"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Point","coordinates":[-2.0835284,53.3600557]},"properties":{"@type":"node","@id":29947059,"wikimedia_commons":"Category:Help Category:Middlewood railway station"}}]}
|
52 | 52 | {"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"LineString","coordinates":[[10.1212064,54.3247979],[10.120334100000001,54.3242942],[10.1192733,54.3236981],[10.1199922,54.3233703],[10.1204628,54.3231298],[10.1209243,54.3228965],[10.1222211,54.3236181],[10.1212064,54.3247979]]},"properties":{"@type":"way","@id":9408975,"wikimedia_commons":"Category:Wilhelmplatz (Kiel)"}}]}
|
53 | 53 | ```
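Before uploading to MapRoulette, the line-by-line GeoJSON can be checked with a short script. The sketch below assumes every non-empty line is one JSON `FeatureCollection` shaped like the sample output above. The function name `iter_features` is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_features(lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (@type, @id, wikimedia_commons) for each feature, one JSON document per line."""
    for line in lines:
        # Defensive: drop surrounding whitespace and a leading record
        # separator (0x1E) in case one is present.
        line = line.strip().lstrip("\x1e")
        if not line:
            continue
        collection = json.loads(line)
        for feature in collection["features"]:
            props = feature["properties"]
            yield props["@type"], props["@id"], props["wikimedia_commons"]


if __name__ == "__main__":
    with open("out/cats.geojson", encoding="utf-8") as geojson:
        for osm_type, osm_id, value in iter_features(geojson):
            print(osm_type, osm_id, value, sep="\t")
```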
|
| 54 | + |
| 55 | +## License |
| 56 | + |
| 57 | +CommonsChecker4OSM and BBHash are licensed under the MIT license: |
| 58 | + |
| 59 | +Copyright (c) 2015 Guillaume Rizk (BBHash) |
| 60 | +Copyright (c) 2023 Lennard Hofmann (CommonsChecker4OSM) |
| 61 | + |
| 62 | +Permission is hereby granted, free of charge, to any person obtaining a copy |
| 63 | +of this software and associated documentation files (the "Software"), to deal |
| 64 | +in the Software without restriction, including without limitation the rights |
| 65 | +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
| 66 | +copies of the Software, and to permit persons to whom the Software is |
| 67 | +furnished to do so, subject to the following conditions: |
| 68 | + |
| 69 | +The above copyright notice and this permission notice shall be included in all |
| 70 | +copies or substantial portions of the Software. |
| 71 | + |
| 72 | +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
| 73 | +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
| 74 | +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
| 75 | +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
| 76 | +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
| 77 | +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
| 78 | +SOFTWARE. |