
Investigate articles without QID #24

Open · newsch opened this issue Aug 4, 2023 · 5 comments

@newsch (Collaborator) commented Aug 4, 2023

The schema for the Wikipedia Enterprise dumps lists the QID field (main_entity) as optional.
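For reference, that part of the schema deserializes roughly like this (a minimal sketch; only the main_entity and identifier field names come from the schema docs, the rest is illustrative):

```rust
use serde::Deserialize;

// Minimal subset of the Enterprise dump page schema.
#[derive(Deserialize)]
struct Page {
    name: String,
    /// Optional in the schema, so every consumer has to handle `None`.
    main_entity: Option<MainEntity>,
}

#[derive(Deserialize)]
struct MainEntity {
    /// The Wikidata QID, e.g. "Q42".
    identifier: String,
}
```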

All articles should have a QID, but apparently there are cases where they don't.

It's not just articles so minor that they don't have a Wikidata item. Take this sample of errors from the 20230801 dump, for example:

[2023-08-04T17:58:48Z INFO  om_wikiparser] Page without wikidata qid: "Wiriadinata Airport" (https://en.wikipedia.org/wiki/Wiriadinata_Airport)
[2023-08-04T17:59:11Z INFO  om_wikiparser] Page without wikidata qid: "Uptown (Brisbane)" (https://en.wikipedia.org/wiki/Uptown_(Brisbane))

Both articles were edited on 2023-07-31, around when the dump was created.

Is this the main cause of these cases, or is there something else?

Is there some data we can preserve across dumps to prevent this, like keeping old QID links if there is no current one?
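A rough sketch of what that fallback could look like (QidCache, its persistence between runs, and resolve are all hypothetical; nothing like this exists in the parser yet):

```rust
use std::collections::HashMap;

/// Hypothetical cache persisted between runs, mapping article title to
/// the last QID we saw for it in any previous dump.
struct QidCache(HashMap<String, String>);

impl QidCache {
    /// Prefer the QID from the current dump and remember it; if the
    /// current dump has none, fall back to the previously seen QID.
    fn resolve(&mut self, title: &str, current: Option<String>) -> Option<String> {
        match current {
            Some(qid) => {
                self.0.insert(title.to_owned(), qid.clone());
                Some(qid)
            }
            None => self.0.get(title).cloned(),
        }
    }
}
```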

@newsch (Collaborator) commented Aug 11, 2023

Some more examples found while checking simplification:

[2023-08-11T15:15:52Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Springfield railway station (Scotland)" (https://en.wikipedia.org/wiki/Springfield_railway_station_(Scotland))
[2023-08-11T15:15:56Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Estevan Point" (https://en.wikipedia.org/wiki/Estevan_Point)
[2023-08-11T15:16:28Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Paredes Viejas Airport" (https://en.wikipedia.org/wiki/Paredes_Viejas_Airport)
[2023-08-11T15:16:32Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Magellan's Cross" (https://en.wikipedia.org/wiki/Magellan%27s_Cross)

The Springfield railway station (Scotland) article was renamed on 2023-03-29; its content is the correct article HTML.

The Paredes Viejas Airport article was matched by "Marchigüe Paredes Viejas Airport", listed as a redirect in the 2023-04-01 dump. On 2023-03-24 the article was renamed from "Marchigüe Paredes Viejas Airport" to "Paredes Viejas Airport", and the corresponding Wikidata item was updated. The article HTML was still relevant.

The Magellan's Cross and Estevan Point articles are different: neither was renamed around the time the dump was created, and the HTML in both is only the redirect page, not the main article content.

estevan_point.json.txt
magellans_cross.json.txt
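If we want to detect and skip those redirect-only pages, a heuristic along these lines might work (a sketch using the scraper crate; the redirectMsg class name is an assumption based on the attached HTML and may need adjusting):

```rust
use scraper::{Html, Selector};

/// Heuristic sketch: rendered MediaWiki redirect pages carry a
/// `div.redirectMsg` wrapper ("Redirect to: ..."), which real article
/// bodies don't.
fn is_redirect_stub(html: &str) -> bool {
    let doc = Html::parse_document(html);
    let redirect = Selector::parse("div.redirectMsg").unwrap();
    doc.select(&redirect).next().is_some()
}
```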

@biodranik (Member) commented:

Maybe we should report this issue to the people from Wikipedia? Or tag one of them here?

newsch mentioned this issue Aug 17, 2023
newsch added this to the v0.2 milestone Aug 17, 2023
@Vuizur commented May 6, 2024

Maybe relevant: a user on Wikipedia/Wiktionary has been trying for about two years to get Wikimedia to fix errors with the Enterprise dumps (such as quite a lot of missing pages): https://phabricator.wikimedia.org/p/jberkel/

It's still broken now... (On Wiktionary they even considered that the best way forward might be scraping all pages, and had a decent proof of concept, but with the site's light rate limiting it took more than two days, IIRC.)

@biodranik (Member) commented:

@newsch what do our most recent logs show? Are our errors related to that issue?

@newsch (Collaborator) commented May 7, 2024

The logs won't report this.
I disregarded the missing pages issue initially, since the existing articles are left on disk. The errors we log are from articles that aren't simplified, around 144 in the last run.

We can't handle outdated articles right now; the idea I had was to sync the article update time with the file metadata, and skip writing if the incoming article is older (see the sketch below).
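Something like this, using the file's mtime as the stored timestamp (a sketch; the wiring of article_modified from the dump's date_modified field is hypothetical):

```rust
use std::{fs, io, path::Path, time::SystemTime};

/// Only write the article if it is newer than what's already on disk.
fn should_write(path: &Path, article_modified: SystemTime) -> io::Result<bool> {
    match fs::metadata(path) {
        // Existing file: write only if the incoming article is newer.
        Ok(meta) => Ok(meta.modified()? < article_modified),
        // No file yet: always write.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(true),
        Err(e) => Err(e),
    }
}
```

After writing, the file's mtime would need to be set to the article's timestamp (e.g. with the filetime crate) so the comparison holds on the next run.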

As for the duplicates, we have the <head> element heuristic right now, but I don't think it catches everything. I need to do a run with debug logs to figure out what else we can do.
