
Investigate articles without QID #24

Open · newsch opened this issue Aug 4, 2023 · 5 comments

@newsch (Collaborator) commented Aug 4, 2023

The schema for the Wikipedia Enterprise dumps lists the QID field (main_entity) as optional.
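For reference, that part of the schema deserializes roughly like this (a minimal sketch; only the main_entity and identifier field names come from the schema docs, the rest is illustrative):

```rust
use serde::Deserialize;

// Minimal subset of the Enterprise dump page schema.
#[derive(Deserialize)]
struct Page {
    name: String,
    /// Optional in the schema, so every consumer has to handle `None`.
    main_entity: Option<MainEntity>,
}

#[derive(Deserialize)]
struct MainEntity {
    /// The Wikidata QID, e.g. "Q42".
    identifier: String,
}
```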

All articles should have a QID, but apparently there are cases where they don't.

It's not just articles so minor that they don't have a Wikidata item. Take this sample of errors from the 20230801 dump, for example:

[2023-08-04T17:58:48Z INFO  om_wikiparser] Page without wikidata qid: "Wiriadinata Airport" (https://en.wikipedia.org/wiki/Wiriadinata_Airport)
[2023-08-04T17:59:11Z INFO  om_wikiparser] Page without wikidata qid: "Uptown (Brisbane)" (https://en.wikipedia.org/wiki/Uptown_(Brisbane))

Both articles were edited on 2023-07-31, around when the dump was created.

Is this the main cause of these cases, or is there something else?

Is there some data we can preserve across dumps to prevent this, like keeping old QID links if there is no current one?
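A rough sketch of what that fallback could look like (QidCache, its persistence between runs, and resolve are all hypothetical; nothing like this exists in the parser yet):

```rust
use std::collections::HashMap;

/// Hypothetical cache persisted between runs, mapping article title to
/// the last QID we saw for it in any previous dump.
struct QidCache(HashMap<String, String>);

impl QidCache {
    /// Prefer the QID from the current dump and remember it; if the
    /// current dump has none, fall back to the previously seen QID.
    fn resolve(&mut self, title: &str, current: Option<String>) -> Option<String> {
        match current {
            Some(qid) => {
                self.0.insert(title.to_owned(), qid.clone());
                Some(qid)
            }
            None => self.0.get(title).cloned(),
        }
    }
}
```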

@newsch (Collaborator) commented Aug 11, 2023

Some more examples found while checking simplification:

[2023-08-11T15:15:52Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Springfield railway station (Scotland)" (https://en.wikipedia.org/wiki/Springfield_railway_station_(Scotland))
[2023-08-11T15:15:56Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Estevan Point" (https://en.wikipedia.org/wiki/Estevan_Point)
[2023-08-11T15:16:28Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Paredes Viejas Airport" (https://en.wikipedia.org/wiki/Paredes_Viejas_Airport)
[2023-08-11T15:16:32Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Magellan's Cross" (https://en.wikipedia.org/wiki/Magellan%27s_Cross)

The Springfield railway station (Scotland) article was renamed on 2023-03-29; its content is the correct article HTML.

The Paredes Viejas Airport article was matched by "Marchigüe Paredes Viejas Airport", listed as a redirect in the 2023-04-01 dump. On 2023-03-24 the article was renamed from "Marchigüe Paredes Viejas Airport" to "Paredes Viejas Airport", and the corresponding Wikidata item was updated. The article HTML was still relevant.

The Magellan's Cross and Estevan Point articles are different: neither was renamed around the time the dump was created, and the HTML in both is only the redirect page, not the main article content.

estevan_point.json.txt
magellans_cross.json.txt
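If we want to detect and skip those redirect-only pages, a heuristic along these lines might work (a sketch using the scraper crate; the redirectMsg class name is an assumption based on the attached HTML and may need adjusting):

```rust
use scraper::{Html, Selector};

/// Heuristic sketch: rendered MediaWiki redirect pages carry a
/// `div.redirectMsg` wrapper ("Redirect to: ..."), which real article
/// bodies don't.
fn is_redirect_stub(html: &str) -> bool {
    let doc = Html::parse_document(html);
    let redirect = Selector::parse("div.redirectMsg").unwrap();
    doc.select(&redirect).next().is_some()
}
```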

@biodranik (Member) commented:

Maybe we should report this issue to the people from Wikipedia? Or tag one of them here?

newsch mentioned this issue Aug 17, 2023
newsch added this to the v0.2 milestone Aug 17, 2023
@Vuizur commented May 6, 2024

Maybe relevant: a user on Wikipedia/Wiktionary has been trying for about two years to get Wikimedia to fix errors with the Enterprise dumps (such as quite a lot of missing pages): https://phabricator.wikimedia.org/p/jberkel/

It's still broken now... (On Wiktionary they even considered that the best way forward might be scraping all pages, and had a decent proof of concept, but with the site's light rate limiting it took more than two days, IIRC.)

@biodranik (Member) commented:

@newsch what do our most recent logs show? Are our errors related to that issue?

@newsch (Collaborator) commented May 7, 2024

The logs won't report this.
I disregarded the missing pages issue initially, since the existing articles are left on disk. The errors we log are from articles that aren't simplified, around 144 in the last run.

We can't handle outdated articles right now; the idea I had was to sync the article update time with the file metadata, and skip writing if the incoming article is older (see the sketch below).
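Something like this, using the file's mtime as the stored timestamp (a sketch; the wiring of article_modified from the dump's date_modified field is hypothetical):

```rust
use std::{fs, io, path::Path, time::SystemTime};

/// Only write the article if it is newer than what's already on disk.
fn should_write(path: &Path, article_modified: SystemTime) -> io::Result<bool> {
    match fs::metadata(path) {
        // Existing file: write only if the incoming article is newer.
        Ok(meta) => Ok(meta.modified()? < article_modified),
        // No file yet: always write.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(true),
        Err(e) => Err(e),
    }
}
```

After writing, the file's mtime would need to be set to the article's timestamp (e.g. with the filetime crate) so the comparison holds on the next run.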

As for the duplicates, we have the <head> element heuristic right now, but I don't think it catches everything. I need to do a run with debug logs to figure out what else we can do.
