Investigate articles without QID #24
Comments
Some more examples found while checking simplification:
The Springfield railway station (Scotland) article was renamed on 2023-03-29, and the content is the correct article html. The Paredes Viejas Airport article was matched by "Marchigüe Paredes Viejas Airport", which is listed as a redirect in the 2023-04-01 dump. On 2023-03-24, the article was renamed from "Marchigüe Paredes Viejas Airport" to "Paredes Viejas Airport", and the corresponding wikidata item was updated. The article html was still relevant. The Magellan's Cross and Estevan Point cases are different: neither article was renamed around the time the dump was created, and the html in both is only the redirect page, not the main article content.
Maybe we could report this issue to people at Wikipedia? Or tag one of them here?
Maybe relevant: A user on Wikipedia/Wiktionary has been trying to get Wikimedia to fix errors with the enterprise dumps (such as quite a lot of missing pages) for two years now or so: https://phabricator.wikimedia.org/p/jberkel/ It's still broken now... (On Wiktionary they even considered that the best way forward might be scraping all pages and had a decent proof of concept, but with their slight rate limiting it took more than 2 days IIRC.)
@newsch what do our most recent logs show? Are our errors related to that issue?
The logs won't report this. We can't handle outdated articles right now; the idea I had was to sync the article update time with the file metadata, and skip writing if the incoming article is older. As for the duplicates, we have the
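A rough sketch of the "skip writing if it is older" idea, in Python for illustration only. The function name `write_if_newer` and its parameters are hypothetical, and it assumes the article's last-modified time is available as a timezone-aware datetime:

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def write_if_newer(path: Path, html: str, modified: datetime) -> bool:
    """Write html to path unless the file on disk is already as new as `modified`.

    `modified` is assumed to be timezone-aware (hypothetical input; for
    example parsed from the dump's per-article modification timestamp).
    Returns True if the file was written, False if it was skipped.
    """
    ts = modified.timestamp()
    # The file's mtime doubles as the "article update time" of what we stored.
    if path.exists() and path.stat().st_mtime >= ts:
        return False
    path.write_text(html, encoding="utf-8")
    # Sync the file metadata with the article's update time, not the wall clock.
    os.utime(path, (ts, ts))
    return True
```

This only helps with outdated articles overwriting newer ones; it would not by itself detect the redirect-only html cases.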
The schema for the Wikipedia Enterprise dumps lists the QID field (main_entity) as optional. All articles should have a QID, but apparently there are cases where they don't.
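For illustration, a minimal Python sketch of reading the optional field from one JSON line of a dump. It assumes the QID sits under `main_entity.identifier`; the helper name `extract_qid` is made up for this example:

```python
import json
from typing import Optional


def extract_qid(line: str) -> Optional[str]:
    """Return the QID from one JSON article record, or None if it is absent.

    Assumes the QID is stored as main_entity.identifier; this layout is an
    assumption and should be checked against the actual schema.
    """
    article = json.loads(line)
    # main_entity may be missing or null, so fall back to an empty dict.
    entity = article.get("main_entity") or {}
    return entity.get("identifier")
```

Records where this returns None are the ones this issue is about.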
It's not just articles that are so minor they don't have a wikidata item. In the 20230801 dump for example, out of this sample of errors:
Both articles were edited on 2023-07-31, around when the dump was created:
Is this the main cause of these cases, or is there something else?
Is there some data we can preserve across dumps to prevent this, like keeping old qid links if there is no current one?
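One possible shape for "keeping old qid links", sketched in Python under the assumption that we persist a title-to-QID mapping between dump runs. The name `resolve_qid` and the `previous` mapping are hypothetical, and the QIDs in the usage example are placeholders:

```python
from typing import Dict, Optional


def resolve_qid(title: str, current_qid: Optional[str],
                previous: Dict[str, str]) -> Optional[str]:
    """Prefer the QID from the current dump, else reuse one seen in an earlier dump.

    `previous` is a hypothetical title -> QID mapping carried across runs.
    """
    if current_qid:
        # Refresh the remembered link so later dumps can fall back on it.
        previous[title] = current_qid
        return current_qid
    # No QID in this dump: fall back to the last known one, if any.
    return previous.get(title)


# Usage with placeholder values:
known = {"Example article": "Q1"}
print(resolve_qid("Example article", None, known))  # falls back to "Q1"
```

Keying by title would miss articles renamed between dumps, so a more stable key may be needed if renames turn out to be common.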