Add WikiCommons Data Source (Phase 1) #183
Implements automated data fetching from WikiCommons via MediaWiki API using modern shared.py architecture.
Implements automated reporting for WikiCommons processed data. Generates a comprehensive summary table, horizontal bar chart (PNG), and updates quarterly README with key statistics. Includes dry-run mode, data validation, error handling, and logging.
Adds the fetch output CSV for WikiCommons and updates the 2025Q4 README with summary statistics generated by the report script.
IamLRBA
left a comment
Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!
I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?
Thanks @najuna-brian, but I want to ask: why were the 32 license categories chosen? Do they cover all WikiCommons licenses, or are some left out?
For WikiCommons, the licenses actually come in a pretty consistent format (like “CC-BY-4.0” or “CC-BY-SA-3.0”), so the script uses a simple mapping dictionary that converts those into the normal human-readable form; for example, “CC-BY-4.0” becomes “CC BY 4.0”. If something doesn’t match the mapping, it just keeps the original name instead of dropping it, so nothing is lost. Since WikiCommons sticks to a fixed pattern, we don’t really see the longer names like “Creative Commons Attribution 4.0 International”, so the mapping covers everything cleanly.
Thanks for asking. I believe it is more or less the same question @IamLRBA asked earlier. As I replied to @IamLRBA, in the processing step the script uses a small mapping dictionary to convert WikiCommons’ category-style names (like CC-BY-SA-4.0) into the cleaner format CC BY-SA 4.0. Since WikiCommons mostly sticks to those short, structured names, we don’t really get the long descriptive versions like “Creative Commons Attribution 4.0 International.” If something new or unexpected shows up, the script keeps it as it is and logs a warning, so it never loses any data; it just passes the name through unchanged. That is to the best of my knowledge and the research I have done, but if anyone in the community has another idea, please chip in and explain further. Perhaps @TimidRobot can help me out!
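The lookup-with-fallback behaviour described in this reply could be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name `normalize_license` is hypothetical, and the dictionary holds just a few entries taken from the examples in this thread.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset only; the full dictionary lives in the PR's scripts.
LICENSE_NORMALIZATION = {
    "CC-BY-4.0": "CC BY 4.0",
    "CC-BY-SA-3.0": "CC BY-SA 3.0",
    "CC-BY-SA-4.0": "CC BY-SA 4.0",
}


def normalize_license(name: str) -> str:
    """Return the human-readable form, or the input unchanged if unmapped.

    Hypothetical helper: unknown names are logged and passed through so
    that no row is dropped.
    """
    try:
        return LICENSE_NORMALIZATION[name]
    except KeyError:
        logger.warning("Unmapped license name kept as-is: %s", name)
        return name
```

With this shape, a long-form name such as “Creative Commons Attribution 4.0 International” would come back unchanged (with a warning in the log) rather than being merged into “CC BY 4.0”.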
Yes @Lemeri123, the 32 categories cover all the Creative Commons license types used on WikiCommons, across every version (1.0 → 4.0) and combo like BY, BY-SA, BY-NC, and so on. They’re based on WikiCommons’ own category structure, so we capture all CC licenses without pulling in unrelated or legacy ones like old GFDL-only files.
That makes sense, and thanks for clarifying! I’ll take a closer look at that mapping dictionary to better understand how it’s structured.
TimidRobot
left a comment
Great work!
My review is not very in depth; I want to spend more time reviewing this work. However, there are some easy changes you can make in the meantime.
@TimidRobot, thanks so much for the valuable feedback! I understand your points and I can make the adjustments. Once I make these changes, I’ll commit them in separate steps for clarity. Thanks!
Deletes scripts/3-report/ and removes README changes related to reporting. Leaves fetch and process scripts intact for now.
Moved the WikiCommons license normalization dictionary from process_wikicommons.py to shared.py. Updated process_wikicommons.py to import and use shared.LICENSE_NORMALIZATION. Improves code reuse and consistency across fetch and process scripts.
…ons report removed) This commit restores gcs_report.py, github_report.py, and notes.py in scripts/3-report/ from upstream main, while leaving report_wikicommons.py removed. This corrects the unintended deletions from an earlier commit.
…DX mapping reference
This file should be renamed to wikicommons_fetch.py to match naming convention.
Please make this script executable.
@najuna-brian the script still needs to be executable
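For readers unsure what "executable" means here, the sketch below sets the execute bits on a file, roughly what `chmod a+x` does on a POSIX system. The helper name `make_executable` is hypothetical and not part of the repository; the mode change also has to be committed before it shows up in the pull request.

```python
import os
import stat


def make_executable(path: str) -> None:
    """Add the execute bit for user, group, and others (like `chmod a+x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```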
Hello @TimidRobot, I have made some changes as per the reviews.
…Standardized parameter order with encoding before newline
# Hyphenated to CC legal tool identifier mapping
# Except PDM-1.0, follows SPDX identifier. Used by WikiCommons.
LICENSE_NORMALIZATION = {
Please:
- ensure your branch is synchronized with the main branch of this repository
- order/sort the constants
scripts/1-fetch/wikicommons_fetch.py
Outdated
check_for_completion()

session = get_requests_session()
license_data = query_wikicommons(args, session)
Please revisit your queries. There should be far more results than are reported:
data/2025Q4/1-fetch/wikicommons_1_count.csv
"LICENSE","FILE_COUNT","PAGE_COUNT"
"CC BY 4.0","0","0"
"CC BY-SA 4.0","1","0"
"CC BY-NC 4.0","0","0"
"CC BY-NC-SA 4.0","0","0"
"CC BY-NC-ND 4.0","0","0"
"CC BY-ND 4.0","0","0"
"CC BY 3.0","1","0"
"CC BY-SA 3.0","0","0"
"CC BY-NC 3.0","0","0"
"CC BY-NC-SA 3.0","0","0"
"CC BY-NC-ND 3.0","0","0"
"CC BY-ND 3.0","0","0"
"CC BY 2.5","0","0"
"CC BY-SA 2.5","0","0"
"CC BY-NC 2.5","0","0"
"CC BY-NC-SA 2.5","0","0"
"CC BY-NC-ND 2.5","0","0"
"CC BY-ND 2.5","0","0"
"CC BY 2.0","0","0"
"CC BY-SA 2.0","0","0"
"CC BY-NC 2.0","0","0"
"CC BY-NC-SA 2.0","0","0"
"CC BY-NC-ND 2.0","0","0"
"CC BY-ND 2.0","0","0"
"CC BY 1.0","0","0"
"CC BY-SA 1.0","0","0"
"CC BY-NC 1.0","0","0"
"CC BY-NC-SA 1.0","0","0"
"CC BY-NC-ND 1.0","0","0"
"CC BY-ND 1.0","0","0"
"CC0 1.0","0","0"
"Public Domain Mark 1.0","0","0"
Previously returned mostly zeros due to:
- Incorrect category naming (CC BY 4.0 vs Commons CC-BY-4.0)
- Missing recursive subcategory counting
- No error handling for API failures

Now properly:
- Converts license names to Commons category format
- Recursively counts files in category trees
- Handles API errors with exponential backoff
- Validates category existence before querying

Results show 40M+ files vs previous 2 files across tested licenses.
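The three fixes listed above (name conversion, recursive counting, and retry with exponential backoff) might look something like the sketch below. It is not the code from this PR: all three function names are hypothetical, the API call is replaced by an injected `get_info` callable so the logic can be read without a network connection, and the space-to-hyphen rule is assumed to hold only for the "CC ..." names (entries such as "Public Domain Mark 1.0" would need an explicit mapping).

```python
import time
from typing import Callable, List, Optional, Set, Tuple


def to_commons_category(license_name: str) -> str:
    """Convert e.g. 'CC BY-SA 4.0' to the assumed category form 'CC-BY-SA-4.0'."""
    return license_name.replace(" ", "-")


def with_backoff(call: Callable[[], object], retries: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> object:
    """Run call(), retrying on OSError with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(retries):
        try:
            return call()
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


def count_files(category: str,
                get_info: Callable[[str], Tuple[int, List[str]]],
                seen: Optional[Set[str]] = None) -> int:
    """Sum file counts over a category and all of its subcategories.

    get_info(category) is assumed to return (file_count, subcategory_names).
    The seen set guards against cycles in the category graph.
    """
    if seen is None:
        seen = set()
    if category in seen:
        return 0
    seen.add(category)
    files, subcategories = get_info(category)
    return files + sum(count_files(sub, get_info, seen) for sub in subcategories)
```

In a real fetch script, `get_info` would wrap the API request, and each request could be passed through `with_backoff` so that transient failures are retried instead of being recorded as zero.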
Fixes
Description
Technical details
scripts/1-fetch/fetch_wikicommons.py
This collects CC-licensed media counts across 32 license categories via the MediaWiki API.
It also handles pagination, error logging, and structured output in CSV format.
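As an illustration of the structured CSV output, the sketch below writes rows in the fully quoted layout seen in wikicommons_1_count.csv. The helper name `write_counts` and the sample rows are assumptions for the example; the actual script may build its output differently.

```python
import csv
import io

HEADER = ["LICENSE", "FILE_COUNT", "PAGE_COUNT"]


def write_counts(rows, handle):
    """Write (license, file_count, page_count) rows with every field quoted."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    for license_name, file_count, page_count in rows:
        writer.writerow([license_name, file_count, page_count])


# Example with made-up rows, written to an in-memory buffer.
buffer = io.StringIO()
write_counts([("CC BY 4.0", 0, 0), ("CC BY-SA 4.0", 1, 0)], buffer)
print(buffer.getvalue())
```

When writing to a real file rather than a buffer, the `csv` module documentation recommends opening it with `newline=""` so the writer controls line endings.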
Tests
Fetch Phase
python scripts/1-fetch/fetch_wikicommons.py --dry-run
data/2025Q4/1-fetch/wikicommons_1_count.csv is created with 32 license categories.
Checklist
… Update index.md).
… main or master).
… visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."