Add WikiCommons Data Source (Phase 1) #183

najuna-brian · 2025-10-12T19:39:01Z

Fixes

Fixes Add WikiCommons Data Source #180 by @najuna-brian

Description

This PR adds the WikiCommons fetch phase for the Quantifying the Commons project.
It focuses solely on collecting Creative Commons–licensed media counts via the MediaWiki API.
Subsequent processing and reporting phases will be implemented in future PRs.

Technical details

Phase 1: Fetch
- scripts/1-fetch/fetch_wikicommons.py
  This collects CC-licensed media counts across 32 license categories via the MediaWiki API.
  It also handles pagination, error logging, and structured output in CSV format.
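
A minimal sketch of how a per-category count could be requested from the MediaWiki API, assuming the `categoryinfo` property of `action=query` is used. The function names (`build_params`, `parse_counts`, `fetch_counts`) and the User-Agent string are hypothetical and are not taken from fetch_wikicommons.py; the actual script may query the API differently.

```python
import json
import urllib.parse
import urllib.request

# Public Wikimedia Commons API endpoint
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_params(category):
    """Build query parameters asking for one category's categoryinfo."""
    return {
        "action": "query",
        "prop": "categoryinfo",
        "titles": f"Category:{category}",
        "format": "json",
    }


def parse_counts(payload):
    """Extract file and page counts; missing or empty categories yield zeros."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("categoryinfo")
        if info:
            return {
                "files": info.get("files", 0),
                "pages": info.get("pages", 0),
            }
    return {"files": 0, "pages": 0}


def fetch_counts(category, timeout=30):
    """Hypothetical helper: perform the HTTP request and parse the reply."""
    query = urllib.parse.urlencode(build_params(category))
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": "example-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_counts(json.load(response))
```

Keeping the parsing separate from the network call makes the "empty or missing category" path easy to check without contacting the API.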

Tests

Fetch Phase
python scripts/1-fetch/fetch_wikicommons.py --dry-run

This verifies that data/2025Q4/1-fetch/wikicommons_1_count.csv is created with 32 license categories.
It also confirms that the script connects to the MediaWiki API and handles empty or missing categories gracefully.
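
The check described above could be automated roughly as follows. This is only a sketch: `validate_count_csv` is a hypothetical helper, and the header names are taken from the CSV output posted later in this thread.

```python
import csv
import io

# Header as it appears in the fetch output shown in this PR
EXPECTED_HEADER = ["LICENSE", "FILE_COUNT", "PAGE_COUNT"]


def validate_count_csv(handle, expected_rows=32):
    """Check the header, the row count, and that both counts are integers."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    rows = [row for row in reader if row]
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for license_name, file_count, page_count in rows:
        int(file_count)  # raises ValueError on non-numeric counts
        int(page_count)
    return len(rows)


if __name__ == "__main__":
    # Small in-memory example instead of the real data file
    sample = io.StringIO(
        '"LICENSE","FILE_COUNT","PAGE_COUNT"\n"CC BY 4.0","0","0"\n'
    )
    print(validate_count_csv(sample, expected_rows=1))
```

For the real file, the handle would come from something like `open("data/2025Q4/1-fetch/wikicommons_1_count.csv", encoding="utf-8")`.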

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no
visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Implements automated reporting for WikiCommons processed data. Generates a comprehensive summary table, horizontal bar chart (PNG), and updates quarterly README with key statistics. Includes dry-run mode, data validation, error handling, and logging.

IamLRBA

Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!

I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?

Lemeri123 · 2025-10-13T11:22:28Z

Thanks @najuna-brian but I want to ask, why were the 32 license categories chosen? Do they cover all WikiCommons licenses, or are some left out?

najuna-brian · 2025-10-13T11:24:46Z

> Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!
>
> I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?

For WikiCommons, the licenses actually come in a pretty consistent format (like “CC-BY-4.0” or “CC-BY-SA-3.0”), so the script uses a simple mapping dictionary that converts those into the normal human-readable form. For example, “CC-BY-4.0” becomes “CC BY 4.0”.

If something doesn’t match the mapping, it just keeps the original name instead of dropping it, so nothing is lost. Since WikiCommons sticks to a fixed pattern, we don’t really see the longer names like “Creative Commons Attribution 4.0 International”, so the mapping covers everything cleanly.

najuna-brian · 2025-10-13T11:35:13Z

> Hello @najuna-brian, I wanted clarification on something. How exactly does process_wikicommons.py handle different versions of license names (e.g., "CC BY 4.0" vs. "Creative Commons Attribution 4.0")? What happens if an unknown license shows up?

Thanks for asking and I believe it is more or less the same question @IamLRBA asked earlier.

As I replied to @IamLRBA, in the processing step the script uses a small mapping dictionary to convert WikiCommons’ category-style names (like CC-BY-SA-4.0) into the cleaner format (CC BY-SA 4.0).

Since WikiCommons mostly sticks to those short, structured names, we don’t really get the long descriptive versions like “Creative Commons Attribution 4.0 International.”

If something new or unexpected shows up, the script keeps it as is and logs a warning, so no data is lost: the name simply passes through unchanged.

That is to the best of my knowledge and the research I have done. If anyone in the community has another idea, please chip in and explain further; perhaps @TimidRobot can help me out!
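
The mapping-with-fallback behavior described in this reply can be sketched as below. The dictionary is an illustrative subset only, and `normalize_license` is a hypothetical name; the real mapping in this PR may contain different entries.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Illustrative subset (assumption): category-style name -> readable name
LICENSE_NORMALIZATION = {
    "CC-BY-4.0": "CC BY 4.0",
    "CC-BY-SA-3.0": "CC BY-SA 3.0",
    "CC-BY-SA-4.0": "CC BY-SA 4.0",
}


def normalize_license(name, mapping=LICENSE_NORMALIZATION):
    """Return the readable name, or the input unchanged if it is unknown."""
    normalized = mapping.get(name)
    if normalized is None:
        # Unknown names are kept rather than dropped, and a warning is logged
        LOGGER.warning("Unknown license name kept as-is: %s", name)
        return name
    return normalized
```

With this sketch, `normalize_license("CC-BY-SA-4.0")` gives `"CC BY-SA 4.0"`, while a long descriptive title passes through unchanged and triggers a warning.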

najuna-brian · 2025-10-13T11:39:44Z

> Thanks @najuna-brian but I want to ask, why were the 32 license categories chosen? Do they cover all WikiCommons licenses, or are some left out?

Yes @Lemeri123, the 32 categories cover all the Creative Commons license types used on WikiCommons, across every version (1.0 → 4.0) and combinations such as BY, BY-SA, BY-NC, and so on.

They’re based on WikiCommons’ own category structure, so we capture all CC licenses without pulling in unrelated or legacy ones like old GFDL-only files.
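
For reference, the count of 32 matches six license combinations across five versions, plus CC0 and the Public Domain Mark, which is how the fetch output in this PR is laid out. A small sketch that rebuilds that list (the names `VERSIONS`, `COMBOS`, and `build_license_list` are hypothetical):

```python
from itertools import product

# Versions and combinations as they appear in the fetch output of this PR
VERSIONS = ["4.0", "3.0", "2.5", "2.0", "1.0"]
COMBOS = ["BY", "BY-SA", "BY-NC", "BY-NC-SA", "BY-NC-ND", "BY-ND"]


def build_license_list():
    """Return 6 combinations x 5 versions, plus the two public domain tools."""
    licenses = [
        f"CC {combo} {version}"
        for version, combo in product(VERSIONS, COMBOS)
    ]
    licenses += ["CC0 1.0", "Public Domain Mark 1.0"]
    return licenses
```

`len(build_license_list())` is 32. Whether these 32 cover every CC category that exists on WikiCommons is the author's claim above and is not verified by this sketch.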

IamLRBA · 2025-10-13T16:39:56Z

> > Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!
> > I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?
>
> For WikiCommons, the licenses actually come in a pretty consistent format (like “CC-BY-4.0” or “CC-BY-SA-3.0”), so the script uses a simple mapping dictionary that converts those into the normal human-readable form, for example, “CC-BY-4.0” becomes “CC BY 4.0”.
>
> If something doesn’t match the mapping, it just keeps the original name instead of dropping it, so nothing is lost. Since WikiCommons sticks to a fixed pattern, we don’t really see the longer names like “Creative Commons Attribution 4.0 International”, so the mapping covers everything cleanly.

That makes sense and thanks for clarifying!
I didn’t realize WikiCommons had such consistent license formatting. I must say, it’s neat that the script falls back to keeping the original name instead of dropping it; that feels like a safe design choice.

I’ll take a closer look at that mapping dictionary to better understand how it’s structured.

TimidRobot

Great work!

My review is not very in depth, I want to spend more time reviewing this work. However, there's some easy changes you can make in the meantime.

data/2025Q4/1-fetch/wikicommons_1_count.csv

data/2025Q4/2-process/wikicommons_2_processed.csv

data/2025Q4/3-report/wikicommons_summary.png

data/2025Q4/README.md

scripts/2-process/process_wikicommons.py

najuna-brian · 2025-10-14T15:51:06Z

> Great work!
>
> My review is not very in depth, I want to spend more time reviewing this work. However, there's some easy changes you can make in the meantime.

@TimidRobot, thanks so much for the valuable feedback!

I understand your points and I can make the adjustments. Once I make these changes, I’ll commit them in separate steps for clarity.

Thanks!

Moved the WikiCommons license normalization dictionary from process_wikicommons.py to shared.py. Updated process_wikicommons.py to import and use shared.LICENSE_NORMALIZATION. Improves code reuse and consistency across fetch and process scripts.

…ons report removed) This commit restores gcs_report.py, github_report.py, and notes.py in scripts/3-report/ from upstream main, while leaving report_wikicommons.py removed. This corrects the unintended deletions from the earlier commit.

scripts/shared.py

scripts/1-fetch/fetch_wikicommons.py

scripts/1-fetch/wikicommons_fetch.py

scripts/1-fetch/fetch_wikicommons.py

TimidRobot · 2025-10-29T10:52:39Z

scripts/1-fetch/wikicommons_fetch.py

This file should be renamed to wikicommons_fetch.py to match naming convention.

Please make this script executable.

References:

https://github.com/creativecommons/quantifying#running-the-scripts

https://opensource.creativecommons.org/contributing-code/foundational-tech/#file-permissions

@najuna-brian the script still needs to be executable

najuna-brian · 2025-10-30T12:38:15Z

Hello @TimidRobot, I have made some changes as per the reviews.
Thanks!

data/2025Q4/README.md

scripts/1-fetch/wikicommons_fetch.py

TimidRobot · 2025-10-31T06:45:29Z

scripts/shared.py

+
+# Hyphenated to CC legal tool identifier mapping
+# Except PDM-1.0, follows SPDX identifier. Used by WikiCommons.
+LICENSE_NORMALIZATION = {


Please:
- ensure your branch is synchronized with the main branch of this repository
- order/sort the constants

TimidRobot · 2025-10-31T06:53:06Z

scripts/1-fetch/wikicommons_fetch.py

+    check_for_completion()
+
+    session = get_requests_session()
+    license_data = query_wikicommons(args, session)


Please revisit your queries. There should be far more results than are reported:

data/2025Q4/1-fetch/wikicommons_1_count.csv

"LICENSE","FILE_COUNT","PAGE_COUNT"
"CC BY 4.0","0","0"
"CC BY-SA 4.0","1","0"
"CC BY-NC 4.0","0","0"
"CC BY-NC-SA 4.0","0","0"
"CC BY-NC-ND 4.0","0","0"
"CC BY-ND 4.0","0","0"
"CC BY 3.0","1","0"
"CC BY-SA 3.0","0","0"
"CC BY-NC 3.0","0","0"
"CC BY-NC-SA 3.0","0","0"
"CC BY-NC-ND 3.0","0","0"
"CC BY-ND 3.0","0","0"
"CC BY 2.5","0","0"
"CC BY-SA 2.5","0","0"
"CC BY-NC 2.5","0","0"
"CC BY-NC-SA 2.5","0","0"
"CC BY-NC-ND 2.5","0","0"
"CC BY-ND 2.5","0","0"
"CC BY 2.0","0","0"
"CC BY-SA 2.0","0","0"
"CC BY-NC 2.0","0","0"
"CC BY-NC-SA 2.0","0","0"
"CC BY-NC-ND 2.0","0","0"
"CC BY-ND 2.0","0","0"
"CC BY 1.0","0","0"
"CC BY-SA 1.0","0","0"
"CC BY-NC 1.0","0","0"
"CC BY-NC-SA 1.0","0","0"
"CC BY-NC-ND 1.0","0","0"
"CC BY-ND 1.0","0","0"
"CC0 1.0","0","0"
"Public Domain Mark 1.0","0","0"

Previously returned mostly zeros due to:
- Incorrect category naming (CC BY 4.0 vs Commons CC-BY-4.0)
- Missing recursive subcategory counting
- No error handling for API failures

Now properly:
- Converts license names to Commons category format
- Recursively counts files in category trees
- Handles API errors with exponential backoff
- Validates category existence before querying

Results show 40M+ files vs previous 2 files across tested licenses.
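
Two of the fixes listed in this commit message, recursive counting over a category tree and retrying with exponential backoff, can be sketched as follows. This is a hedged illustration, not the code in wikicommons_fetch.py: `count_files_recursive`, `with_backoff`, the `get_category` callback, and the depth limit are all assumptions made for the example.

```python
import time


def count_files_recursive(category, get_category, seen=None, max_depth=5):
    """Sum file counts over a category tree.

    get_category(name) is assumed to return (file_count, [subcategory names]).
    The seen set guards against cycles between categories.
    """
    if seen is None:
        seen = set()
    if category in seen or max_depth < 0:
        return 0
    seen.add(category)
    files, subcategories = get_category(category)
    total = files
    for subcategory in subcategories:
        total += count_files_recursive(
            subcategory, get_category, seen, max_depth - 1
        )
    return total


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry call() on OSError, doubling the delay after each failure."""
    for attempt in range(retries + 1):
        try:
            return call()
        except OSError:
            if attempt == retries:
                raise
            sleep(base_delay * 2**attempt)
```

One caveat with this approach: a file that belongs to several subcategories is counted once per subcategory, so totals computed this way can overstate the number of distinct files.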

najuna-brian added 4 commits October 12, 2025 21:17

Add WikiCommons fetch script (Phase 1)

69c8982

Implements automated data fetching from WikiCommons via MediaWiki API using modern shared.py architecture.

Add WikiCommons process script (Phase 2)

70340cd

Include Phase 1 CSV and updated quarterly README

c1ddb36

Adds the fetch output CSV for WikiCommons and updates the 2025Q4 README with summary statistics generated by the report script.

najuna-brian requested review from a team as code owners October 12, 2025 19:39

najuna-brian requested review from Shafiya-Heena and TimidRobot and removed request for a team October 12, 2025 19:39

cc-open-source-bot moved this to In review in TimidRobot Oct 12, 2025

cc-open-source-bot added this to TimidRobot Oct 12, 2025

This comment was marked as duplicate.

IamLRBA reviewed Oct 13, 2025

TimidRobot requested changes Oct 14, 2025

TimidRobot self-assigned this Oct 14, 2025

najuna-brian and others added 4 commits October 14, 2025 19:02

Remove report files and updates

1380858

Deletes scripts/3-report/ and removes README changes related to reporting. Leaves fetch and process scripts intact for now.

Merge branch 'creativecommons:main' into main

8105eae

TimidRobot reviewed Oct 15, 2025

scripts/shared.py (outdated, resolved)

This was referenced Oct 16, 2025

Add Wikipedia as data source #167

Merged

Add Openverse Fetch Script (Initial Implementation) #185

Merged

docs(shared): updated the license normalization comment to clarify SP…

0c6650d

…DX mapping reference

TimidRobot requested changes Oct 17, 2025

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

scripts/1-fetch/wikicommons_fetch.py (resolved)

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

Babi-B reviewed Oct 28, 2025

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

najuna-brian changed the title ~~Automate WikiCommons Data Source (Phases 1-3)~~ Automate WikiCommons Data Source (Phases 1) Oct 28, 2025

najuna-brian changed the title ~~Automate WikiCommons Data Source (Phases 1)~~ Automate WikiCommons Data Source (Phase 1) Oct 28, 2025

TimidRobot reviewed Oct 29, 2025

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

TimidRobot reviewed Oct 29, 2025

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

TimidRobot reviewed Oct 29, 2025

scripts/1-fetch/fetch_wikicommons.py (outdated, resolved)

TimidRobot reviewed Oct 29, 2025

najuna-brian added 9 commits October 30, 2025 14:19

Remove unnecessary try block causing syntax error

3bb2638

Sort constants alphabetically

1ec4388

Move script execution log to main function

c16d094

Update backoff_factor to 10 for consistency

75c6b87

Remove newline parameter for reading to use universal newlines

73cf1b6

use explicit newline for writing CSV files

8403519

Rename to wikicommons_fetch.py

b46057a

Name correction

3fcb3c7

Fixed Indention

ed4775d

TimidRobot reviewed Oct 30, 2025

data/2025Q4/README.md (outdated, resolved)

TimidRobot reviewed Oct 30, 2025

scripts/1-fetch/wikicommons_fetch.py (outdated, resolved)

TimidRobot reviewed Oct 30, 2025

scripts/1-fetch/wikicommons_fetch.py (outdated, resolved)

TimidRobot reviewed Oct 30, 2025

scripts/1-fetch/wikicommons_fetch.py (outdated, resolved)

najuna-brian added 3 commits October 30, 2025 19:35

Removed data/2025Q4/README.md to from the PR

df71aa1

Change to return a default dictionary instead of None

63158e6

Added encoding parameter to file reading in check_for_completion and …

f4a793a

…Standardized parameter order with encoding before newline

TimidRobot reviewed Oct 31, 2025

TimidRobot changed the title ~~Automate WikiCommons Data Source (Phase 1)~~ Add WikiCommons Data Source (Phase 1) Nov 6, 2025

najuna-brian and others added 3 commits November 6, 2025 12:53

Restoring backoff_factor to 10

8177996

Merge branch 'main' into main

bdc8c8e

5 participants

najuna-brian commented Oct 12, 2025 •

edited by TimidRobot

Loading

najuna-brian commented Oct 13, 2025 •

edited

Loading