@najuna-brian najuna-brian commented Oct 12, 2025

Fixes

Description

  • This PR adds the WikiCommons fetch phase for the Quantifying the Commons project.
  • It focuses solely on collecting Creative Commons–licensed media counts via the MediaWiki API.
  • Subsequent processing and reporting phases will be implemented in future PRs.

Technical details

  • Phase 1: Fetch
    • scripts/1-fetch/fetch_wikicommons.py
      This collects CC-licensed media counts across 32 license categories via the MediaWiki API.
      It also handles pagination, error logging, and structured output in CSV format.
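
For readers unfamiliar with the API, a per-category count lookup could look roughly like the sketch below. This is not the code in fetch_wikicommons.py: the helper names `build_params` and `parse_counts` are hypothetical, pagination and the HTTP call itself are left out, and the sketch assumes the standard MediaWiki `action=query` request with `prop=categoryinfo`, whose response reports `files`, `pages`, and `subcats` for each category.

```python
# Illustrative sketch only; helper names are hypothetical, not from the PR.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_params(category):
    """Build query parameters for a categoryinfo lookup on one category."""
    return {
        "action": "query",
        "prop": "categoryinfo",
        "titles": f"Category:{category}",
        "format": "json",
    }


def parse_counts(response_json):
    """Return (files, pages) from a categoryinfo response.

    A missing or empty category carries no "categoryinfo" key, so it is
    reported as (0, 0) instead of raising.
    """
    pages = response_json.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("categoryinfo", {})
        return info.get("files", 0), info.get("pages", 0)
    return 0, 0


# Made-up response in the default JSON layout (pages keyed by page id).
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "Category:CC-BY-4.0",
                "categoryinfo": {"size": 12, "pages": 2, "files": 9, "subcats": 1},
            }
        }
    }
}
print(parse_counts(sample))  # (9, 2)
```

Keeping the parsing separate from the request makes the empty and missing category cases easy to exercise without network access.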

Tests

Fetch Phase
python scripts/1-fetch/fetch_wikicommons.py --dry-run

  • This verifies that data/2025Q4/1-fetch/wikicommons_1_count.csv is created with 32 license categories.
  • It also confirms that the script connects to the MediaWiki API and handles empty or missing categories gracefully.
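
The output check described above could also be scripted. The sketch below is an assumption about how one might do it, not part of this PR: `check_count_csv` is a hypothetical helper, and the header is taken from the CSV excerpt quoted later in this conversation.

```python
import csv
import io

# Header as it appears in the CSV excerpt quoted in the review thread.
EXPECTED_HEADER = ["LICENSE", "FILE_COUNT", "PAGE_COUNT"]


def check_count_csv(handle, expected_rows=32):
    """Return a list of problems found in the CSV; an empty list means none."""
    problems = []
    rows = list(csv.reader(handle))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("unexpected header")
    data = rows[1:]
    if len(data) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(data)}")
    for row in data:
        # Each data row should be a label followed by two non-negative integers.
        if len(row) != 3 or not row[1].isdigit() or not row[2].isdigit():
            problems.append(f"malformed row: {row}")
    return problems


# Tiny in-memory example with a single data row.
example = '"LICENSE","FILE_COUNT","PAGE_COUNT"\n"CC BY 4.0","0","0"\n'
print(check_count_csv(io.StringIO(example), expected_rows=1))  # []
```

Against the real file, the same function could be called with `open("data/2025Q4/1-fetch/wikicommons_1_count.csv", newline="")` as the handle.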

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Implements automated data fetching from WikiCommons via MediaWiki API using modern shared.py architecture.
Implements automated reporting for WikiCommons processed data.
Generates a comprehensive summary table, horizontal bar chart (PNG),
and updates quarterly README with key statistics. Includes
dry-run mode, data validation, error handling, and logging.
Adds the fetch output CSV for WikiCommons and updates the 2025Q4 README
with summary statistics generated by the report script.
@najuna-brian najuna-brian requested review from a team as code owners October 12, 2025 19:39
@najuna-brian najuna-brian requested review from Shafiya-Heena and TimidRobot and removed request for a team October 12, 2025 19:39
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 12, 2025
@najuna-brian

This comment was marked as duplicate.

@IamLRBA IamLRBA left a comment

Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!

I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?

@Lemeri123

Thanks @najuna-brian but I want to ask, why were the 32 license categories chosen? Do they cover all WikiCommons licenses, or are some left out?

@najuna-brian
Author

Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!

I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?

For WikiCommons, the licenses actually come in a pretty consistent format (like “CC-BY-4.0” or “CC-BY-SA-3.0”), so the script uses a simple mapping dictionary that converts those into the normal human-readable form, for example, “CC-BY-4.0” becomes “CC BY 4.0”.

If something doesn’t match the mapping, it just keeps the original name instead of dropping it, so nothing is lost. Since WikiCommons sticks to a fixed pattern, we don’t really see the longer names like “Creative Commons Attribution 4.0 International”, so the mapping covers everything cleanly.

@najuna-brian
Author

Hello @najuna-brian, I wanted clarification on something: how exactly does process_wikicommons.py handle different versions of license names (e.g., "CC BY 4.0" vs. "Creative Commons Attribution 4.0")? What happens if an unknown license shows up?

Thanks for asking. I believe this is essentially the same question @IamLRBA asked earlier.

As I replied to @IamLRBA, in the processing step the script uses a small mapping dictionary to convert WikiCommons’ category-style names (like CC-BY-SA-4.0) into the cleaner format CC BY-SA 4.0.

Since WikiCommons mostly sticks to those short, structured names, we don’t really get the long descriptive versions like “Creative Commons Attribution 4.0 International.”

If something new or unexpected shows up, the script keeps it as it is and logs a warning. It never loses any data; it just passes the name through unchanged.

That is to the best of my knowledge and the research I have done. If anyone in the community has another idea, please chip in and explain further. Perhaps @TimidRobot can help me out!
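
To make the behavior described in this reply concrete, here is a minimal sketch of a mapping lookup with a pass-through fallback and a warning. It is an illustration under assumptions, not the PR's code: `EXAMPLE_NORMALIZATION` is a hypothetical three-entry subset and `normalize_license` is a hypothetical helper name.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Hypothetical subset; the actual dictionary in the PR is not reproduced here.
EXAMPLE_NORMALIZATION = {
    "CC-BY-4.0": "CC BY 4.0",
    "CC-BY-SA-4.0": "CC BY-SA 4.0",
    "CC-BY-SA-3.0": "CC BY-SA 3.0",
}


def normalize_license(name, mapping=EXAMPLE_NORMALIZATION):
    """Map a category-style name to its display form.

    Unknown names are logged and returned unchanged, so no row is dropped.
    """
    try:
        return mapping[name]
    except KeyError:
        LOGGER.warning("Unmapped license name kept as-is: %s", name)
        return name


print(normalize_license("CC-BY-SA-4.0"))  # CC BY-SA 4.0
print(normalize_license("Some-New-License"))  # Some-New-License
```

Passing unknown names through keeps the row counts intact, at the cost of letting an unnormalized label reach later phases; the warning is what makes that visible.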

@najuna-brian
Author

najuna-brian commented Oct 13, 2025

Thanks @najuna-brian but I want to ask, why were the 32 license categories chosen? Do they cover all WikiCommons licenses, or are some left out?

Yes @Lemeri123, the 32 categories cover all the Creative Commons license types used on WikiCommons, across every version (1.0 → 4.0) and every combination, such as BY, BY-SA, BY-NC, and so on.

They’re based on WikiCommons’ own category structure, so we capture all CC licenses without pulling in unrelated or legacy ones like old GFDL-only files.
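
The figure of 32 is consistent with six license combinations across five versions, plus CC0 and Public Domain Mark, which is how the CSV quoted later in this conversation lists them. The sketch below enumerates the labels under that assumed breakdown; the names `VERSIONS`, `COMBOS`, and `license_labels` are hypothetical and not taken from the script.

```python
from itertools import product

# Assumed breakdown, matching the labels in the CSV excerpt in this thread.
VERSIONS = ["4.0", "3.0", "2.5", "2.0", "1.0"]
COMBOS = ["BY", "BY-SA", "BY-NC", "BY-NC-SA", "BY-NC-ND", "BY-ND"]


def license_labels():
    """Return display labels: 5 versions x 6 combinations, plus two extras."""
    labels = [
        f"CC {combo} {version}" for version, combo in product(VERSIONS, COMBOS)
    ]
    labels += ["CC0 1.0", "Public Domain Mark 1.0"]
    return labels


print(len(license_labels()))  # 32
```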

@IamLRBA

IamLRBA commented Oct 13, 2025

Thanks so much for this work @najuna-brian, it’s really cool to see the WikiCommons integration!
I’m still wrapping my head around how the license normalization works in the processing step. In process_wikicommons.py, how do you handle cases where the same license is titled differently (for example “CC BY 4.0” vs “Creative Commons Attribution 4.0 International”)? Do you have a map of canonical names somewhere, or is there a fallback rule?

For WikiCommons, the licenses actually come in a pretty consistent format (like “CC-BY-4.0” or “CC-BY-SA-3.0”), so the script uses a simple mapping dictionary that converts those into the normal human-readable form, for example, “CC-BY-4.0” becomes “CC BY 4.0”.

If something doesn’t match the mapping, it just keeps the original name instead of dropping it, so nothing is lost. Since WikiCommons sticks to a fixed pattern, we don’t really see the longer names like “Creative Commons Attribution 4.0 International”, so the mapping covers everything cleanly.

That makes sense and thanks for clarifying!
I didn’t realize WikiCommons had such consistent license formatting. I must say, it’s neat that the script falls back to keeping the original name instead of dropping it. That feels like a safe design choice.

I’ll take a closer look at that mapping dictionary to better understand how it’s structured.

Member

@TimidRobot TimidRobot left a comment

Great work!

My review is not very in depth; I want to spend more time reviewing this work. However, there are some easy changes you can make in the meantime.

@TimidRobot TimidRobot self-assigned this Oct 14, 2025
@najuna-brian
Author

Great work!

My review is not very in depth; I want to spend more time reviewing this work. However, there are some easy changes you can make in the meantime.

@TimidRobot, thanks so much for the valuable feedback!

I understand your points and I can make the adjustments. Once I make these changes, I’ll commit them in separate steps for clarity.

Thanks!

najuna-brian and others added 4 commits October 14, 2025 19:02
Deletes scripts/3-report/ and removes README changes related to reporting.
Leaves fetch and process scripts intact for now.
Moved the WikiCommons license normalization dictionary from process_wikicommons.py to shared.py.
Updated process_wikicommons.py to import and use shared.LICENSE_NORMALIZATION.
Improves code reuse and consistency across fetch and process scripts.
…ons report removed)

This commit restores gcs_report.py, github_report.py, and
notes.py in scripts/3-report/ from upstream main, while leaving
report_wikicommons.py removed. This corrects the unintended deletions from an earlier commit.
@najuna-brian najuna-brian changed the title Automate WikiCommons Data Source (Phases 1-3) Automate WikiCommons Data Source (Phases 1) Oct 28, 2025
@najuna-brian najuna-brian changed the title Automate WikiCommons Data Source (Phases 1) Automate WikiCommons Data Source (Phase 1) Oct 28, 2025
Member

This file should be renamed to wikicommons_fetch.py to match naming convention.

Please make this script executable.

References:

Member

@najuna-brian the script still needs to be executable

@najuna-brian
Author

Hello @TimidRobot, I have made some changes as per the reviews.
Thanks!


# Hyphenated to CC legal tool identifier mapping
# Except PDM-1.0, follows SPDX identifier. Used by WikiCommons.
LICENSE_NORMALIZATION = {
Member

Please:

  1. ensure your branch is synchronized with the main branch of this repository
  2. order/sort the constants

check_for_completion()

session = get_requests_session()
license_data = query_wikicommons(args, session)
Member

Please revisit your queries. There should be far more results than are reported:

data/2025Q4/1-fetch/wikicommons_1_count.csv
"LICENSE","FILE_COUNT","PAGE_COUNT"
"CC BY 4.0","0","0"
"CC BY-SA 4.0","1","0"
"CC BY-NC 4.0","0","0"
"CC BY-NC-SA 4.0","0","0"
"CC BY-NC-ND 4.0","0","0"
"CC BY-ND 4.0","0","0"
"CC BY 3.0","1","0"
"CC BY-SA 3.0","0","0"
"CC BY-NC 3.0","0","0"
"CC BY-NC-SA 3.0","0","0"
"CC BY-NC-ND 3.0","0","0"
"CC BY-ND 3.0","0","0"
"CC BY 2.5","0","0"
"CC BY-SA 2.5","0","0"
"CC BY-NC 2.5","0","0"
"CC BY-NC-SA 2.5","0","0"
"CC BY-NC-ND 2.5","0","0"
"CC BY-ND 2.5","0","0"
"CC BY 2.0","0","0"
"CC BY-SA 2.0","0","0"
"CC BY-NC 2.0","0","0"
"CC BY-NC-SA 2.0","0","0"
"CC BY-NC-ND 2.0","0","0"
"CC BY-ND 2.0","0","0"
"CC BY 1.0","0","0"
"CC BY-SA 1.0","0","0"
"CC BY-NC 1.0","0","0"
"CC BY-NC-SA 1.0","0","0"
"CC BY-NC-ND 1.0","0","0"
"CC BY-ND 1.0","0","0"
"CC0 1.0","0","0"
"Public Domain Mark 1.0","0","0"

@TimidRobot TimidRobot changed the title Automate WikiCommons Data Source (Phase 1) Add WikiCommons Data Source (Phase 1) Nov 6, 2025
najuna-brian and others added 3 commits November 6, 2025 12:53
Previously returned mostly zeros due to:
- Incorrect category naming (CC BY 4.0 vs Commons CC-BY-4.0)
- Missing recursive subcategory counting
- No error handling for API failures

Now properly:
- Converts license names to Commons category format
- Recursively counts files in category trees
- Handles API errors with exponential backoff
- Validates category existence before querying

Results show 40M+ files vs previous 2 files across tested licenses.
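
The fixes listed in this commit message can be pictured with the sketch below. It is a simplified illustration, not the committed code: `to_category`, `with_backoff`, and `count_files` are hypothetical names, and the two lookups are injected as plain callables so the traversal can be exercised without network access.

```python
import time


def to_category(label):
    """Convert a display label such as "CC BY-SA 4.0" to "CC-BY-SA-4.0".

    Simplification: labels like "Public Domain Mark 1.0" would need an
    explicit mapping rather than this blanket replacement.
    """
    return label.replace(" ", "-")


def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry call() on failure, doubling the delay after each attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


def count_files(category, get_info, get_subcats, seen=None):
    """Sum file counts over a category and all of its subcategories.

    get_info(category) returns the direct file count, or None when the
    category does not exist; get_subcats(category) returns a list of
    subcategory names.
    """
    seen = set() if seen is None else seen
    if category in seen:  # guard against cycles and shared subcategories
        return 0
    seen.add(category)
    files = with_backoff(lambda: get_info(category))
    if files is None:  # category does not exist: nothing to count
        return 0
    total = files
    for sub in with_backoff(lambda: get_subcats(category)):
        total += count_files(sub, get_info, get_subcats, seen)
    return total
```

The `seen` set matters because category graphs can contain cycles or reach the same subcategory by several paths; without it the traversal could loop forever or count a subtree more than once. A file that sits in two different subcategories would still be counted twice by this sketch.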
Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add WikiCommons Data Source

5 participants