sources.md
30 additions & 28 deletions
@@ -121,6 +121,36 @@ and access towards related query data using a programmable search engine.
 - Data available through JSON format


+## Internet Archive
+
+**Description:**
+The Internet Archive is a nonprofit digital library offering free access to millions of digital materials including books, movies, software, music, and websites. This project uses the Internet Archive’s Session and Search API to fetch metadata of items that reference Creative Commons licenses.
+
+**API documentation link:**
+-[InternetArchive Tools and APIs](https://archive.org/developers/index-apis.html)
+-[InternetArchive: A Python Interface to archive.org](https://internetarchive.readthedocs.io/en/stable/internetarchive.html)
+-[The Internet Archive Python Library](https://archive.org/developers/internetarchive/)
+-[The Internet Archive Search API reference](https://archive.org/advancedsearch.php)
+-[A Python interface to archive.org.](https://pypi.org/project/internetarchive/)
+- Pagination supported via rows and start parameters
+- Python access via internetarchive library (search_items, ArchiveSession)
+- Query limit: None specified, but rate-limiting may apply (1000000 max at a time)
+- Data available through JSON format
+- Retry logic and session management implemented for reliability
+
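The API bullets in this hunk mention `rows`/`start` pagination and retry logic. As a rough illustration only, the sketch below builds a page URL and walks pages with a simple retry wrapper. It uses the standard library rather than the `internetarchive` package the project relies on, so it is not the PR's implementation. The helper names (`build_search_url`, `with_retries`, `iter_docs`), the `fl[]` and `output=json` parameters, and the linear backoff are assumptions for illustration.

```python
import time
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields, rows=100, start=0):
    """Build one page's URL using the rows/start pagination parameters."""
    params = [("q", query)]
    # Assumption: returned fields are requested with repeated "fl[]" keys.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("start", start), ("output", "json")]
    return f"{SEARCH_URL}?{urlencode(params)}"


def with_retries(func, attempts=3, delay=1.0):
    """Call func(), sleeping delay * attempt between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:  # broad on purpose; a real script would narrow this
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)


def iter_docs(fetch_page, rows=100):
    """Yield documents page by page until a short or empty page arrives.

    fetch_page(start, rows) is a caller-supplied (hypothetical) function
    that returns the list of documents for one page.
    """
    start = 0
    while True:
        docs = with_retries(lambda: fetch_page(start, rows))
        yield from docs
        if len(docs) < rows:
            break
        start += rows
```

Passing the page fetcher in as a function keeps the paging loop testable without network access; in practice `fetch_page` would request the URL from `build_search_url` and return the documents from the JSON response.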
+**Notes:**
+- This project queries for items containing `text:creativecommons.org` in their metadata.
+- The script extracts and normalizes license URLs and language codes
+- In summary, it queries licenseurl and language fields for all items containing "creativecommons.org" in their metadata
+- Aggregated counts are saved to CSV files for licenses and languages.
+- License normalization uses a canonical mapping defined in `license_url_to_identifier_mapping.csv`.
+- Language normalization using Babel and [iso-639](https://pypi.org/project/iso639-lang/) see [github information](https://github.com/jacksonllee/iso639), see also [iso-639 standards](https://www.loc.gov/standards/iso639-2/), you can also checkout [iso639-2](https://www.loc.gov/standards/iso639-2/php/English_list.php)
+
+
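The notes above describe normalizing license URLs against a canonical mapping and saving aggregated counts to CSV. The following is a minimal sketch of that idea, not the project's script. The two-entry `LICENSE_MAP`, the normalization rules, the `UNKNOWN` bucket, and the `LICENSE`/`COUNT` header are all hypothetical; the real mapping lives in `license_url_to_identifier_mapping.csv`, whose layout is not shown in this diff. Language normalization with Babel and iso639 is not covered here.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt standing in for license_url_to_identifier_mapping.csv.
LICENSE_MAP = {
    "creativecommons.org/licenses/by/4.0": "CC BY 4.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0 1.0",
}


def normalize_license_url(url):
    """Lowercase the URL and drop the scheme, a leading 'www.', and trailing slashes."""
    key = url.strip().lower()
    for prefix in ("https://", "http://"):
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    if key.startswith("www."):
        key = key[len("www."):]
    return key.rstrip("/")


def count_licenses(docs):
    """Count canonical license identifiers across metadata dicts."""
    counts = Counter()
    for doc in docs:
        key = normalize_license_url(doc.get("licenseurl", ""))
        # Unmapped or missing URLs fall into an assumed "UNKNOWN" bucket.
        counts[LICENSE_MAP.get(key, "UNKNOWN")] += 1
    return counts


def write_counts_csv(counts, handle, header=("LICENSE", "COUNT")):
    """Write counts as CSV rows, sorted by identifier."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    for name, count in sorted(counts.items()):
        writer.writerow([name, count])


if __name__ == "__main__":
    sample = [{"licenseurl": "http://creativecommons.org/licenses/by/4.0/"}]
    buffer = io.StringIO()
    write_counts_csv(count_licenses(sample), buffer)
    print(buffer.getvalue())
```

Normalizing before the lookup means `http`/`https` and trailing-slash variants of the same license collapse into one row, which is the point of keeping a canonical mapping.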
 ## Openverse

 **Description:** Openverse is a search engine for openly licensed media,
@@ -168,31 +198,3 @@ language edition of wikipedia. It runs on the Meta-Wiki API.
 - No API key required
 - Query limit: It is rate-limited only to prevent abuse
 - Data available through XML or JSON format
-
-## Internet Archive
-
-**Description:**
-The Internet Archive is a nonprofit digital library offering free access to millions of digital materials including books, movies, software, music, and websites. This project uses the Internet Archive’s Session and Search API to fetch metadata of items that reference Creative Commons licenses.
-
-**API documentation link:**
--[InternetArchive: A Python Interface to archive.org](https://internetarchive.readthedocs.io/en/stable/internetarchive.html)
--[The Internet Archive Python Library](https://archive.org/developers/internetarchive/)
--[The Internet Archive Search API reference](https://archive.org/advancedsearch.php)
--[A Python interface to archive.org.](https://pypi.org/project/internetarchive/)
-- Pagination supported via rows and start parameters
-- Python access via internetarchive library (search_items, ArchiveSession)
-- Query limit: None specified, but rate-limiting may apply (1000000 max at a time)
-- Data available through JSON format
-- Retry logic and session management implemented for reliability
-
-**Notes:**
-- This project queries for items containing `text:creativecommons.org` in their metadata.
-- The script extracts and normalizes license URLs and language codes
-- In summary, it queries licenseurl and language fields for all items containing "creativecommons.org" in their metadata
-- Aggregated counts are saved to CSV files for licenses and languages.
-- License normalization uses a canonical mapping defined in `license_url_to_identifier_mapping.csv`.
-- Language normalization using Babel and [iso-639](https://pypi.org/project/iso639-lang/) see [github information](https://github.com/jacksonllee/iso639), see also [iso-639 standards](https://www.loc.gov/standards/iso639-2/), you can also checkout [iso639-2](https://www.loc.gov/standards/iso639-2/php/English_list.php)