
docs: Add missing descriptions, sources, and licenses #663

Open · wants to merge 37 commits into main
Conversation

@dsmedia (Collaborator) commented Jan 12, 2025

Pull Request: Add missing descriptions, sources, and licenses to datapackage files

Objective:

Following up on the metadata infrastructure work in #634, #639, and #646, this PR adds missing description, source, and license metadata to datapackage_additions.toml. The checklist below indicates which metadata entries are missing and still need to be added. Schema entries (describing each dataset column) will be handled in a subsequent pull request.

Dataset descriptions are written to avoid formulations such as "This dataset...", consistent with PEP 257 – Docstring Conventions.
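For reference, a hypothetical entry might look like the following (field names follow the Frictionless Data Package spec; the actual layout of datapackage_additions.toml may differ):

```toml
# Illustrative values only — not an actual entry from this PR.
[[resources]]
name = "example-dataset"
description = "Monthly counts of example events, seasonally adjusted."  # avoids "This dataset..."

[[resources.sources]]
title = "Example Source Agency"
path = "https://example.gov/data"

[[resources.licenses]]
name = "CC-BY-4.0"
title = "Creative Commons Attribution 4.0 International"
path = "https://creativecommons.org/licenses/by/4.0/"
```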

Open Questions: Complete

  • Decide how to handle license information that cannot be determined after research, ensuring validation against the Frictionless standard for licenses.
  • [ ] Add a disclaimer noting that license information is deferred to a separate PR.
  • Regenerated datapackage.json and datapackage.md.

Status:

The following checklist indicates the completion status of the description, sources, and licenses metadata for each dataset. A green checkmark (✅) indicates the metadata is present; a red X (❌) indicates it is missing. The leading checkbox is only checked if all three types of metadata are present.

Process:
Changes are validated using scripts/build_datapackage.py, which generates machine-readable metadata describing the contents of /data/.
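For illustration, here is a minimal sketch of the kind of completeness check involved (not the actual script logic; the `[[resources]]` layout is an assumption):

```python
# Sketch of a metadata completeness check — scripts/build_datapackage.py
# does the real validation. Assumes a hypothetical [[resources]] layout.
import tomllib  # Python 3.11+

with open("datapackage_additions.toml", "rb") as f:
    additions = tomllib.load(f)

for resource in additions.get("resources", []):
    # A field counts as missing if absent or empty.
    missing = [key for key in ("description", "sources", "licenses")
               if not resource.get(key)]
    mark = "[x]" if not missing else "[ ]"
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{mark} {resource.get('name', '?')}: {status}")
```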

Legend:

  • ✅ - Indicates the metadata is present
  • ❌ - Indicates the metadata is missing
  • [x] - Indicates all three types of metadata are present
  • [ ] - Indicates one or more types of metadata are missing

Checklist:

Update datasets.toml with missing source metadata for 7zip.png dataset
- adds citation to protovis in description
- fixes link to image in sources
- adds license
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source
- expands description
- adds license
- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting U.S. government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA.gov government works page, which explains the nuances of licensing for data on U.S. government websites.
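In practice, such an entry might simply point to that page (illustrative snippet; exact titles vary by dataset):

```toml
[[resources.licenses]]
title = "U.S. Government Works"
path = "https://www.usa.gov/government-works"
```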
- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, adds license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)
- updates description, adds sources and license
- adds description, source and license
- makes the license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description
- corrects description, adds source, license
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)
@dsmedia (Collaborator, Author) commented Jan 20, 2025

Here is the code to validate the statistical description of normal-2d.json added in a86b9dd.

import pandas as pd
from scipy import stats

# Read data
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/normal-2d.json")

# Generate key statistics
stats_output = {
    "Sample size": len(df),
    "Means": df.mean().round(3).to_dict(),
    "Standard deviations": df.std().round(3).to_dict(),
    "Correlation": round(df.corr().iloc[0, 1], 3),
    "Ranges": {col: [round(df[col].min(), 3), round(df[col].max(), 3)] for col in df.columns},
    "Normality p-values": {col: round(stats.normaltest(df[col]).pvalue, 3) for col in df.columns}
}

# Print statistics
print("Dataset Statistics for Description:")
print(f"Sample size: {stats_output['Sample size']} points")
print(f"Centers: {stats_output['Means']}")
print(f"Standard deviations: {stats_output['Standard deviations']}")
print(f"Correlation: {stats_output['Correlation']}")
print(f"Ranges: {stats_output['Ranges']}")
print(f"Normality test p-values: {stats_output['Normality p-values']}")

...which produced:

Dataset Statistics for Description:
Sample size: 500 points
Centers: {'u': 0.005, 'v': -0.011}
Standard deviations: {'u': 0.192, 'v': 0.199}
Correlation: 0.026
Ranges: {'u': [-0.578, 0.533], 'v': [-0.534, 0.606]}
Normality test p-values: {'u': 0.68, 'v': 0.763}

@dsmedia (Collaborator, Author) commented Jan 23, 2025

Here is the code to validate the statistical description of uniform-2d.json added in b55b72f.

import pandas as pd

# Read data
url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/uniform-2d.json"
data = pd.read_json(url)

# Calculate statistics
stats = {
    'count': len(data),
    'meanU': data.u.mean(),
    'meanV': data.v.mean(),
    'minU': data.u.min(),
    'maxU': data.u.max(), 
    'minV': data.v.min(),
    'maxV': data.v.max(),
    'stdU': data.u.std(),
    'stdV': data.v.std(),
    'correlation': data.u.corr(data.v)
}

print("Dataset Verification:")
print(f"Count: {stats['count']}")
print(f"\nMeans:")
print(f"u: {stats['meanU']:.6f}")
print(f"v: {stats['meanV']:.6f}")
print(f"\nRanges:")
print(f"u: [{stats['minU']:.6f}, {stats['maxU']:.6f}]")
print(f"v: [{stats['minV']:.6f}, {stats['maxV']:.6f}]")
print(f"\nStandard deviations:")
print(f"u: {stats['stdU']:.6f}")
print(f"v: {stats['stdV']:.6f}")
print(f"\nCorrelation: {stats['correlation']:.6f}")

@dsmedia marked this pull request as ready for review January 24, 2025 23:21
@dsmedia (Collaborator, Author) commented Jan 28, 2025

@dangotbanned After your review (and no rush on that) would you mind handling the merge for this one? I've reviewed it thoroughly and it looks good to me, but I'd appreciate your experience on the final step.

@dangotbanned (Member) commented:

> #663 (comment)

Thanks for the ping @dsmedia

Yeah can do, will try to get to this soonish.
One general note I have now is to think about reducing the number of:

> This dataset ...
> This file ...

There's a somewhat related ruff rule with examples.

Some alternatives might be:

> Shows ...
> Demonstrates ...

Or just skipping that part entirely (e.g. udistrict.json).
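Concretely, the before/after might look like this (illustrative strings only):

```toml
# before
description = "This dataset contains monthly unemployment counts by industry."
# after
description = "Monthly unemployment counts by industry."
```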

It might seem nitpicky, but taking a step back to consider this usually helps find a consistent "voice" for the docs/descriptions when viewed as a whole.


Really great work on getting all of this information together, I'm sure it wasn't an easy job

@dangotbanned self-assigned this Jan 28, 2025
- Partial fix for vega#663 (comment)
- edits descriptions through earthquakes.json
@dsmedia (Collaborator, Author) commented Jan 29, 2025

> One general note I have now is to think about reducing the number of:
>
> This dataset ...
> This file ...

Apt suggestion @dangotbanned, and interesting to see how thoughtfully this kind of formatting style has been considered in PEP 257 – Docstring Conventions. I've started to address this in e193fe5 and will continue to work through the remainder. Let me know if the adjustments were what you had in mind.

- avoids 'this dataset' and similar
- reruns datapackage script (json, md)
- new source found confirming the data shown is from 1995, not 2008, consistent with CDC data
- removes link to vega example that references wrong source year
@dangotbanned (Member) commented:

> > One general note I have now is to think about reducing the number of:
> >
> > This dataset ...
> > This file ...
>
> Apt suggestion @dangotbanned, and interesting to see how thoughtfully this kind of formatting style has been considered in PEP 257 – Docstring Conventions. I've started to address this in e193fe5 and will continue to work through the remainder. Let me know if the adjustments were what you had in mind.

@dsmedia I've only gone through the recent diffs, but yeah these changes are looking great 🎉

Side note

I've just pushed an overhaul of (vega/altair#3631) in (vega/altair@b606a7d).
Really starting to make good use of datapackage.json for technical reasons, but the descriptions will be super helpful to point to in some form later.
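As a minimal sketch, the descriptions are already reachable straight from the published package (the resource name below is an assumption for illustration):

```python
import json
from urllib.request import urlopen

# Read the generated package metadata (descriptions, sources, licenses).
url = "https://raw.githubusercontent.com/vega/vega-datasets/main/datapackage.json"
with urlopen(url) as f:
    pkg = json.load(f)

# Look up one resource by name; "normal-2d" is an assumed resource name.
for resource in pkg["resources"]:
    if resource.get("name") == "normal-2d":
        print(resource.get("description"))
        break
```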
