
docs: Add missing descriptions, sources, and licenses #663

Open · wants to merge 37 commits into main
Conversation

@dsmedia (Collaborator) commented Jan 12, 2025

Pull Request: Add missing descriptions, sources, and licenses to datapackage files

Objective:

Following up on the metadata infrastructure work in #634, #639, and #646, this PR adds missing description, source, and license metadata to datapackage_additions.toml. The checklist below indicates which metadata entries are missing and still need to be added. Schema entries (describing each dataset column) will be handled in a subsequent pull request.

Dataset descriptions are written to avoid formulations such as "This dataset...", consistent with PEP 257 – Docstring Conventions.
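For reference, a hypothetical entry might look like the following (field names follow the Frictionless Data Package spec; the actual layout of datapackage_additions.toml may differ):

```toml
# Illustrative values only — not an actual entry from this PR.
[[resources]]
name = "example-dataset"
description = "Monthly counts of example events, seasonally adjusted."  # avoids "This dataset..."

[[resources.sources]]
title = "Example Source Agency"
path = "https://example.gov/data"

[[resources.licenses]]
name = "CC-BY-4.0"
title = "Creative Commons Attribution 4.0 International"
path = "https://creativecommons.org/licenses/by/4.0/"
```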

Open Questions: Complete

  • Decide how to handle license information that cannot be determined after research, ensuring validation against the Frictionless standard for licenses.
  • [ ] Add a disclaimer noting that license information is deferred to a separate PR.
  • Regenerated datapackage.json and datapackage.md.

Status:

The following checklist indicates the completion status of the description, sources, and licenses metadata for each dataset. A green checkmark (✅) indicates the metadata is present; a red X (❌) indicates it is missing. The leading checkbox is only checked if all three types of metadata are present.

Process:
Changes are validated using scripts/build_datapackage.py, which generates machine-readable metadata describing the contents of /data/.
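For illustration, here is a minimal sketch of the kind of completeness check involved (not the actual script logic; the `[[resources]]` layout is an assumption):

```python
# Sketch of a metadata completeness check — scripts/build_datapackage.py
# does the real validation. Assumes a hypothetical [[resources]] layout.
import tomllib  # Python 3.11+

with open("datapackage_additions.toml", "rb") as f:
    additions = tomllib.load(f)

for resource in additions.get("resources", []):
    # A field counts as missing if absent or empty.
    missing = [key for key in ("description", "sources", "licenses")
               if not resource.get(key)]
    mark = "[x]" if not missing else "[ ]"
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{mark} {resource.get('name', '?')}: {status}")
```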

Legend:

  • ✅ - Indicates the metadata is present
  • ❌ - Indicates the metadata is missing
  • [x] - Indicates all three types of metadata are present
  • [ ] - Indicates one or more types of metadata are missing

Checklist:

Update datasets.toml with missing source metadata for 7zip.png dataset
- adds citation to protovis in description
- fixes link to image in sources
- adds license
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source
- expands description
- adds license
- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting U.S. government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA.gov government works page, which explains the nuances of licensing for data on U.S. government websites.
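In practice, such an entry might simply point to that page (illustrative snippet; exact titles vary by dataset):

```toml
[[resources.licenses]]
title = "U.S. Government Works"
path = "https://www.usa.gov/government-works"
```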
- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, adds license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)
- updates description, adds sources and license
- adds description, source and license
- makes the license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description
- corrects description, adds source, license
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)
@dsmedia (Collaborator, Author) commented Jan 20, 2025

Here is the code to validate the statistical description of normal-2d.json added in a86b9dd.

import pandas as pd
from scipy import stats

# Read data
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/normal-2d.json")

# Generate key statistics
stats_output = {
    "Sample size": len(df),
    "Means": df.mean().round(3).to_dict(),
    "Standard deviations": df.std().round(3).to_dict(),
    "Correlation": round(df.corr().iloc[0, 1], 3),
    "Ranges": {col: [round(df[col].min(), 3), round(df[col].max(), 3)] for col in df.columns},
    "Normality p-values": {col: round(stats.normaltest(df[col]).pvalue, 3) for col in df.columns}
}

# Print statistics
print("Dataset Statistics for Description:")
print(f"Sample size: {stats_output['Sample size']} points")
print(f"Centers: {stats_output['Means']}")
print(f"Standard deviations: {stats_output['Standard deviations']}")
print(f"Correlation: {stats_output['Correlation']}")
print(f"Ranges: {stats_output['Ranges']}")
print(f"Normality test p-values: {stats_output['Normality p-values']}")

...which produced:

Dataset Statistics for Description:
Sample size: 500 points
Centers: {'u': 0.005, 'v': -0.011}
Standard deviations: {'u': 0.192, 'v': 0.199}
Correlation: 0.026
Ranges: {'u': [-0.578, 0.533], 'v': [-0.534, 0.606]}
Normality test p-values: {'u': 0.68, 'v': 0.763}

@dsmedia (Collaborator, Author) commented Jan 23, 2025

Here is the code to validate the statistical description of uniform-2d.json added in b55b72f.

import pandas as pd

# Read data
url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/uniform-2d.json"
data = pd.read_json(url)

# Calculate statistics
stats = {
    'count': len(data),
    'meanU': data.u.mean(),
    'meanV': data.v.mean(),
    'minU': data.u.min(),
    'maxU': data.u.max(), 
    'minV': data.v.min(),
    'maxV': data.v.max(),
    'stdU': data.u.std(),
    'stdV': data.v.std(),
    'correlation': data.u.corr(data.v)
}

print("Dataset Verification:")
print(f"Count: {stats['count']}")
print(f"\nMeans:")
print(f"u: {stats['meanU']:.6f}")
print(f"v: {stats['meanV']:.6f}")
print(f"\nRanges:")
print(f"u: [{stats['minU']:.6f}, {stats['maxU']:.6f}]")
print(f"v: [{stats['minV']:.6f}, {stats['maxV']:.6f}]")
print(f"\nStandard deviations:")
print(f"u: {stats['stdU']:.6f}")
print(f"v: {stats['stdV']:.6f}")
print(f"\nCorrelation: {stats['correlation']:.6f}")

@dsmedia marked this pull request as ready for review January 24, 2025 23:21
@dsmedia (Collaborator, Author) commented Jan 28, 2025

@dangotbanned After your review (and no rush on that) would you mind handling the merge for this one? I've reviewed it thoroughly and it looks good to me, but I'd appreciate your experience on the final step.

@dangotbanned (Member) commented:

> #663 (comment)

Thanks for the ping @dsmedia

Yeah can do, will try to get to this soonish.
One general note I have now is to think about reducing the number of:

> This dataset ...
> This file ...

There's a somewhat related ruff rule with examples.

Some alternatives might be:

> Shows ...
> Demonstrates ...

Or just skipping that part entirely (e.g. udistrict.json).
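Concretely, the before/after might look like this (illustrative strings only):

```toml
# before
description = "This dataset contains monthly unemployment counts by industry."
# after
description = "Monthly unemployment counts by industry."
```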

It might seem nitpicky, but taking a step back to consider this usually helps find a consistent "voice" for the docs/descriptions when viewed as a whole.


Really great work on getting all of this information together, I'm sure it wasn't an easy job

@dangotbanned self-assigned this Jan 28, 2025
- Partial fix for vega#663 (comment)
- edits descriptions through earthquakes.json
@dsmedia (Collaborator, Author) commented Jan 29, 2025

> One general note I have now is to think about reducing the number of:
>
> This dataset ...
> This file ...

Apt suggestion @dangotbanned, and interesting to see how thoughtfully this kind of formatting style has been considered in PEP 257 – Docstring Conventions. I've started to address this in e193fe5 and will continue to work through the remainder. Let me know if the adjustments were what you had in mind.

- avoids 'this dataset' and similar
- reruns datapackage script (json, md)
- new source found confirming the data shown is from 1995, not 2008, consistent with CDC data
- removes link to vega example that references wrong source year
@dangotbanned (Member) commented:

> > One general note I have now is to think about reducing the number of:
> >
> > This dataset ...
> > This file ...
>
> Apt suggestion @dangotbanned, and interesting to see how thoughtfully this kind of formatting style has been considered in PEP 257 – Docstring Conventions. I've started to address this in e193fe5 and will continue to work through the remainder. Let me know if the adjustments were what you had in mind.

@dsmedia I've only gone through the recent diffs, but yeah these changes are looking great 🎉

Side note

I've just pushed an overhaul of (vega/altair#3631) in (vega/altair@b606a7d).
Really starting to make good use of datapackage.json for technical reasons, but the descriptions will be super helpful to point to in some form later.
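As a minimal sketch, the descriptions are already reachable straight from the published package (the resource name below is an assumption for illustration):

```python
import json
from urllib.request import urlopen

# Read the generated package metadata (descriptions, sources, licenses).
url = "https://raw.githubusercontent.com/vega/vega-datasets/main/datapackage.json"
with urlopen(url) as f:
    pkg = json.load(f)

# Look up one resource by name; "normal-2d" is an assumed resource name.
for resource in pkg["resources"]:
    if resource.get("name") == "normal-2d":
        print(resource.get("description"))
        break
```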
