docs: Add missing descriptions, sources, and licenses #663
base: main
Conversation
Update datasets.toml with missing source metadata for 7zip.png dataset
…a-datasets into add-missing-metadata
- adds citation to protovis in description - fixes link to image in sources - adds license
- fixes bad link in annual-precip.json; adds license - adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
- expands description to explain units and seasonal adjustment - adds additional source directly to dataset csv - adds license details from source
- expands description - adds license
- Document that data used in flights* datasets are collected under US DOT requirements - Add row counts to flight dataset descriptions (2k-3M rows) - Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
- adds license for londonBoroughs.json - adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json) - expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
- global-temp.csv - iowa-electricity.csv - jobs.json - monarchs.json - political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license - population_engineers_hurricanes.csv - seattle-weather-hourly-normals.csv - seattle-weather.csv - unemployment-across-industries.json - unemployment.tsv - us-employment.csv - weather.csv Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is a doubt, a link is provided to the USA government works page, which explains the nuances of licenses for data on US government web sites.
- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json) - airports.csv (adds description, sources, license) - barley.csv (updates description and source; adds license) - disasters.csv (expands description, updates sources, add license) - driving.json (adds description, updates source, adds license) - ohlc.json (modifies description, adds additional source, and license) - stocks.csv (adds source, license) - weekly-weather.json (adds source, license) - windvectors.csv (adds source, license)
- updates description, adds sources and license
- adds description, source and license - makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
- focuses on how data is used in edge bundling example - would benefit from additional detail in the description
- corrects description, adds source, license
- ffox.png (updates license) - gapminder.json (adds license) - gimp.png (updates description, adds source, license) - github.csv (adds description, source, license) - lookup_groups.csv, lookup_people.csv (adds description, source, license) - miserables.json (adds description, source, license) - movies.json (adds source, license) - normal-2d.json (adds description, source, license) - stocks.csv (adds description)
Here is the code to validate the statistical description of normal-2d.json:
```python
import pandas as pd
from scipy import stats
# Read data
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/normal-2d.json")
# Generate key statistics
stats_output = {
"Sample size": len(df),
"Means": df.mean().round(3).to_dict(),
"Standard deviations": df.std().round(3).to_dict(),
"Correlation": round(df.corr().iloc[0, 1], 3),
"Ranges": {col: [round(df[col].min(), 3), round(df[col].max(), 3)] for col in df.columns},
"Normality p-values": {col: round(stats.normaltest(df[col]).pvalue, 3) for col in df.columns}
}
# Print statistics
print("Dataset Statistics for Description:")
print(f"Sample size: {stats_output['Sample size']} points")
print(f"Centers: {stats_output['Means']}")
print(f"Standard deviations: {stats_output['Standard deviations']}")
print(f"Correlation: {stats_output['Correlation']}")
print(f"Ranges: {stats_output['Ranges']}")
print(f"Normality test p-values: {stats_output['Normality p-values']}") ...which produced:
|
- related to vega#668
Here is the code to validate the statistical description of uniform-2d.json added in b55b72f:
```python
import pandas as pd
# Read data
url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/uniform-2d.json"
data = pd.read_json(url)
# Calculate statistics
stats = {
'count': len(data),
'meanU': data.u.mean(),
'meanV': data.v.mean(),
'minU': data.u.min(),
'maxU': data.u.max(),
'minV': data.v.min(),
'maxV': data.v.max(),
'stdU': data.u.std(),
'stdV': data.v.std(),
'correlation': data.u.corr(data.v)
}
print("Dataset Verification:")
print(f"Count: {stats['count']}")
print(f"\nMeans:")
print(f"u: {stats['meanU']:.6f}")
print(f"v: {stats['meanV']:.6f}")
print(f"\nRanges:")
print(f"u: [{stats['minU']:.6f}, {stats['maxU']:.6f}]")
print(f"v: [{stats['minV']:.6f}, {stats['maxV']:.6f}]")
print(f"\nStandard deviations:")
print(f"u: {stats['stdU']:.6f}")
print(f"v: {stats['stdV']:.6f}")
print(f"\nCorrelation: {stats['correlation']:.6f}") |
- dataset was removed from repo in vega#671
- relies on income.py script from vega#672
…a-datasets into add-missing-metadata
- adds description, sources for sp500.csv - fixes formatting for weekly-weather.json
- uv run scripts/build_datapackage.py # doctest: +SKIP
@dangotbanned After your review (and no rush on that) would you mind handling the merge for this one? I've reviewed it thoroughly and it looks good to me, but I'd appreciate your experience on the final step.
Thanks for the ping @dsmedia. Yeah, can do; will try to get to this soonish.
There's a somewhat related ruff rule with examples. Some alternatives might be:
Or just skipping that part entirely. It might seem nitpicky, but taking a step back to consider this usually helps find a consistent "voice" for the docs/descriptions when viewed as a whole. Really great work on getting all of this information together; I'm sure it wasn't an easy job.
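For context, here is a minimal sketch of the convention under discussion, assuming the ruff rule referred to is pydocstyle's D404 (a docstring's first word should not be "This"); the function names and docstrings are hypothetical:

```python
# Hypothetical example: D404 flags a docstring whose first word is "This".
def seattle_weather():
    """This dataset contains daily weather observations for Seattle."""


# Leading with the content itself satisfies the rule and matches
# PEP 257's summary-line style.
def seattle_weather_reworded():
    """Daily weather observations for Seattle."""
```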
- Partial fix for vega#663 (comment) - edits descriptions through earthquakes.json
Apt suggestion @dangotbanned, and interesting to see how thoughtfully this kind of formatting style has been considered in PEP 257 – Docstring Conventions.
- avoids 'this dataset' and similar - reruns datapackage script (json, md)
- new source found confirming that the data shown is from 1995, not 2008, consistent with CDC data - removes link to vega example that references the wrong source year
@dsmedia I've only gone through the recent diffs, but yeah these changes are looking great 🎉 Side note: I've just pushed an overhaul of (vega/altair#3631) in (vega/altair@b606a7d).
Pull Request: Add missing descriptions, sources, and licenses to datapackage files
Objective:
Following up on the metadata infrastructure work in #634, #639, and #646, this PR adds missing description, source, and license metadata to datapackage_additions.toml. The checklist below indicates which metadata entries are missing and still need to be added. Schema entries (describing each dataset column) will be handled in a subsequent pull request.
Dataset descriptions written to avoid formulations such as "This dataset...", consistent with PEP 257 – Docstring Conventions
Open Questions: Complete
- [ ] add disclaimer about license information (deferring to separate PR)
Status:
The following checklist indicates the completion status of the description, sources, and licenses metadata for each dataset. A green checkmark (✅) indicates the metadata is present; a red X (❌) indicates the metadata is missing. The leading checkbox is only checked if all three types of metadata are present.
Process:
Changes are validated using scripts/build_datapackage.py, which generates machine-readable metadata describing the contents of /data/.
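As a rough sketch of the completeness check this process enables, the snippet below walks the generated metadata and reports which of the three fields each dataset still lacks. It assumes the script writes a Frictionless-style datapackage.json at the repository root; the path and per-resource field names are assumptions, not the script's documented output:

```python
import json

# Load the generated datapackage (path assumed; adjust to the repo layout).
with open("datapackage.json") as f:
    package = json.load(f)

# Report which of the three metadata types each dataset still lacks,
# mirroring the checklist below.
for resource in package.get("resources", []):
    missing = [key for key in ("description", "sources", "licenses")
               if not resource.get(key)]
    status = "✅ complete" if not missing else f"❌ missing: {', '.join(missing)}"
    print(f"{resource.get('name', '?')}: {status}")
```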
Legend:
✅ - Indicates the metadata is present
❌ - Indicates the metadata is missing
[x] - Indicates all three types of metadata are present
[ ] - Indicates one or more types of metadata are missing
Checklist: