
feat: add generation script for us-state-capitals.json #668

Merged

dangotbanned merged 12 commits into main from add-script-statecapitals on Jan 24, 2025

Conversation

@dsmedia (Collaborator) commented Jan 21, 2025

  • Adds a script to generate the us-state-capitals.json dataset using ~~OpenStreetMap~~ USGS data. This prepares for the upcoming metadata improvements in docs: Add missing descriptions, sources, and licenses #663, which will properly document the source (~~OpenStreetMap~~ USGS) and license (~~ODbL~~ Public Domain) for the geographic coordinates. The source/license was previously unspecified for this dataset.

Small differences between legacy (unsourced) and new scripted (OpenStreetMap) geographic coordinates (a sketch of the distance computation follows the list):
  • Alabama: 2.07 miles
  • Alaska: 0.00 miles
  • Arizona: 0.02 miles
  • Arkansas: 2.47 miles
  • California: 2.22 miles
  • Colorado: 0.04 miles
  • Connecticut: 0.73 miles
  • Delaware: 0.29 miles
  • Florida: 1.07 miles
  • Georgia: 0.76 miles
  • Hawaii: 1.92 miles
  • Idaho: 1.85 miles
  • Illinois: 1.14 miles
  • Indiana: 1.66 miles
  • Iowa: 0.35 miles
  • Kansas: 0.91 miles
  • Kentucky: 0.61 miles
  • Louisiana: 2.85 miles
  • Maine: 0.61 miles
  • Maryland: 0.60 miles
  • Massachusetts: 8.48 miles
  • Michigan: 0.40 miles
  • Minnesota: 0.05 miles
  • Mississippi: 1.97 miles
  • Missouri: 0.96 miles
  • Montana: 0.49 miles
  • Nebraska: 1.70 miles
  • Nevada: 0.80 miles
  • New Hampshire: 1.07 miles
  • New Jersey: 0.52 miles
  • New Mexico: 2.03 miles
  • New York: 1.47 miles
  • North Carolina: 0.65 miles
  • North Dakota: 0.75 miles
  • Ohio: 0.00 miles
  • Oklahoma: 1.20 miles
  • Oregon: 0.59 miles
  • Pennsylvania: 0.60 miles
  • Rhode Island: 0.48 miles
  • South Carolina: 0.11 miles
  • South Dakota: 0.73 miles
  • Tennessee: 0.57 miles
  • Texas: 0.49 miles
  • Utah: 0.46 miles
  • Vermont: 0.45 miles
  • Virginia: 1.41 miles
  • Washington: 0.21 miles
  • West Virginia: 0.08 miles
  • Wisconsin: 0.03 miles
  • Wyoming: 1.02 miles
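The per-state figures above are presumably great-circle distances between the two coordinate pairs. A minimal haversine sketch of how such a comparison can be computed (the sample coordinates are illustrative, not values from the dataset):

```python
from math import asin, cos, radians, sin, sqrt

def miles_between(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles via the haversine formula."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))

# e.g. legacy vs. scripted coordinates for one capital (illustrative values)
print(round(miles_between(32.377, -86.300, 32.361, -86.279), 2))
```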

@dangotbanned looking at the diff, I noticed the hash of this dataset did not change in the datapackage file even after the JSON itself was changed. Just confirming this is expected.

slight adjustment in lon/lat but these figures now have a traceable source
@dsmedia dsmedia marked this pull request as ready for review January 21, 2025 03:51
@dangotbanned (Member) commented

> @dangotbanned looking at the diff, I noticed the hash of this dataset did not change in the datapackage file even after the JSON itself was changed. Just confirming this is expected.

Thanks for the ping @dsmedia

I suspected this might end up being needed:

build_datapackage.request_sha()

```python
def request_sha(
    ref: str = "main", /, *, api_version: str = "2022-11-28"
) -> Mapping[str, str]:
    """
    Use `Get a tree`_ to retrieve a hash for each dataset.

    Parameters
    ----------
    ref
        The SHA1 value or ref (`branch`_ or `tag`_) name of the tree.
    api_version
        The `GitHub REST API version`_.

    Returns
    -------
    Mapping from `Resource.path`_ to `Resource.hash`_.

    .. _Get a tree:
        https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree
    .. _branch:
        https://github.com/vega/vega-datasets/branches
    .. _tag:
        https://github.com/vega/vega-datasets/tags
    .. _GitHub REST API version:
        https://docs.github.com/en/rest/about-the-rest-api/api-versions?apiVersion=2022-11-28
    .. _Resource.path:
        https://datapackage.org/standard/data-resource/#path-or-data
    .. _Resource.hash:
        https://datapackage.org/standard/data-resource/#hash
    """
    DATA = "data"
    TREES = "https://api.github.com/repos/vega/vega-datasets/git/trees"
    headers = {"X-GitHub-Api-Version": api_version}
    url = f"{TREES}/{ref}"
    msg = f"Retrieving sha values from {url!r}"
    logger.info(msg)
    with niquests.get(url, headers=headers) as resp:
        root = resp.json()
    query = (tree["url"] for tree in root["tree"] if tree["path"] == DATA)
    if data_url := next(query, None):
        with niquests.get(data_url, headers=headers) as resp:
            trees = resp.json()
        return {t["path"]: _to_hash(t["sha"]) for t in trees["tree"]}
    msg = f"Did not find a tree for {DATA!r} in response:\n{root!r}"
```

Locally, could you try changing this line?

```python
gh_sha1 = request_sha("main")
```

Using this instead should point to the changes on your branch:

```python
gh_sha1 = request_sha("add-script-statecapitals")
```

If that works, then I can try parameterizing request_sha(ref=...) with something like https://github.com/vega/altair/blob/aaca9bccc07e4c372077608621424c2b9925d574/tools/sync_website.py#L102-L103
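One possible shape for that parameterization (just a sketch; the `current_branch` helper and its fallback behavior are my assumptions, not code from this PR):

```python
import subprocess

def current_branch(default: str = "main") -> str:
    """Return the checked-out git branch, falling back to ``default``."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True,
        text=True,
        check=False,
    )
    name = result.stdout.strip()
    return name or default  # a detached HEAD prints nothing

gh_sha1 = request_sha(current_branch())
```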

An alternative would be working with git directly, like:

```
cd data
$current_branch = git branch --show-current
git ls-tree $current_branch
```
Output

I was on main when I ran this.
The same commands should give you a new hash for us-state-capitals.json on add-script-statecapitals.

```
100644 blob 6586d6c00887cd48850099c174a42bb1677ade0c    7zip.png
100644 blob 608ba6d51fa70584c3fa1d31eb94533302553838    airports.csv
100644 blob 719e73406cfc08f16dda651513ae1113edd75845    annual-precip.json
100644 blob 11ae97090b6263bdf0c8661156a44a5b782e0787    anscombe.json
100644 blob 8dc50de2509b6e197ce95c24c98f90d9d1ab138c    barley.json
100644 blob 1b8b190c9bc02ef7bcbfe5a8a70f61b1616d3f6c    birdstrikes.csv
100644 blob 5b18c08b28fb782f54ca98ce6a1dd220f269adf1    budget.json
100644 blob 8a909e24f698a3b0f6c637c30ec95e7e17df7ef6    budgets.json
100644 blob d8a82abaad7dba4f9cd8cee402ba3bf07e70d0e4    burtin.json
100644 blob 1d56d3fa6da01af9ece2d6397892fe5bb6f47c3d    cars.json
100644 blob b8715cbd2a8d0c139020a73fdb4d231f8bde193a    co2-concentration.csv
100644 blob 0070959b7f1a09475baa5099098240ae81026e72    countries.json
100644 blob d2df500c612051a21fe324237a465a62d5fe01b6    crimea.json
100755 blob 0584ed86190870b0089d9ea67c94f3dd3feb0ec8    disasters.csv
100644 blob 33d0afc57fb1005e69cd3e8a6c77a26670d91979    driving.json
100644 blob ed4c47436c09d5cc5f428c233fbd8074c0346fd0    earthquakes.json
100644 blob 0691709484a75e9d8ee55a22b1980d67d239c2c4    ffox.png
100644 blob 10bbe538daaa34014cd5173b331f7d3c10bfda49    flare-dependencies.json
100644 blob d232ea60f875de87a7d8fc414876e19356a98b6b    flare.json
100644 blob 769a34f3d0442be8f356651463fe925ad8b3759d    flights-10k.json
100644 blob 74f6b3cf8b779e3ff204be2f5a9762763d50a095    flights-200k.arrow
100644 blob 4722e02637cf5f38ad9ea5d1f48cae7872dce22d    flights-200k.json
100644 blob 20c920b46db4f664bed3e1420b8348527cd7c41e    flights-20k.json
100644 blob d9221dc7cd477209bf87e680be3c881d8fee53cd    flights-2k.json
100644 blob 9c4e0b480a1a60954a7e5c6bcc43e1c91a73caaa    flights-3m.parquet
100644 blob 8459fa09e3ba8197928b5dba0b9f5cc380629758    flights-5k.json
100644 blob 0ba03114891e97cfc3f83d9e3569259e7f07af7b    flights-airport.csv
100644 blob d07898748997b9716ae699e9c2d5b91b4bb48a51    football.json
100644 blob abce37a932917085023a345b1a004396e9355ac3    gapminder-health-income.csv
100644 blob 8cb2f0fc23ce612e5f0c7bbe3dcac57f6764b7b3    gapminder.json
100644 blob cf0505dd72eb52558f6f71bd6f43663df4f2f82c    gimp.png
100644 blob 18547064dd687c328ea2fb5023cae6417ca6f050    github.csv
100644 blob 01a4f05ed45ce939307dcd9bc4e75ed5cd1ab202    global-temp.csv
100644 blob ebfd02fd584009ee391bfc5d33972e4c94f507ab    income.json
100644 blob 214238f23d7a57e3398f4e9f1e87e61abb23cafc    iowa-electricity.csv
100644 blob 69d386f47305f4d8fd2886e805004fbdd71568e9    jobs.json
100644 blob 94ee8ad8198d2954f77e3a98268d8b1f7fe7d086    la-riots.csv
100644 blob d90805055ffdfe5163a7655c4847dc61df45f92b    londonBoroughs.json
100644 blob 2e24c01140cfbcad5e1c859be6df4efebca2fbf5    londonCentroids.json
100644 blob 1b21ea5339320090b106082bd9d39a1055aadb18    londonTubeLines.json
100644 blob 741df36729a9d84d18ec42f23a386b53e7e3c428    lookup_groups.csv
100644 blob c79f69afb3ff81a0c8ddc01f5cf2f078e288457c    lookup_people.csv
100644 blob a8b0faaa94c7425c49fe36ea1a93319430fec426    miserables.json
100644 blob 921dfa487a4198cfe78f743aa0aa87ad921642df    monarchs.json
100644 blob e38178f99454568c5160fc759184a1a1471cc558    movies.json
100644 blob 4303306ec275209fcba008cbd3a5f29c9e612424    normal-2d.json
100644 blob 6da8129ed0b0333c88302e153824b06f7859aac9    obesity.json
100644 blob 9b3d93e8479d3ddeee29b5e22909132346ac0a3b    ohlc.json
100644 blob 517b6d3267174b1b65691a37cbd59c1739155866    penguins.json
100644 blob 01df4411cb16bf758fe8ffa6529507419189edc2    platformer-terrain.json
100644 blob 4716a117308962f3596179d7d7d2ad729a19cda7    points.json
100644 blob 4aa2e19fa392cc9448aa8ffbdad15b014371f499    political-contributions.json
100644 blob 680fd336e777314198450721c31227a11f02411f    population.json
100644 blob 3bad66ef911b93c641edc21f2034302348bffaf9    population_engineers_hurricanes.csv
100644 blob d55461adc9742bb061f6072b694aaf73e8b529db    seattle-weather-hourly-normals.csv
100644 blob 0f38b53bdc1c42c5e5d484f33b9d4d7b229e0e59    seattle-weather.csv
100644 blob b82f20656d0521801db7c5599a6c990415a8aaff    sp500-2000.csv
100644 blob 0eb287fb7c207f4ed392821d67a92267180fc8cf    sp500.csv
100644 blob 58e2ce1bed01eeebe29f5b4be32344aaec5532c0    stocks.csv
100644 blob 65675107d81c19ffab260ac1f235f3e477fe8982    udistrict.json
100644 blob 4d769356c95c40a9807a7d048ab81aa56ae77df0    unemployment-across-industries.json
100644 blob d1aca19c4821fdc3b4270989661a1787d38588d0    unemployment.tsv
100644 blob c6120dd8887a0841a9fcc31e247463dbd3d0a996    uniform-2d.json
100644 blob ff7a7e679c46f2d1eb85cc92521b990f1a7a5c7a    us-10m.json
100644 blob 8795be57cf1e004f4ecba44cab2b324a074330df    us-employment.csv
100644 blob 9c3211c5058c899412c30f5992a77c54a1b80066    us-state-capitals.json
100644 blob 841151dbfbc5f6db3e19904557abd7a7aad0efd2    volcano.json
100644 blob 0e7e853f4c5b67615da261d5d343824a43510f50    weather.csv
100644 blob bd42a3e2403e7ccd6baaa89f93e7f0c164e0c185    weekly-weather.json
100644 blob cde46b43fc82f4c3c2a37ddcfe99fd5f4d8d8791    wheat.json
100644 blob ed686b0ba613abd59d09fcd946b5030a918b8154    windvectors.csv
100644 blob a1ce852de6f2713c94c0c284039506ca2d4f3dee    world-110m.json
100644 blob d3df33e12be0d0544c95f1bd47005add4b7010be    zipcodes.csv
```

Either way, I'll try and get a PR up today so we can merge that ahead of this PR 🙂

@dangotbanned (Member) left a comment

Thanks for this @dsmedia

I'll look into #668 (comment) some more today

Let me know if you need help with the caching stuff - or if you have any alternative ideas you wanna talk through 👍

scripts/us-state-capitals.py (5 review threads, outdated, resolved)
@dangotbanned dangotbanned reopened this Jan 21, 2025
@dsmedia dsmedia marked this pull request as draft January 22, 2025 03:22
- relies on USGS public domain service
- new lookup file, us-state-codes.json
- Incorporates latest build_datapackage.py changes from PR #669, which improve hash calculation using local git commands (see the sketch below)
- uvx ruff check is run
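For context on the local hash calculation (my illustration, not the actual #669 code): a git blob SHA-1 is just the SHA-1 of a `blob <size>\0` header followed by the file bytes, so it can be reproduced without shelling out to git:

```python
import hashlib
from pathlib import Path

def git_blob_sha1(path: Path) -> str:
    """Compute the SHA-1 git assigns a file, matching ``git ls-tree`` output."""
    data = path.read_bytes()
    header = f"blob {len(data)}\0".encode()  # git's object header
    return hashlib.sha1(header + data).hexdigest()

# e.g. git_blob_sha1(Path("data/us-state-capitals.json"))
```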
@dsmedia dsmedia marked this pull request as ready for review January 23, 2025 03:59
@dsmedia dsmedia requested a review from dangotbanned January 23, 2025 03:59
@dangotbanned dangotbanned self-assigned this Jan 23, 2025
- Logical flow is `Feature` -> `CapitolFeature` -> `StateCapitol`
- Made `Feature` generic to promote reuse (e.g. #667; see the sketch after this list)
- Ensure an exit code is produced when `get_state_capitols` fails
  - Previously would print to console, but wouldn't block a task runner/CI
- Move the territory filter into the query
  - Previously requested more than we wanted
- Added references to things I needed context for (new to spatial data)
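A minimal sketch of the generic `Feature` idea (only the class names come from the notes above; the fields and the `Props` parameter are my guesses):

```python
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

Props = TypeVar("Props")

@dataclass
class Feature(Generic[Props]):
    """A GeoJSON-style feature, generic over its properties payload."""
    geometry: dict[str, Any]
    properties: Props

@dataclass
class CapitolProps:
    state: str
    name: str

CapitolFeature = Feature[CapitolProps]  # one possible specialization
```

And the exit-code point amounts to returning a non-zero status instead of only printing, so a task runner or CI actually fails (again a sketch, not the script's code):

```python
import sys

def main() -> int:
    try:
        ...  # fetch capitols, write us-state-capitals.json
    except Exception as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```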
@dangotbanned (Member) commented Jan 23, 2025

@dsmedia I feel like I got carried away with this one in (10ae1d2). It started out as several comments - but I thought I'd give it a try for myself given the timezone difference

I'm happy for you to revert to (9249a47) as both produce identical outputs.

Note

(c8f3056) is still needed for cross-platform use to avoid (#653)

I left some notes in the commit description, which might be helpful if any of the changes seem strange to you.

Well done on finding the new source!
Seems there is quite a lot of flexibility in how you can parameterize the request

@dangotbanned dangotbanned removed their assignment Jan 23, 2025
@dsmedia (Collaborator, Author) commented Jan 23, 2025

> @dsmedia I feel like I got carried away with this one in (10ae1d2). It started out as several comments - but I thought I'd give it a try for myself given the timezone difference
>
> I'm happy for you to revert to (9249a47) as both produce identical outputs.
>
> Note
>
> (c8f3056) is still needed for cross-platform use to avoid (#653)
>
> I left some notes in the commit description, which might be helpful if any of the changes seem strange to you.
>
> Well done on finding the new source! Seems there is quite a lot of flexibility in how you can parameterize the request

Changes look great. Thank you, @dangotbanned

@dangotbanned (Member) commented

> Changes look great. Thank you, @dangotbanned

No problem @dsmedia
Really appreciate all the work you're putting into https://github.com/vega/vega-datasets

@dangotbanned dangotbanned merged commit dd43f29 into main Jan 24, 2025
2 checks passed
@dangotbanned dangotbanned deleted the add-script-statecapitals branch January 24, 2025 09:28
dsmedia added a commit that referenced this pull request Feb 2, 2025
* docs: add sources and license for 7zip resource

Update datasets.toml with missing source metadata for 7zip.png dataset

* chore: uvx taplo fmt

* docs: add sources and license for ffox.png

* docs: updates zipcodes.csv resource in datapackage_additions.toml

* update world-110m.json

* docs: updates us-10m.json

* docs: updates wheat.json
- adds citation to protovis in description
- fixes link to image in sources
- adds license

* docs: adds missing license data to several
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json

* docs: update metadata for co2-concentration.csv
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source

* docs: adds license to crimea.json metadata

* docs: update metadata for earthquakes.json
- expands description
- adds license

* docs: complete metadata for flights* datasets

- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms

* docs: updates london dataset metadata
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json

* docs: adds government and IPUMS license metadata to several
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when in doubt, a link is provided to the USA government works page, which explains the nuances of licensing for data on US government websites.

* docs: adds 'undetermined' licenses and sources

- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.json (updates description and source; adds license)
- disasters.csv (expands description, updates sources, add license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)

* docs: completes anscombe.json metadata
- updates description, adds sources and license

* docs: adds budgets.json metadata
- adds description, source and license
- makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined

* docs: adds basic metadata to flare*.json datasets
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description

* docs: completes flights-airport.csv metadata
- corrects description, adds source, license

* docs: update several file metadata entries
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)

* docs: adds us-state-capitals.json metadata
- related to #668

* docs: adds uniform-2d.json metadata

* docs: adds obesity.json metadata

* docs: remove points.json metadata
- dataset was removed from repo in #671

* docs: adds metadata for income.json
- relies on income.py script from #672

* docs: adds metadata for udistrict.json

* docs: adds, fixes metadata
- adds description, sources for sp500.csv
- fixes formatting for weekly-weather.json

* docs: updates datapackage
- uv run scripts/build_datapackage.py # doctest: +SKIP

* docs: begins to recast in PEP 257 style
- Partial fix for #663 (comment)
- edits descriptions through earthquakes.json

* docs: recasts all in PEP 257 format
- avoids 'this dataset' and similar
- reruns datapackage script (json, md)

* fix: corrects year in description of obesity.json
- new source found confirming the data shown is from 1995, not 2008, consistent with CDC data
- removes link to vega example that references wrong source year

* fix: Use correct heading level in `burtin.json`

Drive-by fix; it's really been bugging me that this breaks the flow of the navigation

* fix: remove extra space

Co-authored-by: Dan Redding <[email protected]>

* fix: remove extra space from source

Co-authored-by: Dan Redding <[email protected]>

* docs: add column schema to normal-2d.json metadata

Co-authored-by: Dan Redding <[email protected]>

* reformats revision note for monarchs.json

Co-authored-by: Dan Redding <[email protected]>

* fix: typo in monarchs.json metadata

* adjust markdown in anscombe.json

Co-authored-by: Dan Redding <[email protected]>

* adjust punctuation in anscombe.json

* adds column schema for budgets.json, penguins.json
- runs build_datapackage.py to verify

* docs: removes 'undetermined' source and license info
- source and license can be clarified in a future PR

* fix: correct lookup example url

* docs: moves gapminder clusters to schema

* update file markdown in flare.json metadata

Co-authored-by: Dan Redding <[email protected]>

* docs: adjust markdown in flare-dependencies.json metadata

Co-authored-by: Dan Redding <[email protected]>

* docs: reformats driving.json metadata

* fix formatting

* adjustments to schemas
- github.csv: move time range to schema
- add categories to schema in seattle-weather.csv
- sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
- reformat usgs disclaimer in us-state-capitals.json
- rerun build_datapackage.py

* remove duplication in udistrict description

* uvx run scripts

---------

Co-authored-by: Dan Redding <[email protected]>