feat: add generation script for us-state-capitals.json #668
Conversation
Slight adjustment in lon/lat, but these figures now have a traceable source.
Thanks for the ping @dsmedia. I suspected this might end up being needed:
# Assumes module-level imports: niquests, a configured logger,
# collections.abc.Mapping, and a _to_hash helper defined elsewhere in
# build_datapackage.py.
def request_sha(
    ref: str = "main", /, *, api_version: str = "2022-11-28"
) -> Mapping[str, str]:
    """
    Use `Get a tree`_ to retrieve a hash for each dataset.

    Parameters
    ----------
    ref
        The SHA1 value or ref (`branch`_ or `tag`_) name of the tree.
    api_version
        The `GitHub REST API version`_.

    Returns
    -------
    Mapping from `Resource.path`_ to `Resource.hash`_.

    .. _Get a tree:
        https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree
    .. _branch:
        https://github.com/vega/vega-datasets/branches
    .. _tag:
        https://github.com/vega/vega-datasets/tags
    .. _GitHub REST API version:
        https://docs.github.com/en/rest/about-the-rest-api/api-versions?apiVersion=2022-11-28
    .. _Resource.path:
        https://datapackage.org/standard/data-resource/#path-or-data
    .. _Resource.hash:
        https://datapackage.org/standard/data-resource/#hash
    """
    DATA = "data"
    TREES = "https://api.github.com/repos/vega/vega-datasets/git/trees"
    headers = {"X-GitHub-Api-Version": api_version}
    url = f"{TREES}/{ref}"
    msg = f"Retrieving sha values from {url!r}"
    logger.info(msg)
    with niquests.get(url, headers=headers) as resp:
        root = resp.json()
    # Find the sub-tree for the data/ directory, then map each file to its hash.
    query = (tree["url"] for tree in root["tree"] if tree["path"] == DATA)
    if data_url := next(query, None):
        with niquests.get(data_url, headers=headers) as resp:
            trees = resp.json()
        return {t["path"]: _to_hash(t["sha"]) for t in trees["tree"]}
    msg = f"Did not find a tree for {DATA!r} in response:\n{root!r}"
Locally, could you try changing this line?
vega-datasets/scripts/build_datapackage.py, line 618 in 9176bda:

    gh_sha1 = request_sha("main")
Using this instead should point to the changes on your branch:
gh_sha1 = request_sha("add-script-statecapitals")
If that works, then I can try parameterizing request_sha(ref=...)
with something like https://github.com/vega/altair/blob/aaca9bccc07e4c372077608621424c2b9925d574/tools/sync_website.py#L102-L103
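For illustration, a minimal sketch of what that parameterization could look like, assuming an argparse-style CLI; the flag name and wiring are hypothetical and not taken from the linked altair script:

```python
# Hypothetical sketch: expose the tree ref as a CLI flag so the build script
# can read dataset hashes from a feature branch instead of "main".
import argparse

parser = argparse.ArgumentParser(description="Build datapackage.json")
parser.add_argument(
    "--ref",
    default="main",
    help="Branch, tag, or SHA1 of the tree to read dataset hashes from.",
)
args = parser.parse_args()
gh_sha1 = request_sha(args.ref)  # request_sha as defined in the snippet above
```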
An alternative would be working with git directly, like:
cd data
$current_branch = git branch --show-current
git ls-tree $current_branch
I'm on `main` when I ran this. The same commands should give you a new hash for us-state-capitals.json on add-script-statecapitals.

Output:
100644 blob 6586d6c00887cd48850099c174a42bb1677ade0c 7zip.png
100644 blob 608ba6d51fa70584c3fa1d31eb94533302553838 airports.csv
100644 blob 719e73406cfc08f16dda651513ae1113edd75845 annual-precip.json
100644 blob 11ae97090b6263bdf0c8661156a44a5b782e0787 anscombe.json
100644 blob 8dc50de2509b6e197ce95c24c98f90d9d1ab138c barley.json
100644 blob 1b8b190c9bc02ef7bcbfe5a8a70f61b1616d3f6c birdstrikes.csv
100644 blob 5b18c08b28fb782f54ca98ce6a1dd220f269adf1 budget.json
100644 blob 8a909e24f698a3b0f6c637c30ec95e7e17df7ef6 budgets.json
100644 blob d8a82abaad7dba4f9cd8cee402ba3bf07e70d0e4 burtin.json
100644 blob 1d56d3fa6da01af9ece2d6397892fe5bb6f47c3d cars.json
100644 blob b8715cbd2a8d0c139020a73fdb4d231f8bde193a co2-concentration.csv
100644 blob 0070959b7f1a09475baa5099098240ae81026e72 countries.json
100644 blob d2df500c612051a21fe324237a465a62d5fe01b6 crimea.json
100755 blob 0584ed86190870b0089d9ea67c94f3dd3feb0ec8 disasters.csv
100644 blob 33d0afc57fb1005e69cd3e8a6c77a26670d91979 driving.json
100644 blob ed4c47436c09d5cc5f428c233fbd8074c0346fd0 earthquakes.json
100644 blob 0691709484a75e9d8ee55a22b1980d67d239c2c4 ffox.png
100644 blob 10bbe538daaa34014cd5173b331f7d3c10bfda49 flare-dependencies.json
100644 blob d232ea60f875de87a7d8fc414876e19356a98b6b flare.json
100644 blob 769a34f3d0442be8f356651463fe925ad8b3759d flights-10k.json
100644 blob 74f6b3cf8b779e3ff204be2f5a9762763d50a095 flights-200k.arrow
100644 blob 4722e02637cf5f38ad9ea5d1f48cae7872dce22d flights-200k.json
100644 blob 20c920b46db4f664bed3e1420b8348527cd7c41e flights-20k.json
100644 blob d9221dc7cd477209bf87e680be3c881d8fee53cd flights-2k.json
100644 blob 9c4e0b480a1a60954a7e5c6bcc43e1c91a73caaa flights-3m.parquet
100644 blob 8459fa09e3ba8197928b5dba0b9f5cc380629758 flights-5k.json
100644 blob 0ba03114891e97cfc3f83d9e3569259e7f07af7b flights-airport.csv
100644 blob d07898748997b9716ae699e9c2d5b91b4bb48a51 football.json
100644 blob abce37a932917085023a345b1a004396e9355ac3 gapminder-health-income.csv
100644 blob 8cb2f0fc23ce612e5f0c7bbe3dcac57f6764b7b3 gapminder.json
100644 blob cf0505dd72eb52558f6f71bd6f43663df4f2f82c gimp.png
100644 blob 18547064dd687c328ea2fb5023cae6417ca6f050 github.csv
100644 blob 01a4f05ed45ce939307dcd9bc4e75ed5cd1ab202 global-temp.csv
100644 blob ebfd02fd584009ee391bfc5d33972e4c94f507ab income.json
100644 blob 214238f23d7a57e3398f4e9f1e87e61abb23cafc iowa-electricity.csv
100644 blob 69d386f47305f4d8fd2886e805004fbdd71568e9 jobs.json
100644 blob 94ee8ad8198d2954f77e3a98268d8b1f7fe7d086 la-riots.csv
100644 blob d90805055ffdfe5163a7655c4847dc61df45f92b londonBoroughs.json
100644 blob 2e24c01140cfbcad5e1c859be6df4efebca2fbf5 londonCentroids.json
100644 blob 1b21ea5339320090b106082bd9d39a1055aadb18 londonTubeLines.json
100644 blob 741df36729a9d84d18ec42f23a386b53e7e3c428 lookup_groups.csv
100644 blob c79f69afb3ff81a0c8ddc01f5cf2f078e288457c lookup_people.csv
100644 blob a8b0faaa94c7425c49fe36ea1a93319430fec426 miserables.json
100644 blob 921dfa487a4198cfe78f743aa0aa87ad921642df monarchs.json
100644 blob e38178f99454568c5160fc759184a1a1471cc558 movies.json
100644 blob 4303306ec275209fcba008cbd3a5f29c9e612424 normal-2d.json
100644 blob 6da8129ed0b0333c88302e153824b06f7859aac9 obesity.json
100644 blob 9b3d93e8479d3ddeee29b5e22909132346ac0a3b ohlc.json
100644 blob 517b6d3267174b1b65691a37cbd59c1739155866 penguins.json
100644 blob 01df4411cb16bf758fe8ffa6529507419189edc2 platformer-terrain.json
100644 blob 4716a117308962f3596179d7d7d2ad729a19cda7 points.json
100644 blob 4aa2e19fa392cc9448aa8ffbdad15b014371f499 political-contributions.json
100644 blob 680fd336e777314198450721c31227a11f02411f population.json
100644 blob 3bad66ef911b93c641edc21f2034302348bffaf9 population_engineers_hurricanes.csv
100644 blob d55461adc9742bb061f6072b694aaf73e8b529db seattle-weather-hourly-normals.csv
100644 blob 0f38b53bdc1c42c5e5d484f33b9d4d7b229e0e59 seattle-weather.csv
100644 blob b82f20656d0521801db7c5599a6c990415a8aaff sp500-2000.csv
100644 blob 0eb287fb7c207f4ed392821d67a92267180fc8cf sp500.csv
100644 blob 58e2ce1bed01eeebe29f5b4be32344aaec5532c0 stocks.csv
100644 blob 65675107d81c19ffab260ac1f235f3e477fe8982 udistrict.json
100644 blob 4d769356c95c40a9807a7d048ab81aa56ae77df0 unemployment-across-industries.json
100644 blob d1aca19c4821fdc3b4270989661a1787d38588d0 unemployment.tsv
100644 blob c6120dd8887a0841a9fcc31e247463dbd3d0a996 uniform-2d.json
100644 blob ff7a7e679c46f2d1eb85cc92521b990f1a7a5c7a us-10m.json
100644 blob 8795be57cf1e004f4ecba44cab2b324a074330df us-employment.csv
100644 blob 9c3211c5058c899412c30f5992a77c54a1b80066 us-state-capitals.json
100644 blob 841151dbfbc5f6db3e19904557abd7a7aad0efd2 volcano.json
100644 blob 0e7e853f4c5b67615da261d5d343824a43510f50 weather.csv
100644 blob bd42a3e2403e7ccd6baaa89f93e7f0c164e0c185 weekly-weather.json
100644 blob cde46b43fc82f4c3c2a37ddcfe99fd5f4d8d8791 wheat.json
100644 blob ed686b0ba613abd59d09fcd946b5030a918b8154 windvectors.csv
100644 blob a1ce852de6f2713c94c0c284039506ca2d4f3dee world-110m.json
100644 blob d3df33e12be0d0544c95f1bd47005add4b7010be zipcodes.csv
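As a side note (not code from either PR), a small Python sketch of how the `git ls-tree` output above could be collected into the same kind of path-to-hash mapping that `request_sha` returns; the function name and the assumption of running from the repository root are illustrative:

```python
# Sketch only: mirror the `cd data` + `git ls-tree <branch>` commands above,
# producing a {filename: sha1} mapping. Assumes the repository root as cwd.
import subprocess


def local_tree_shas(ref: str = "main") -> dict[str, str]:
    out = subprocess.run(
        ["git", "ls-tree", ref],
        cwd="data",
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    shas: dict[str, str] = {}
    for line in out.splitlines():
        # Each line looks like: "<mode> blob <sha1>\t<filename>"
        meta, _, name = line.partition("\t")
        shas[name] = meta.split()[2]
    return shas
```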
Either way, I'll try and get a PR up today so we can merge that ahead of this PR 🙂
Thanks for this @dsmedia
I'll look into #668 (comment) some more today
Let me know if you need help with the caching stuff - or if you have any alternative ideas you wanna talk through 👍
Relies on the USGS public domain service and a new lookup file, us-state-codes.json.
Incorporates the latest build_datapackage.py changes from PR #669, which improve hash calculation using local git commands. `uvx ruff check` is run.
- Logical flow is `Feature` -> `CapitolFeature` -> `StateCapitol`
- Made `Feature` generic to promote reuse (e.g. #667); a rough sketch follows this list
- Ensure an exit code is produced when `get_state_capitols` fails
  - Previously would print to console, but wouldn't block a task runner/CI
- Move the territory filter into the query
  - Previously requested more than we wanted
- Added references to things I needed context for (new to spatial data)
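For context, a rough sketch of the generic `Feature` idea; the field and class names below are assumptions for illustration, not the actual definitions in the script:

```python
# Hypothetical sketch (not the script's actual classes): a generic Feature
# that other point datasets could reuse, with a capitol-specific subtype.
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

TProps = TypeVar("TProps")


@dataclass
class Feature(Generic[TProps]):
    """A minimal GeoJSON-like feature: a geometry plus typed properties."""

    geometry: dict[str, object]
    properties: TProps


@dataclass
class CapitolProps:
    state: str
    city: str


# `CapitolFeature` is then simply Feature[CapitolProps]; a `StateCapitol`
# record (lon, lat, name) could be derived from it downstream.
CapitolFeature = Feature[CapitolProps]
```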
@dsmedia I feel like I got carried away with this one in (10ae1d2). It started out as several comments, but I thought I'd give it a try myself given the timezone difference. I'm happy for you to revert back to (9249a47), as both produce identical outputs. I left some notes in the commit description, which might be helpful if any of the changes seem strange to you. Well done on finding the new source!
Discovered this via the file size increasing in 10ae1d2.
Changes look great. Thank you, @dangotbanned

No problem @dsmedia
* docs: add sources and license for 7zip resource
  - Update datasets.toml with missing source metadata for 7zip.png dataset
* chore: uvx taplo fmt
* docs: add sources and license for ffox.png
* docs: updates zipcodes.csv resource in datapackage_additions.toml
* update world-110m.json
* docs: updates us-10m.json
* docs: updates wheat.json
  - adds citation to protovis in description
  - fixes link to image in sources
  - adds license
* docs: adds missing license data to several
  - fixes bad link in annual-precip.json; adds license
  - adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
* docs: update metadata for co2-concentration.csv
  - expands description to explain units and seasonal adjustment
  - adds additional source directly to dataset csv
  - adds license details from source
* docs: adds license to crimea.json metadata
* docs: update metadata for earthquakes.json
  - expands description
  - adds license
* docs: complete metadata for flights* datasets
  - Document that data used in flights* datasets are collected under US DOT requirements
  - Add row counts to flight dataset descriptions (2k-3M rows)
  - Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
* docs: updates london dataset metadata
  - adds license for londonBoroughs.json
  - adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
  - expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
* docs: adds government and IPUMS license metadata to several
  - global-temp.csv
  - iowa-electricity.csv
  - jobs.json
  - monarchs.json
  - political-contributions.json (also updates link to FEC github); note that FEC provides an explicit underlying license
  - population_engineers_hurricanes.csv
  - seattle-weather-hourly-normals.csv
  - seattle-weather.csv
  - unemployment-across-industries.json
  - unemployment.tsv
  - us-employment.csv
  - weather.csv
  - Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA government works page, which explains the nuances of licenses for data on US government web sites.
* docs: adds 'undetermined' licenses and sources
  - adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
  - airports.csv (adds description, sources, license)
  - barley.csv (updates description and source; adds license)
  - disasters.csv (expands description, updates sources, adds license)
  - driving.json (adds description, updates source, adds license)
  - ohlc.json (modifies description, adds additional source, and license)
  - stocks.csv (adds source, license)
  - weekly-weather.json (adds source, license)
  - windvectors.csv (adds source, license)
* docs: completes anscombe.json metadata
  - updates description, adds sources and
* docs: adds budgets.json metadata
  - adds description, source and license
  - makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
* docs: adds basic metadata to flare*.json datasets
  - focuses on how data is used in edge bundling example
  - would benefit from additional detail in the description
* docs: completes flights-airport.csv metadata
  - corrects description, adds source, license
* docs: update several file metadata entries
  - ffox.png (updates license)
  - gapminder.json (adds license)
  - gimp.png (updates description, adds source, license)
  - github.csv (adds description, source, license)
  - lookup_groups.csv, lookup_people.csv (adds description, source, license)
  - miserables.json (adds description, source, license)
  - movies.json (adds source, license)
  - normal-2d.json (adds description, source, license)
  - stocks.csv (adds description)
* docs: adds us-state-capitals.json metadata
  - related to #668
* docs: adds uniform-2d.json metadata
* docs: adds obesity.json metadata
* docs: remove points.json metadata
  - dataset was removed from repo in #671
* docs: adds metadata for income.json
  - relies on income.py script from #672
* docs: adds metadata for udistrict.json
* docs: adds, fixes metadata
  - adds description, sources for sp500.csv
  - fixes formatting for weekly-weather.json
* docs: updates datapackage
  - uv run scripts/build_datapackage.py # doctest: +SKIP
* docs: begins to recast in PEP 257 style
  - Partial fix for #663 (comment)
  - edits descriptions through earthquakes.json
* docs: recasts all in PEP 257 format
  - avoids 'this dataset' and similar
  - reruns datapackage script (json, md)
* fix: corrects year in description of obesity.json
  - new source found confirming 1995, not 2008, data is shown, consistent with CDC data
  - removes link to vega example that references wrong source year
* fix: Use correct heading level in `burtin.json`
  - Drive-by fix, really been bugging me that this breaks the flow of the navigation
* fix: remove extra space (Co-authored-by: Dan Redding <[email protected]>)
* fix: remove extra space from source (Co-authored-by: Dan Redding <[email protected]>)
* docs: add column schema to normal-2d.json metadata (Co-authored-by: Dan Redding <[email protected]>)
* reformats revision note for monarchs.json (Co-authored-by: Dan Redding <[email protected]>)
* fix: typo in monarchs.json metadata
* adjust markdown in anscombe.json (Co-authored-by: Dan Redding <[email protected]>)
* adjust punctuation in anscombe.json
* adds column schema for budgets.json, penguins.json
  - runs build_datapackage.py to verify
* docs: removes 'undetermined' source and license info
  - source and license can be clarified in a future PR
* fix: correct lookup example url
* docs: moves gapminder clusters to schema
* update file markdown in flare.json metadata (Co-authored-by: Dan Redding <[email protected]>)
* docs: adjust markdown in flare-dependencies.json metadata (Co-authored-by: Dan Redding <[email protected]>)
* docs: reformats driving.json metadata
* fix formatting
* adjustments to schemas
  - github.csv: move time range to schema
  - add categories to schema in seattle-weather.csv
  - sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
  - reformat usgs disclaimer in us-state-capitals.json
  - rerun build_datapackage.py
* remove duplication in udistrict description
* uvx run scripts

Co-authored-by: Dan Redding <[email protected]>
Adds a generation script for the us-state-capitals.json dataset using USGS data. This prepares for the upcoming metadata improvements in "docs: Add missing descriptions, sources, and licenses" (#663), which will properly document the source (USGS) and license (Public Domain) for the geographic coordinates. The source/license was previously unspecified for this dataset.

Small differences between legacy (unsourced) and new scripted (USGS) geographic coordinates
@dangotbanned Looking at the diff, I noticed the hash of this dataset did not change in the datapackage file even after the JSON itself was changed. Just confirming this is expected.
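A quick way to check this locally, sketched below; it assumes the hash recorded in datapackage.json is the git blob SHA-1 (the same value shown by `git ls-tree` and `request_sha` above), which may not be the exact format build_datapackage.py writes:

```python
# Sketch: recompute the git blob SHA-1 of the regenerated file and compare it
# against the value recorded for us-state-capitals.json in datapackage.json.
import hashlib
from pathlib import Path


def git_blob_sha1(path: Path) -> str:
    data = path.read_bytes()
    # Git hashes blobs as: sha1(b"blob <size>\0" + content)
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


# Equivalent to: git hash-object data/us-state-capitals.json
print(git_blob_sha1(Path("data/us-state-capitals.json")))
```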