Support `slice_head()` and `group_by()` for occurrence downloads #258

mjwestgate · 2025-01-17T05:41:42Z

Our adoption of dplyr syntax opens up the possibility of more intuitive ways of calling the various APIs. It would also make the process more SQL-like, which would be more consistent with GBIF once we complete issue #249.

Two examples that spring to mind are:

Supporting `group_by()` for occurrences. This would generalise `atlas_species()` to accept any valid field as a facet, and in doing so would also resolve #195 (add rank argument to `atlas_species()`).

Example syntax:

galah_call() |>
  filter(year == 2024) |>
  group_by(class) |> # group_by(species) would be equivalent to atlas_species()
  collect()

Supporting `slice_head()` for occurrences. This would require enabling the occurrences/search API for actual occurrence calls (it is currently only used for counts), and would require setting limits (e.g. the ALA only allows 500 records, from memory). This would let people preview data before starting a large download.

Example syntax:

galah_call() |>
  filter(year == 2024) |>
  slice_head(n = 5) |>
  collect()

nickdos · 2025-01-20T01:40:54Z

@mjwestgate and I were discussing this issue in Slack and I suggested the Pipelines Spark interface might be a way of handling these requirements. This blog article provides some insights on how it works. GBIF experimental SQL downloads might be relevant too.

mjwestgate · 2025-01-20T02:41:12Z

Just flagging that supporting SQL downloads would not only solve this problem, but would also be consistent with our intention to move over to SQL for GBIF downloads in future (#249). Some discussion of options within the current API schema happened in biocache-service issue 829 in late 2023 but without deciding on an outcome.

This basically generalises `atlas_species()` to support any facet, but only if you use `dplyr` syntax. This was deemed to make more sense than adding a `rank` argument to `atlas_species()`.

mjwestgate added the enhancement New feature or request label Jan 17, 2025
