Support slice_head() and group_by() for occurrence downloads #258

Open
mjwestgate opened this issue Jan 17, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@mjwestgate
Collaborator

Our adoption of dplyr syntax opens up the possibility of more intuitive ways of calling the various APIs. It would also make the process more SQL-like, which would be more consistent with GBIF once we complete issue #249.

Two examples that spring to mind are:

  1. supporting group_by() for occurrences. This would generalise atlas_species() to accept any valid field as a facet, and in doing so would also resolve issue #195 (add rank argument to atlas_species()).

Example syntax:

galah_call() |>
  filter(year == 2024) |>
  group_by(class) |> # grouping by species here would be equivalent to atlas_species()
  collect()

  2. supporting slice_head() for occurrences. This would require enabling the occurrences/search API for actual occurrence calls (it is currently only used for counts), and would require setting limits (from memory, ALA only allows 500 records, for example). This would let people preview data before committing to a large download.

Example syntax:

galah_call() |>
  filter(year == 2024) |>
  slice_head(n = 5) |>
  collect()

mjwestgate added the enhancement label Jan 17, 2025

@nickdos
Contributor

nickdos commented Jan 20, 2025

@mjwestgate and I were discussing this issue in Slack and I suggested the Pipelines Spark interface might be a way of handling these requirements. This blog article provides some insights on how it works. GBIF experimental SQL downloads might be relevant too.

@mjwestgate
Collaborator Author

Just flagging that supporting SQL downloads would not only solve this problem, but would also be consistent with our intention to move to SQL for GBIF downloads in future (#249). Options within the current API schema were discussed in biocache-service issue 829 in late 2023, but no outcome was decided.

mjwestgate added a commit that referenced this issue Feb 3, 2025
This basically generalises `atlas_species()` to support any facet, but only if you use `dplyr` syntax. This was deemed to make more sense than adding a `rank` argument to `atlas_species()`