Policy on intake for larger data assets #14

Closed
jlstevens opened this issue May 21, 2019 · 5 comments

The opensky notebook is an example of a topic that requires a large data file (opensky.parq). I would like to suggest using intake in such cases instead of the current approach, which will need updating anyway (right now it relies on datashader's system for fetching data files).
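Roughly, the idea would be something like the sketch below; the catalog path and entry name are just placeholders, not anything that exists in the repo yet:

```python
# Minimal sketch, assuming a hypothetical catalog.yml that defines an
# "opensky" entry pointing at opensky.parq via the intake-parquet driver.
import intake

cat = intake.open_catalog("catalog.yml")  # hypothetical catalog file
flights = cat.opensky.to_dask()           # lazy: data is read only when computed
# or: flights = cat.opensky.read()        # eager load into pandas
```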

jbednar (Contributor) commented May 21, 2019

I think using intake is a good default, but we should probably have a couple of examples that show other ways of getting data files.

jsignell (Contributor) commented May 22, 2019

I already have an approach in place; see #13 for instance. All the project writer needs to do is add a directory under the test_data directory corresponding to the project.

jbednar (Contributor) commented May 22, 2019

We should make some of the examples use Intake for its own sake, but I think the anaconda-project.yml handles the typical case.
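For the typical case, something along these lines; the download key, URL, and fallback path are illustrative only, relying on anaconda-project exposing each declared download as an environment variable that holds the local path:

```python
# Sketch of the anaconda-project route. A hypothetical anaconda-project.yml
# downloads stanza might look like:
#
#   downloads:
#     OPENSKY_DATA:
#       url: https://example.com/opensky.parq
#       description: OpenSky flight traces
#
# anaconda-project exposes the download key as an environment variable
# containing the local file path, which the notebook can read directly.
import os
import dask.dataframe as dd

data_path = os.environ.get("OPENSKY_DATA", "data/opensky.parq")  # fallback path is an assumption
flights = dd.read_parquet(data_path)
```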

jsignell (Contributor) commented

Just to circle back: I tried to use intake for the 1-billion-point OSM case in #20 and ended up bailing, because the download often takes longer than the 10 minutes before a notebook cell times out. I decided that for really big downloads (roughly >3 GB) we probably want to just tell the user to download the file rather than handling it behind the scenes. I have set up the infrastructure to use intake for these things generally (#22). The benefit of intake over anaconda-project download is that the download only happens at the moment it is needed; this ends up being somewhat annoying in AE, because the deployment doesn't download the data at deployment time.
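For the really big files, the notebook would just check for the data and point the user at the download rather than fetching it, roughly like this (the local path and URL are placeholders):

```python
# Sketch of the "ask the user to download it" approach for very large files
# (~>3 GB); the path and URL below are placeholders.
from pathlib import Path

DATA_PATH = Path("data/osm-1billion.parq")
DATA_URL = "https://example.com/osm-1billion.parq"  # placeholder

if not DATA_PATH.exists():
    raise RuntimeError(
        f"Large dataset not found at {DATA_PATH}. Please download it from "
        f"{DATA_URL} before running this notebook."
    )
```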

ppwadhwa commented Oct 5, 2020

In the end, we are mostly using anaconda-project's data loading support, and we separately have some intake examples. So far, this has been working well.

ppwadhwa closed this as completed Oct 5, 2020