Dataflow job: https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-28_20_29_34-3856017896282669695;logsSeverity=INFO?project=leap-pangeo&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))&authuser=1&supportedpurview=project

We really need a less expensive way to cache data. The job above wasted 12 DCU just to find out that one of the files wasn't available.

If we can, we should restrict the number of workers and download within threads on a single worker (see pangeo-forge/pangeo-forge-recipes#713). Scaling only seems to be efficient when downloads are fast.
My wish here would be for a stage that does the following:
- If possible, check all URLs first and fail fast if any of them is unavailable (see the first sketch below).
- Absolute 💎 bonus feature: if I could pass multiple lists of URLs, determine which list is available at runtime, and maybe even pick the fastest connection based on an initial ping (second sketch below)... this might be too complicated.
- Have an upper limit on how many connections can be established to a given server (I believe this is partially implemented as the `max_concurrency` argument of `OpenURLWithFSSpec`).
- Use as few workers as possible to download several files in parallel, or download parts of large files in parallel using threads on the workers (see the last sketch below).
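A minimal sketch of the fail-fast URL check, assuming plain HTTP(S) sources and using fsspec's HTTP filesystem; the function name `check_urls_exist` is illustrative, not an existing pangeo-forge-recipes stage:

```python
import fsspec


def check_urls_exist(urls):
    """Fail fast: raise before any expensive Dataflow work starts
    if one of the source files is unavailable.

    These are metadata-only checks; no file contents are downloaded.
    """
    fs = fsspec.filesystem("https")
    missing = [url for url in urls if not fs.exists(url)]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} of {len(urls)} URLs unavailable, e.g. {missing[0]}"
        )
```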
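And a sketch of the "pick the fastest mirror" bonus feature, timing a single metadata request per candidate list; this is purely illustrative and probably too naive for production (one probe per mirror, no retries or averaging):

```python
import time

import fsspec


def pick_fastest_mirror(url_lists):
    """Given several candidate URL lists (mirrors of the same data),
    return the list whose first URL answers a metadata request fastest."""
    fs = fsspec.filesystem("https")

    def latency(url):
        start = time.perf_counter()
        try:
            fs.info(url)  # HEAD-like request; no payload transferred
        except Exception:
            return float("inf")  # unreachable mirrors always lose
        return time.perf_counter() - start

    return min(url_lists, key=lambda urls: latency(urls[0]))
```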
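Finally, a sketch of the "few workers, many threads" download pattern, with a per-host connection cap similar in spirit to `max_concurrency`; all names here are illustrative and the cache-path handling is deliberately simplistic:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore
from urllib.parse import urlparse

import fsspec

# Illustrative cap: at most 4 simultaneous connections per remote host,
# regardless of how many download threads are running.
_PER_HOST = {}


def _host_semaphore(url, limit=4):
    host = urlparse(url).netloc
    return _PER_HOST.setdefault(host, Semaphore(limit))


def cache_one(url, cache_dir):
    """Stream a single URL into the cache, respecting the per-host cap."""
    dest = f"{cache_dir}/{urlparse(url).path.rsplit('/', 1)[-1]}"
    with _host_semaphore(url):
        with fsspec.open(url, "rb") as src, fsspec.open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
    return dest


def cache_all(urls, cache_dir, threads=16):
    """Download many files in parallel on a single worker using threads,
    instead of scaling out to many (expensive) Dataflow workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: cache_one(u, cache_dir), urls))
```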