Nothing super specific here, but wanted to brain dump and get a broader discussion going.
As part of my CMIP work, my recipes often download many files from sometimes slow servers. This takes a very long time and frequently scales up to many workers, which increases cost.
Looking at the Dataflow resource metrics, it seems like one worker is spun up per file. There is an initial spike in CPU usage, but after that the worker mostly sits idle.
Can we maybe modify the level of concurrency here and have one worker download/cache multiple files via threads to improve performance and/or save costs?
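The threaded-download idea above could look roughly like the sketch below. This is only an illustration, not the recipes' actual caching code: `fetch` and `fetch_all` are hypothetical names, and a real recipe would replace the stub body of `fetch` with its fsspec-based open/cache step. The point is just that one worker process can overlap many slow network transfers with a thread pool instead of leaving one file per worker.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch(url: str) -> bytes:
    # Stand-in for the real download/cache step; in a recipe this would be
    # something like opening the URL with fsspec and writing to the cache.
    time.sleep(0.01)  # simulate a slow remote server
    return f"contents of {url}".encode()


def fetch_all(urls, max_workers=8):
    # One worker process handles many files: the threads spend most of
    # their time waiting on the network, so the otherwise-idle CPU can
    # drive several transfers at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"https://example.com/cmip/file_{i}.nc" for i in range(16)]
results = fetch_all(urls)
```

With 8 threads and 16 URLs the simulated downloads complete in roughly two "round trips" instead of sixteen; tuning `max_workers` per worker would be the knob for trading cost against throughput.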
Perhaps something to chat about on Thu @ranchodeluxe @moradology ?