Hi @Scusemua - I'm raising this because I couldn't find your email anywhere!
@tomwhite and I have been working on Cubed - which is extremely similar to Wukong in its goals.
If I understand correctly, Wukong aims to create a general-purpose serverless DAG execution framework for data science workloads by building on top of dask.distributed.
Cubed also aims to be a serverless DAG execution framework for data science workloads inspired by Dask, but it restricts the problem domain to NumPy-like array computations and does not use Dask directly (it only borrows some of its API/abstractions). Cubed also uses the cloud-native array storage format Zarr to store state (intermediate arrays) between operations.
Both projects cite PyWren as an inspiration explicitly.
Some of the problems you mention with PyWren are solved by Cubed's approach. In particular, taking the ones listed on this slide of your talk on Wukong: rapid scaling is handled by serverless frameworks like Lithops, excessive data movement is handled by writing to Zarr, and per-function resource limitations are not an issue because each function only needs to process a single chunk.
Of possible interest to you:
Talk I'm giving tomorrow afternoon on Cubed
Blog post comparing Cubed and dask.array on a Climate Science Workload
Pangeo distributed arrays working group, and collection of use cases