Cubed vs dask.array: Convergent evolution? #570

Open
TomNicholas opened this issue Sep 6, 2024 · 1 comment
TomNicholas commented Sep 6, 2024

For a while I've been thinking that Cubed and dask.array are potentially on a path of convergent evolution to a very similar design. After talking to @phofl at length at the NumFOCUS summit, I feel more confident in this - some explicit similarities below.

Dask is becoming more similar to Cubed:

  • Dask Expressions for arrays are basically a cubed.Plan. Both are intended to allow high-level optimizations (at the ~array-API level) before anything is materialized, and to avoid materializing each step until execution. The latter is meant to solve the scaling problem of a scheduler dealing with very large numbers of chunks (because the graph size then scales with the number of arrays but not with the number of chunks). There has been a lot of recent work on expressions for dask dataframes, but expressions for dask.array haven't been implemented yet.
  • Dask P2P rechunking was introduced to deal with the pathological all-to-all rechunk case, and as I understand it is now the default. This is a response to the same complaints of unbounded, unpredictable memory usage that motivated cubed. P2P and the Rechunker algorithm are similar in that both spill to disk in order to limit RAM usage. (@phofl tells me that many of these complaints about dask were in fact caused by scheduling bugs rather than design issues though.)
  • Task annotations (or some similar feature) in dask that track memory usage have been suggested before, but they would effectively replicate what cubed.Spec lets you track by design (see the sketch after this list). (Interestingly @phofl said that tracking memory usage per-task is mostly useful for arrays rather than for dataframes, because dataframe partitions aren't of known length in general, and weird datatypes in dataframes make it hard to estimate memory usage.) Again this is not yet implemented in dask AFAIK.
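
For concreteness, a minimal sketch of the two knobs being compared here - dask's opt-in P2P rechunking config versus cubed's up-front memory budget in the Spec. I'm assuming the `array.rechunk.method` config key and the `allowed_mem`/`chunks`/`spec` keyword arguments; exact names may differ between versions.

```python
import dask
import cubed
import cubed.array_api as xp

# Dask: opt in to P2P rechunking (needs the distributed scheduler).
# Memory stays bounded by spilling, but there is no explicit budget to declare.
dask.config.set({"array.rechunk.method": "p2p"})

# Cubed: the memory budget is part of the Spec, so every task in the plan
# is checked against allowed_mem up front rather than tracked after the fact.
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")
a = xp.ones((10_000, 10_000), chunks=(1_000, 1_000), spec=spec)
```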

Cubed could become more similar to Dask:

  • We want to run cubed efficiently on HPC (HPC executor #467) and on a single machine (Cubed for larger-than-memory workloads on a single machine #492), like dask can. Cubed can currently run on a single machine, and there is recent work to make it run on HPC via Dragon (Example using "dragon" start method #555).
  • Outside of a cloud serverless context, cubed's task-based rechunk-via-disk algorithm is very often inefficient (see the sketch after this list). Optimising rechunks by not using the disk becomes possible on HPC or a single machine. Doing this in memory (In-memory rechunk #502) would be more similar to how dask behaves when using threads or a LocalCluster (I believe). This hasn't been tried yet.
  • On a cluster you could also use communication instead of writing to disk to perform the rechunk. If Cubed used Dragon's distributed dict to back its zarr rechunk, without writing to disk, that would effectively be a rechunk via communication (with Dragon as the communication layer), and would be more similar to dask. More generally, every time cubed avoids the worst-case on-disk rechunker algorithm, it's an optimization that makes it behave more like dask. This distributed dict idea also hasn't been tried yet.

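As a concrete illustration of the current single-machine behaviour, here is a rough sketch of the on-disk rechunk path that the ideas above would avoid. I'm assuming `cubed.Array.rechunk` and the default local executor; details may differ by version.

```python
import cubed
import cubed.array_api as xp

# The Spec's allowed_mem bounds every task; the work_dir holds the
# intermediate zarr store that the rechunk currently writes to disk.
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")

a = xp.ones((20_000, 20_000), chunks=(10_000, 100), spec=spec)
b = a.rechunk((100, 10_000))  # pathological all-to-all rechunk, staged via zarr on disk
b.compute()                   # runs on the default local executor

# Swapping the on-disk intermediate store for an in-memory or Dragon-backed
# one is exactly the optimization being discussed above.
```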
Cubed-on-dask

  • If you took a Cubed plan and taught Cubed's Dask executor that you actually wanted to use dask's P2P rechunking algorithm to execute the rechunk steps, then cubed-on-dask would become rather similar to dask-on-dask...

Remaining differences:

  • Project scope: cubed is just arrays, whereas dask handles arbitrary DAGs.
  • Level of technical debt / maturity - dask has more technical debt but is also much more mature.
  • Flexibility of executors - because cubed breaks everything down into embarrassingly parallel map/rechunk steps, you can imagine writing a cubed executor to run on almost anything (it just might be inefficient); see the toy sketch after this list. Whilst it's possible to run dask on other backends (e.g. dask-on-ray), I don't think it's so simple to make dask run arbitrary graphs on a new executor.
  • P2P rechunking doesn't let you specify a memory budget; it just means memory usage will be constant (correct me if that's wrong).
  • Dask has no way to deploy truly serverlessly like cubed can, because it assumes the ability to communicate between workers and with the workers.
  • Environments / containers / propagating dependencies to workers is handled quite differently.
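
To make the executor-flexibility point concrete: because every stage of a plan is an independent map over tasks, the whole "executor contract" can be as small as a map primitive. This is a toy sketch of the idea only, not cubed's actual executor interface.

```python
from concurrent.futures import ProcessPoolExecutor

def run_stage(task_fn, task_inputs):
    """Run one embarrassingly-parallel stage: no inter-worker communication
    or shared scheduler state is needed, just a way to map a function over
    independent inputs (a process pool, a serverless function service, an
    HPC job array, a Ray remote function, ...)."""
    with ProcessPoolExecutor() as pool:
        list(pool.map(task_fn, task_inputs))
```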

I don't know if there is an action item here but I wanted to raise it for discussion.

tomwhite commented Sep 9, 2024

Thanks @TomNicholas, this is very interesting!
