-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diversity calculations on windowed datasets generate lots of unmanaged memory #1278
Comments
Very interesting, thanks @percyfal! Can you give more details about how you got these numbers please? |
Basically by monitoring the task manager. I will be running lots of these calculations in the coming days so I'll make sure to include a screenshot. On the low-memory nodes (256GB RAM), the worker memory bars very quickly fill up with light blue (unmanaged) and orange (spillover). |
Unfortunately the task graph is too large to display so I couldn't easily tell what the scheduler is holding on to (@benjeffery suggested I look at this). |
I see, this is coming from Dask. Unfortunately our experience has been that it's quite hard to keep a handle on Dask's memory usage - this is one of the key motivations for cubed, which we plan to support in sgkit (#908). @tomwhite - are the popgen calculations something that will run on Cubed currently? |
Not yet unfortunately, since the windowing introduces variable-sized chunks, which Cubed can't currently handle. Having said that, I'd like to look at getting popgen methods working on Cubed in the new year. I'm wondering if the xarray groupby work (using flox) would work here, since that is something that Cubed does support (see cubed-dev/cubed#476). @percyfal what windowing method are you using? It would be useful to have an example to target if it's possible to share the data (or even just the windows) easily. |
I'm currently using |
It would be useful to have the precise call to |
Ok, I'll compile the data and let you know where you can get hold of it. |
When working on a reasonably large dataset (7TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate lots of unmanaged memory. The
call_genotype
data portion is 280GiB stored, 7.2TiB actual size. There are some 3 billion sites and 1000 samples. At the end of a run, there is almost 300GiB unmanaged memory, which rules out the use of 256 and possibly 512GB memory nodes (I've been testing dask LocalCluster). Maybe this is more an issue with dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.The text was updated successfully, but these errors were encountered: