Parallelize distance calculation #142

Open
gordonkoehn opened this issue Aug 6, 2023 · 3 comments

@gordonkoehn
Collaborator

The runtime of distance calculation may exceed the actual MCMC runtime.

For the MP3 distance with 30 mutations in the trees, calculating the distances takes more than twice as long as running the actual MCMC chain.

This calls for parallelizing the distance calculation, i.e. computing the distances in chunks using multiprocessing in Python.

In particular, the function to be parallelized is:

yg.analyze.analyze_mcmc_run(mcmc_data, metric, base_tree)

Along with this, the number of threads should be adjusted in the corresponding Snakemake rules.
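
A minimal sketch of the chunked approach, assuming that yg.analyze.analyze_mcmc_run can be applied independently to slices of the MCMC samples and that the per-chunk results can simply be concatenated — both assumptions about the actual API (the wrapper name and chunking are hypothetical):

```python
from multiprocessing import Pool

import yg  # placeholder: however the package exposing yg.analyze is imported in the project


def _analyze_chunk(args):
    """Thin wrapper so that Pool.map can pickle a single-argument call."""
    chunk, metric, base_tree = args
    return yg.analyze.analyze_mcmc_run(chunk, metric, base_tree)


def analyze_mcmc_run_parallel(mcmc_data, metric, base_tree, n_chunks=4, processes=4):
    """Compute the distances in chunks using a process pool.

    Assumes `mcmc_data` can be sliced into independent chunks and that the
    per-chunk results can be concatenated -- both assumptions about the
    actual return type of analyze_mcmc_run.
    """
    chunk_size = max(1, len(mcmc_data) // n_chunks)
    chunks = [mcmc_data[i : i + chunk_size] for i in range(0, len(mcmc_data), chunk_size)]
    with Pool(processes=processes) as pool:
        per_chunk = pool.map(_analyze_chunk, [(c, metric, base_tree) for c in chunks])
    # Flatten the per-chunk results into a single list (placeholder for the real return type).
    return [result for chunk_results in per_chunk for result in chunk_results]
```

The `threads:` directive of the corresponding Snakemake rule would then need to match the number of worker processes passed to the pool.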

@gordonkoehn
Copy link
Collaborator Author

@pawel-czyz FYI - I just realized that the distance calculation takes quite a lot of time in our experiments.

@pawel-czyz
Member

This is a good point! (But I'm not sure if it should get a higher priority than SMC).

In terms of implementation, let's discuss, but to write down some ideas before I forget:

  1. Indeed, Snakemake can be a good idea here.
  2. Alternatively, joblib provides a very nice API that allows wrapping a function into a parallelized one (simpler than using multiprocessing.Pool); see the sketch below.
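
A sketch of the joblib variant, again assuming analyze_mcmc_run can be applied to independent slices of the MCMC samples (the chunking and the helper name are hypothetical):

```python
from joblib import Parallel, delayed

import yg  # placeholder: however the package exposing yg.analyze is imported in the project


def analyze_chunks_joblib(chunks, metric, base_tree, n_jobs=4):
    """Run yg.analyze.analyze_mcmc_run over independent chunks in parallel.

    `chunks` is assumed to be a list of independent slices of the MCMC
    samples (hypothetical splitting, as in the issue description above).
    """
    return Parallel(n_jobs=n_jobs)(
        delayed(yg.analyze.analyze_mcmc_run)(chunk, metric, base_tree)
        for chunk in chunks
    )
```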

@gordonkoehn
Collaborator Author

Yes, I was thinking of the second one before; I used it a lot in my last project.

Absolutely, the priority is low.
