Parallelize distance calculation #142

Open
gordonkoehn opened this issue Aug 6, 2023 · 3 comments

@gordonkoehn
Collaborator

The runtime of distance calculation may exceed the actual MCMC runtime.

For the MP3 distance with 30 mutations in the trees, calculating the distances takes more than twice as long as running the actual MCMC chain.

This calls for parallelizing the distance calculation, i.e. computing the distances in chunks using multiprocessing in Python.

In particular, the function to be parallelized is:

yg.analyze.analyze_mcmc_run(mcmc_data, metric, base_tree)

Along with this, the number of threads should be adjusted in the corresponding Snakemake rules.
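
A minimal sketch of the chunked approach, assuming that yg.analyze.analyze_mcmc_run can be applied independently to slices of the MCMC samples and that the per-chunk results can simply be concatenated — both assumptions about the actual API (the wrapper name and chunking are hypothetical):

```python
from multiprocessing import Pool

import yg  # placeholder: however the package exposing yg.analyze is imported in the project


def _analyze_chunk(args):
    """Thin wrapper so that Pool.map can pickle a single-argument call."""
    chunk, metric, base_tree = args
    return yg.analyze.analyze_mcmc_run(chunk, metric, base_tree)


def analyze_mcmc_run_parallel(mcmc_data, metric, base_tree, n_chunks=4, processes=4):
    """Compute the distances in chunks using a process pool.

    Assumes `mcmc_data` can be sliced into independent chunks and that the
    per-chunk results can be concatenated -- both assumptions about the
    actual return type of analyze_mcmc_run.
    """
    chunk_size = max(1, len(mcmc_data) // n_chunks)
    chunks = [mcmc_data[i : i + chunk_size] for i in range(0, len(mcmc_data), chunk_size)]
    with Pool(processes=processes) as pool:
        per_chunk = pool.map(_analyze_chunk, [(c, metric, base_tree) for c in chunks])
    # Flatten the per-chunk results into a single list (placeholder for the real return type).
    return [result for chunk_results in per_chunk for result in chunk_results]
```

The `threads:` directive of the corresponding Snakemake rule would then need to match the number of worker processes passed to the pool.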

@gordonkoehn
Copy link
Collaborator Author

@pawel-czyz FYI - I just realized that the distance calculation takes quite a lot of time in our experiments.

@pawel-czyz
Member

This is a good point! (But I'm not sure if it should get a higher priority than SMC).

In terms of implementation, let's discuss, but to write down some ideas before I forget:

  1. Indeed, Snakemake can be a good idea here.
  2. Alternatively, joblib provides a very nice API that allows wrapping a function into a parallelized one (simpler than using multiprocessing.Pool); see the sketch below.
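
A sketch of the joblib variant, again assuming analyze_mcmc_run can be applied to independent slices of the MCMC samples (the chunking and the helper name are hypothetical):

```python
from joblib import Parallel, delayed

import yg  # placeholder: however the package exposing yg.analyze is imported in the project


def analyze_chunks_joblib(chunks, metric, base_tree, n_jobs=4):
    """Run yg.analyze.analyze_mcmc_run over independent chunks in parallel.

    `chunks` is assumed to be a list of independent slices of the MCMC
    samples (hypothetical splitting, as in the issue description above).
    """
    return Parallel(n_jobs=n_jobs)(
        delayed(yg.analyze.analyze_mcmc_run)(chunk, metric, base_tree)
        for chunk in chunks
    )
```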

@gordonkoehn
Collaborator Author

Yes, I was thinking of the second one before; I used it a lot in my last project.

Absolutely, the priority is low.
