Multicore (in)efficiency #22
Comments
Hi, regarding the parallelization: I think QUILT and STITCH use basic mclapply for parallelization, which forks multiple processes, with each core handling several samples. If the number of samples (60) is not a multiple of nCores (8), then the core with more samples on it will be the bottleneck. However, regarding the screenshot you showed, I agree it is unexpected.
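As a rough illustration of that point (this sketch is ours, not QUILT code): with mclapply's default prescheduling, the tasks are divided into mc.cores chunks before forking, so 60 samples over 8 cores leaves some workers with 8 samples and others with 7, and the largest chunk bounds the total runtime.

```r
# Illustrative only -- not QUILT code. With mc.preschedule = TRUE (the
# default), mclapply divides the 60 tasks into 8 chunks before forking,
# so some workers get ceiling(60/8) = 8 tasks and others get 7; the
# busiest worker bounds the total runtime.
library(parallel)

n_samples <- 60
n_cores   <- 8

res <- mclapply(seq_len(n_samples), function(i) {
  Sys.sleep(0.1)  # stand-in for imputing one sample
  i
}, mc.cores = n_cores)
```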
Thanks @Zilong-Li for your swift reaction to my post. We will repeat the tests, keeping in mind that the sample size must be a multiple of nCores. In case you'd like to reproduce our results, please let us know.
Hi, minor point first: the argument is nCores, not ncore. This is an interesting one. QUILT (and STITCH) indeed use pretty straightforward mclapply for parallelization. So, for instance, you could test things out more generally by trying the following from the command line, varying the number of cores:
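The snippet itself does not appear above; as a rough stand-in with the same intent (timing a bare mclapply call while varying the core count), one might save something like the following and run it via Rscript while watching htop. The script name and the busy-loop workload are illustrative, not from the thread.

```r
# test_mclapply.R -- a stand-in for the snippet referenced above; run as
#   Rscript test_mclapply.R 8
# while watching htop, to see whether the forked workers spread across
# physical cores. The workload is an arbitrary CPU burner.
library(parallel)

n_cores <- as.integer(commandArgs(trailingOnly = TRUE)[1])

print(system.time(
  mclapply(seq_len(n_cores), function(i) {
    s <- 0
    for (j in seq_len(5e7)) s <- s + sqrt(j)
    s
  }, mc.cores = n_cores)
))
```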
Hopefully this should replicate behaviour like what you are seeing. If you figure out additional arguments to mclapply that change this, please let us know. As a more minor point, QUILT imputes each sample independently (unlike STITCH), so in general on an HPC I would recommend splitting samples into small batches and running each batch with nCores=1. Best,
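As a sketch of that batching idea (illustrative only; the file names, the batch size, and the assumption that samples are listed in a bamlist are ours, not confirmed QUILT usage):

```r
# Illustrative batching sketch -- file names and batch size are assumptions.
# Split one list of BAM files into batches of 5 and write one list per
# batch; each batch can then be submitted as its own single-core QUILT job.
bams    <- readLines("all_samples.bamlist")
batches <- split(bams, ceiling(seq_along(bams) / 5))

for (b in seq_along(batches)) {
  writeLines(batches[[b]], sprintf("batch_%02d.bamlist", b))
}
```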
Thanks @rwdavies for your comments. With that, I guess we can close this thread, unless someone else still has a remark.
I don't think the reference you're citing is particularly definitive, but I'll admit I'm having trouble finding something exhaustive. It usually works fine in my hands, e.g. setting nCores to values greater than 1.
Dear developers,

I am an HPC sysadmin, and together with a QUILT user, we are observing strange multi-core behavior of the software on a cluster with 128-core AMD nodes.

There, QUILT is installed and launched from a conda environment. We block a full node with about 120 GB of memory and try to impute 60 samples using different numbers of cores, controlled by the ncore= command-line argument. Apparently (please correct me if I am mistaken), QUILT spawns multiple processes (not threads), so I expected that each process would be pinned to one physical core of the system (hyperthreading is disabled on the target machine).
What we observe on the compute node is that all "active" processes are packed onto the first physical core of the node, with inefficient CPU activity/usage. As a result, the runtime of the test example does not scale with the number of cores used, i.e. with ncore. The attached screenshot shows the output of the htop command. We would like to know whether this is standard behavior, or whether there are hooks in launching the job that we are not fully aware of.
Any feedback is welcome.
Kind regards
Ehsan & @SaraBecelaere