WISH: Please respect CGroups CPU limits for parallelization, or default to single-core processing #393

Open
HenrikBengtsson opened this issue Nov 14, 2024 · 0 comments


CellBender lets pytorch use all CPU cores on the host by default, per https://cellbender.readthedocs.io/en/latest/reference/index.html:

--cpu-threads

    Number of threads to use when pytorch is run on CPU. Defaults to the number of logical cores.

Where I'm coming from: I think any software tool that may run in a multi-user environment should default to single-threaded/single-process processing. Defaulting to all CPU cores on the host wreaks havoc when other users or processes are running at the same time. So, I argue that single-core processing is the safest default. If a user has access to more CPU resources, they can actively request them.

If a software tool really wants to run in parallel by default, I argue it should do so by respecting what the system has allotted to the process. CGroups CPU limits are one such allocation mechanism, and they are becoming more and more common; you see them in HPC environments, in cloud services, etc. If CGroups limits each user/session/process to, say, four cores, and the machine has 192 cores, the current behavior of CellBender causes it to run 192 threads that compete for 4 cores, resulting in heavy context switching and a major slowdown. Many users are unaware that this is happening, so they are already suffering from it, and may conclude that CellBender is slower than it actually is.

I also think the default should be capped at a fixed maximum number of CPU cores. As hosts get more and more cores, with 192- and 256-core machines now available, defaulting to the number of hardware CPU cores risks launching far more threads than is useful, resulting in a slowdown rather than a speed-up.
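
To make the request concrete, here is a rough Python sketch of how an allocation-aware default could be computed. This is not CellBender's code. The function names, the cap of 8, and the choice to round a fractional quota up are my own assumptions for illustration. It also only looks at the usual cgroup mount points (`/sys/fs/cgroup/cpu.max` for cgroups v2, `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `cpu.cfs_period_us` for v1); a real implementation would probably need to resolve the process's own cgroup via `/proc/self/cgroup` and check ancestor cgroups too.

```python
import math
import os
from pathlib import Path


def parse_cgroup_v2_cpu_max(text):
    """Parse the contents of a cgroups v2 'cpu.max' file.

    The file holds '<quota> <period>' or 'max <period>'.
    Returns the CPU limit as a float, or None if unlimited/unparsable.
    """
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None
    quota, period = int(fields[0]), int(fields[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def parse_cgroup_v1_quota(quota_text, period_text):
    """Parse cgroups v1 'cpu.cfs_quota_us' and 'cpu.cfs_period_us' contents.

    A quota of -1 means 'no limit', in which case None is returned.
    """
    quota, period = int(quota_text), int(period_text)
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def read_cgroup_cpu_limit():
    """Best-effort read of the cgroup CPU limit (assumed typical mount points)."""
    try:
        v2 = Path("/sys/fs/cgroup/cpu.max")
        if v2.is_file():
            return parse_cgroup_v2_cpu_max(v2.read_text())
        v1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
        v1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
        if v1_quota.is_file() and v1_period.is_file():
            return parse_cgroup_v1_quota(v1_quota.read_text(), v1_period.read_text())
    except (OSError, ValueError):
        pass
    return None


def pick_default_threads(visible_cpus, cgroup_limit, cap=8):
    """Combine visible CPUs, the cgroup limit, and a fixed cap (hypothetical value)."""
    n = visible_cpus
    if cgroup_limit is not None:
        # Round a fractional quota up, but never below one thread.
        n = min(n, max(1, math.ceil(cgroup_limit)))
    return max(1, min(n, cap))


def default_cpu_threads(cap=8):
    """Allocation-aware default for the current process."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: honors CPU affinity / cpuset restrictions.
        visible = len(os.sched_getaffinity(0))
    else:
        visible = os.cpu_count() or 1
    return pick_default_threads(visible, read_cgroup_cpu_limit(), cap=cap)


if __name__ == "__main__":
    print(default_cpu_threads())
```

With something along these lines, the 192-core host with a four-core CGroups limit from my example above would default to 4 threads instead of 192, and an unrestricted 256-core host would default to the cap. The resulting number could then be passed to PyTorch via `torch.set_num_threads()`, with `--cpu-threads` still available for users who want to request more explicitly.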

PS. This is based on observations in an HPC environment with 1000+ different users.
