Running srun, sbatch, salloc from within pyxis #31
It might be possible, but honestly I haven't tried. You will likely need to have the same Slurm version inside the container as on the cluster (or bind-mount the binaries/libraries). You want to run as non-remapped root, and you might need to bind-mount some more files from the host (I don't think slurmd uses a UNIX domain socket, so at least that side should be fine). If it fails initially, using …
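For a concrete picture, here is a minimal sketch of such an invocation with pyxis, assuming a custom image that already contains the matching Slurm client commands, and assuming the host keeps its configuration under /etc/slurm and the munge socket under /run/munge (the image name and both paths are placeholders):

```bash
# Hypothetical sketch, not a tested recipe.
# Assumes the image already ships the same Slurm client version as the cluster,
# so only the configuration and the munge socket are taken from the host.
srun --no-container-remap-root \
     --container-image=my-registry#slurm-tools:matching-version \
     --container-mounts=/etc/slurm:/etc/slurm:ro,/run/munge:/run/munge \
     bash -c 'sinfo'
```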
I did a quick test, but at the moment it seems infeasible, as one needs to bind-mount too many things. For example:
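As an illustration of the kind of mounts involved (the paths below assume a distro-packaged Slurm with Debian/Ubuntu-style locations and are not taken from an actual test):

```bash
# Illustrative only; actual paths depend on the distribution and the Slurm build.
MOUNTS="/etc/slurm:/etc/slurm:ro"                # slurm.conf and friends
MOUNTS+=",/run/munge:/run/munge"                 # munge socket for authentication
MOUNTS+=",/usr/bin/srun:/usr/bin/srun:ro"        # client commands
MOUNTS+=",/usr/bin/sbatch:/usr/bin/sbatch:ro"
MOUNTS+=",/usr/lib/x86_64-linux-gnu/slurm-wlm:/usr/lib/x86_64-linux-gnu/slurm-wlm:ro"            # Slurm plugins
MOUNTS+=",/usr/lib/x86_64-linux-gnu/libslurm.so.38:/usr/lib/x86_64-linux-gnu/libslurm.so.38:ro"  # plus libmunge, ...

srun --no-container-remap-root --container-image=ubuntu:22.04 \
     --container-mounts="$MOUNTS" bash -c 'sinfo'
```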
With this I was able to at least run …

Is there any plan to add official support? One connected question: is there a plan to support passing the enroot container to …?
No, not right now, sorry: most of the work can be done in the container image (e.g. by installing the same stack / scripts inside the container image). You could also write a custom enroot hook to mount everything that is needed; I don't think it should be done by pyxis.
This has been requested a few times, so we are considering it. I can't tell you for sure if it will happen, or when. Thanks.
Could you clarify what you mean by "installing the same stack / scripts inside the container image"? Support for …

One highly specific example (which I am playing with, just to give some perspective) is the Kaldi toolkit (https://github.com/kaldi-asr/kaldi). Sure, one can run it from inside the container image (https://ngc.nvidia.com/catalog/containers/nvidia:kaldi) started with a single …; I would say this is not good practice, as during training only CPUs are in use for half of the time. Most of the scripts, however, have already been written to support GridEngine/Slurm; they generate …
I mean that you could craft a custom container image with the same Slurm libraries, binaries and configuration as the ones you install on your cluster. I guess your Slurm version doesn't change often, so it might be fine.
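For instance, a sketch of what that could look like during the image build, assuming a Debian/Ubuntu base image and a cluster running the distro-packaged Slurm (the package names and paths are assumptions):

```bash
# Illustrative image-build step (e.g. in a Dockerfile RUN instruction).
# The installed version has to match the Slurm version running on the cluster.
apt-get update && apt-get install -y --no-install-recommends slurm-client munge libmunge2

# The cluster configuration can be baked into the image (copy slurm.conf to /etc/slurm/)
# or bind-mounted at run time together with the munge socket, e.g.
#   --container-mounts=/etc/slurm:/etc/slurm:ro,/run/munge:/run/munge
```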
I see, we have similar use cases but we took a different approach: the sbatch script uses …
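A pattern along these lines, sketched here with made-up script names, image tag and resource requests, is to keep sbatch itself on the host and run only the GPU steps inside a container through pyxis:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --time=04:00:00

# CPU-only stages (data preparation, alignment, ...) run directly on the host allocation.
srun --ntasks=1 ./prepare_data.sh   # hypothetical script

# The GPU training stage runs inside the container; pyxis starts it for this step only.
srun --container-image=nvcr.io#nvidia/kaldi:latest \
     --container-mounts=/data:/data \
     ./train.sh                      # hypothetical script; image tag is a placeholder
```

Since sbatch never runs inside the container, nothing from the host's Slurm installation has to be injected into the image.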
FWIW, the following is an enroot config that should do the job (I used it in the past); you can convert it to enroot system configuration files and have Slurm be injected automatically into all your containers.
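As a hedged sketch of what such enroot system configuration files can look like, assuming Slurm is installed under a self-contained /opt/slurm prefix on the hosts (the paths, file names and mount options are assumptions to adapt to your site):

```
# /etc/enroot/mounts.d/50-slurm.fstab  (fstab-style enroot mount file; illustrative only)
/opt/slurm  /opt/slurm  none  x-create=dir,bind,ro,nosuid,nodev,private
/etc/slurm  /etc/slurm  none  x-create=dir,bind,ro,nosuid,nodev,private
/run/munge  /run/munge  none  x-create=dir,bind,nosuid,nodev,private

# /etc/enroot/environ.d/50-slurm.env  (make the client commands visible inside the container;
# the container also needs libmunge available, from the image or another mount)
PATH=/opt/slurm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```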
We are very happy with #55, but now users want to do multinode jobs with sbatch.
I don't think this is something we can support reliably unless we get https://bugs.schedmd.com/show_bug.cgi?id=12230, OR some kind of API compatibility guarantee, OR you build your containers with the same version of Slurm as what is installed on the cluster (i.e. non-portable containers).
I do have a use case for …
I would like to use Enroot containers to provide toolchain environments for Slurm, i.e. as a sort of substitute for lmod modules. A typical example is NVIDIA container images, which can contain source code with multiple steps.
My question is: is it possible to generate Slurm jobs from within a pyxis/enroot container?