Running scontrol from within container #129
Comments
I've seen multiple approaches for this. Anyway, here is what is commonly used:
I found that it's done here: https://github.com/SchedMD/slurm/blob/slurm-23-11-0-1/src/slurmctld/job_mgr.c#L16376-L16384
Thanks for the info. I'm aware that there is a delay; although a nuisance, I don't find it a big problem at the moment.

The downside of the approach of queuing a follow-up job before launching the application is that it requires executing multiple commands, or automating a single sbatch script that does so, which can become tricky. In addition, it does not really solve the main problem: how to notify a Lightning app that is running in an enroot container to create a checkpoint if the signal happens to arrive between two regular checkpoints. I'm aware that one can increase the frequency of checkpointing; what I'm really asking is whether there is a way to make Lightning, when run in an enroot/pyxis container, behave as if it were run on bare metal.

Giving the step slightly less than the allocated time will ensure that the sbatch script's signal handler has some time left, but the app running in the srun step will already be dead, so there is no chance of it performing a checkpoint. Note also that when SLURM sends an abort signal it will wait for some time on its own: https://github.com/SchedMD/slurm/blob/769da35200d4a2c0f42a6e060b2b180ed95bfc8e/src/api/step_launch.c#L671

We currently do the following: the main sbatch script handles the signal.
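The setup described above can be sketched roughly as follows. This is a minimal illustration, not the exact script from this thread: the time values, the `train.py` entry point, and the shape of the handler are all assumptions.

```shell
#!/bin/bash
#SBATCH --time=04:00:00
# Ask SLURM to send SIGUSR1 to the batch shell (B:) 300 s before the time limit.
#SBATCH --signal=B:USR1@300

handler() {
    # Forward the signal to the application in the step so it can checkpoint,
    # then requeue the job once the step has exited.
    kill -USR1 "$STEP_PID"
    wait "$STEP_PID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap handler USR1

# Give the step slightly less than the allocation so the handler above
# still has wall time left after the step ends.
srun --time=03:55:00 python train.py &
STEP_PID=$!
wait "$STEP_PID"
```

Note that the batch shell must run `srun` in the background and `wait` on it; a foreground `srun` would prevent the trap from firing until the step exits.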
You should consider the checkpointing and the follow-up job separately. They are both orthogonal to containers.
The approach above also works for regular PyTorch. It looks like there is a problem with signal handling in PyTorch Lightning, and the requeue does not work inside containers, but you don't need to rely on either of those.
In SLURM, many scripts use signals to get a notification before the time limit is reached; they use it to create a checkpoint and force a requeue of the job in question. One such example is Lightning (https://github.com/Lightning-AI/pytorch-lightning/blob/520c1e4713340f5bbf66de215471c2863e8fbdf2/src/lightning/pytorch/trainer/connectors/signal_connector.py#L86).
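The pattern linked above can be sketched in plain Python as follows. `save_checkpoint` and `requeue_job` are hypothetical stand-ins for illustration, not Lightning's actual API:

```python
import os
import signal

# Minimal sketch of the SIGUSR1 pattern Lightning's SignalConnector uses.
# The state dict stands in for real side effects so the sketch is self-contained.
state = {"checkpointed": False, "requeued": False}

def save_checkpoint():
    # A real trainer would dump model/optimizer state to disk here.
    state["checkpointed"] = True

def requeue_job():
    # A real handler would shell out to `scontrol requeue $SLURM_JOB_ID`;
    # inside an enroot/pyxis container this is exactly the call that fails,
    # because scontrol is not available on the container's PATH.
    state["requeued"] = True

def handle_sigusr1(signum, frame):
    save_checkpoint()
    requeue_job()

signal.signal(signal.SIGUSR1, handle_sigusr1)

# Simulate SLURM delivering the pre-time-limit signal to this process.
os.kill(os.getpid(), signal.SIGUSR1)
print(state)
```

The handler runs inside the training process, which is why the process must actually receive the signal: if SLURM signals only the batch shell on the host, an application inside the container never sees it.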
However, when running in an enroot container with pyxis, the command `scontrol` is not available. Any thoughts on how this could be resolved? Similar, but not the same as #31; here we'd just like to call `scontrol requeue $SLURM_JOB_ID`.
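One possible workaround, offered here as an assumption rather than something confirmed in this thread, is to bind-mount the SLURM client tooling from the host into the container so `scontrol` can run inside it. The paths below are illustrative and depend on how SLURM and munge are installed on the host; the client typically also needs `slurm.conf` and the munge socket to authenticate:

```shell
# Hypothetical sketch: mount scontrol plus its config and the munge socket
# from the host into an enroot/pyxis container.
srun --container-image=my_image.sqsh \
     --container-mounts=/usr/bin/scontrol:/usr/bin/scontrol,/etc/slurm:/etc/slurm,/run/munge:/run/munge \
     python train.py
```

Whether this works depends on the container's libc and on any shared libraries `scontrol` links against also being present or mounted, so it should be treated as a starting point, not a guaranteed fix.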