job_resource_manager_cgroups: Nvidia devices not hidden for the first job after boot #193

Open
bzizou opened this issue Dec 3, 2021 · 2 comments

Comments

@bzizou
Contributor

bzizou commented Dec 3, 2021

Setting Enable_devices_cg = "YES" hides the GPU devices that are not reserved by the current job.
However, the feature does not seem to work for the first job right after a node reboot. Subsequent jobs are fine.
Tested on Debian 9.13 nodes with V100 and A100 GPUs, rebooted several times: the problem is reproducible.

@bzizou
Contributor Author

bzizou commented Dec 3, 2021

No errors in the logs. The log does contain this debug message:
[TAKTUK OUTPUT] bigfoot10-3: perl - init (4265): output > [job_resource_manager_cgroups][30][bigfoot10][DEBUG] Deny NVIDIA GPUs: 1
Configuring the cgroup manually does hide the GPU:

root@bigfoot10:~# echo 'c 195:1 rwm' > /dev/oar_cgroups_links/devices/oar/bzizou_31/devices.deny
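
For anyone who wants to script the manual fix above, here is a minimal sketch. The function name deny_nvidia_gpu and its arguments are hypothetical and not part of OAR; it only wraps the same write to devices.deny, where 195 is the character-device major number of the NVIDIA driver and the minor selects the GPU (1 for /dev/nvidia1).

```shell
# Hypothetical helper (not part of OAR): deny access to one NVIDIA GPU
# in the devices cgroup of a job.
#   $1: cgroup directory of the job,
#       e.g. /dev/oar_cgroups_links/devices/oar/bzizou_31
#   $2: minor number of the GPU device (1 for /dev/nvidia1)
deny_nvidia_gpu() {
    cgroup_dir=$1
    minor=$2
    # "c" = character device, 195 = NVIDIA major number,
    # "rwm" = revoke read, write and mknod access.
    echo "c 195:${minor} rwm" > "${cgroup_dir}/devices.deny"
}
```

Calling deny_nvidia_gpu /dev/oar_cgroups_links/devices/oar/bzizou_31 1 as root would be equivalent to the command above.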

@bzizou
Contributor Author

bzizou commented Dec 23, 2021

Workaround: running /usr/bin/nvidia-smi -L || exit 5 from the /etc/default/oar-node startup script fixes the problem (probably by loading the NVIDIA drivers). As a side benefit, it also checks at boot time that the NVIDIA drivers are working.
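
A sketch of how the workaround could look inside /etc/default/oar-node. The function name check_nvidia_driver and the NVIDIA_SMI variable are hypothetical, introduced only to make the path overridable; the explanation in the comments is my assumption about why it helps, not something confirmed in the code.

```shell
# Hypothetical wrapper for /etc/default/oar-node (names are illustrative).
# NVIDIA_SMI defaults to the path used in the workaround.
NVIDIA_SMI="${NVIDIA_SMI:-/usr/bin/nvidia-smi}"

check_nvidia_driver() {
    # Listing the GPUs forces nvidia-smi to talk to the driver.
    # Assumption: this is what gets the driver loaded before the first job.
    "$NVIDIA_SMI" -L
}

# In /etc/default/oar-node the call would then be:
#   check_nvidia_driver || exit 5
```

The non-zero exit keeps the node from starting when the driver is broken, which is the boot-time check mentioned above.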
