Wrong metrics reported with DCGM and H100 GPUs #30
Well, it looks like our reproducer is not representative of the exporter code in slurm-job-exporter/slurm-job-exporter.py (lines 171 to 173 at 3cf80d1). However, it seems like running the reproducer unblocks something in the DCGM instance used by the exporter. 🤔
I can't reproduce with this patch:
Restart the services:
Check the output:
The bug seems to exist only when …
On compute nodes with 8 H100 GPUs, the values reported by the exporter are incorrect. This seems to be related to the DCGM Python bindings. We first observed this issue with DCGM 3.3.7, but it is reproducible with 3.3.8 and 3.3.9 as well.
We have been able to reproduce the issue with this code (running PYTHONPATH=/usr/local/dcgm/bindings/python3/ python3):

Result:
Value 140737488355328 is DCGM_FP64_BLANK and 9223372036854775792 is DCGM_INT64_BLANK.
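For reference, a minimal sketch of how one might test for these sentinel values, assuming the stock dcgmvalue module that ships with the DCGM Python bindings (the is_blank helper below is ours, not part of the bindings):

```python
# Sketch: recognizing DCGM "blank" sentinels in sampled values.
# Assumes the DCGM Python bindings are on PYTHONPATH
# (e.g. /usr/local/dcgm/bindings/python3/).
import dcgmvalue

def is_blank(value):
    """Return True if a sampled value is one of DCGM's blank sentinels."""
    if isinstance(value, float):
        return dcgmvalue.DCGM_FP64_IS_BLANK(value)
    if isinstance(value, int):
        return dcgmvalue.DCGM_INT64_IS_BLANK(value)
    return False

print(is_blank(140737488355328.0))    # True: DCGM_FP64_BLANK
print(is_blank(9223372036854775792))  # True: DCGM_INT64_BLANK
print(is_blank(42.0))                 # False: an ordinary sample
```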
We discussed this with NVIDIA; their recommendation is to add a call to group.samples.WatchFields(...) immediately after the creation of field_group. With that change, the result is correct.
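A minimal sketch of that recommended pattern, assuming the stock pydcgm bindings and a locally running nv-hostengine; the group and field-group names, the power-usage field, and the sleep-based wait are illustrative assumptions, not taken from the issue:

```python
# Sketch of the recommended pattern: create the field group, then immediately
# call WatchFields on it before reading any samples.
# Run with: PYTHONPATH=/usr/local/dcgm/bindings/python3/ python3 repro.py
import time

import pydcgm
import dcgm_fields
import dcgm_structs

handle = pydcgm.DcgmHandle(ipAddress="127.0.0.1",
                           opMode=dcgm_structs.DCGM_OPERATION_MODE_AUTO)
group = pydcgm.DcgmGroup(handle, groupName="repro-group",
                         groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
field_group = pydcgm.DcgmFieldGroup(handle, name="repro-fields",
                                    fieldIds=[dcgm_fields.DCGM_FI_DEV_POWER_USAGE])

# NVIDIA's recommendation: start watching right after the field group exists.
# Arguments: field group, update frequency (microseconds), max sample age
# (seconds), max samples to keep (0 = unlimited).
group.samples.WatchFields(field_group, 1000000, 3600.0, 0)

# Give the hostengine time to collect at least one sample before reading,
# otherwise the first read can still come back blank.
time.sleep(2)

# values is a dict keyed by GPU id, then field id, holding a time series.
latest = group.samples.GetLatest(field_group).values
for gpu_id, fields in latest.items():
    for field_id, series in fields.items():
        print("GPU %d field %d -> %s" % (gpu_id, field_id, series[0].value))
```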