
Wrong metrics reported with DCGM and H100 GPUs #30

Open
btravouillon opened this issue Jan 10, 2025 · 3 comments

@btravouillon
Contributor

On compute nodes with 8 H100 GPUs, the values reported by the exporter are incorrect. This seems to be related to the DCGM Python bindings. We first observed this issue with DCGM 3.3.7, but it is reproducible with 3.3.8 and 3.3.9 as well.

  • OS Ubuntu 22.04 w/ kernel 5.15.0-101-generic
  • NVRM version: NVIDIA UNIX x86_64 Kernel Module 560.35.03 Fri Aug 16 21:39:15 UTC 2024

We have been able to reproduce the issue with this code (running PYTHONPATH=/usr/local/dcgm/bindings/python3/ python3):

import pydcgm, dcgm_fields, dcgm_structs
handle = pydcgm.DcgmHandle(None, 'localhost')
group = pydcgm.DcgmGroup(handle, groupName="GDCM-1033", groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
fieldIds_dict = {
    dcgm_fields.DCGM_FI_DEV_NAME: 'name',
    dcgm_fields.DCGM_FI_DEV_UUID: 'uuid',
    dcgm_fields.DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR: 'cuda_visible_devices_str',
    dcgm_fields.DCGM_FI_DEV_POWER_USAGE: 'power_usage',
    dcgm_fields.DCGM_FI_DEV_FB_USED: 'fb_used',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP64_ACTIVE: 'fp64_active',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP32_ACTIVE: 'fp32_active',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP16_ACTIVE: 'fp16_active',
    dcgm_fields.DCGM_FI_PROF_SM_ACTIVE: 'sm_active',
    dcgm_fields.DCGM_FI_PROF_SM_OCCUPANCY: 'sm_occupancy',
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: 'tensor_active',
    dcgm_fields.DCGM_FI_PROF_DRAM_ACTIVE: 'dram_active',
    dcgm_fields.DCGM_FI_PROF_PCIE_TX_BYTES: 'pcie_tx_bytes',
    dcgm_fields.DCGM_FI_PROF_PCIE_RX_BYTES: 'pcie_rx_bytes',
    dcgm_fields.DCGM_FI_PROF_NVLINK_TX_BYTES: 'nvlink_tx_bytes',
    dcgm_fields.DCGM_FI_PROF_NVLINK_RX_BYTES: 'nvlink_rx_bytes',
}
field_group = pydcgm.DcgmFieldGroup(handle, name="GDCM-1033-fg", fieldIds=list(fieldIds_dict.keys()))
#  equivalent of GetLatestGpuValuesAsDict() in the exporter
gpus = {}
data = group.samples.GetLatest_v2(field_group).values
for k in data.keys():          # k: entity group
    for v in data[k].keys():   # v: entity (GPU) id
        data_dict = {}
        for metric_id in data[k][v].keys():
            data_dict[fieldIds_dict[metric_id]] = data[k][v][metric_id].values[0].value
        gpus[data_dict['uuid']] = data_dict

Result:

>>> print(gpus)
{'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6',
 'cuda_visible_devices_str': 'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6',
 'power_usage': 69.437,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa',
 'cuda_visible_devices_str': 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa',
 'power_usage': 72.917,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb',
 'cuda_visible_devices_str': 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb',
 'power_usage': 71.695,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5',
 'cuda_visible_devices_str': 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5',
 'power_usage': 68.852,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0',
 'cuda_visible_devices_str': 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0',
 'power_usage': 69.672,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-51162a0f-d600-913b-c90f-38411bb9145d': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-51162a0f-d600-913b-c90f-38411bb9145d',
 'cuda_visible_devices_str': 'GPU-51162a0f-d600-913b-c90f-38411bb9145d',
 'power_usage': 69.566,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542',
 'cuda_visible_devices_str': 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542',
 'power_usage': 70.208,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792},
 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc': {'name': 'NVIDIA H100 80GB HBM3',
 'uuid': 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc',
 'cuda_visible_devices_str': 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc',
 'power_usage': 69.656,
 'fb_used': 0,
 'fp64_active': 140737488355328.0,
 'fp32_active': 140737488355328.0,
 'fp16_active': 140737488355328.0,
 'sm_active': 140737488355328.0,
 'sm_occupancy': 140737488355328.0,
 'tensor_active': 140737488355328.0,
 'dram_active': 140737488355328.0,
 'pcie_tx_bytes': 9223372036854775792,
 'pcie_rx_bytes': 9223372036854775792,
 'nvlink_tx_bytes': 9223372036854775792,
 'nvlink_rx_bytes': 9223372036854775792}}

The value 140737488355328 is DCGM_FP64_BLANK and 9223372036854775792 is DCGM_INT64_BLANK, i.e. DCGM has no data for these fields.
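
For reference, the Python bindings ship a dcgmvalue module with helpers to detect these sentinels. A minimal sketch (assuming the dcgmvalue helper names, which may vary between DCGM releases) that could filter blanks out instead of exporting them:

import dcgmvalue

def is_blank(value):
    """Return True when DCGM reports a blank (never sampled) value."""
    if isinstance(value, float):
        return dcgmvalue.DCGM_FP64_IS_BLANK(value)
    if isinstance(value, int):
        return dcgmvalue.DCGM_INT64_IS_BLANK(value)
    return False

# e.g. drop blank samples from the per-GPU dict built above
# clean = {name: val for name, val in data_dict.items() if not is_blank(val)}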

We discussed this with NVIDIA; their recommendation is to add a call to group.samples.WatchFields(...) immediately after the creation of field_group.

import pydcgm, dcgm_fields, dcgm_structs
handle = pydcgm.DcgmHandle(None, 'localhost')
group = pydcgm.DcgmGroup(handle, groupName="GDCM-1033", groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
fieldIds_dict = {
    dcgm_fields.DCGM_FI_DEV_NAME: 'name',
    dcgm_fields.DCGM_FI_DEV_UUID: 'uuid',
    dcgm_fields.DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR: 'cuda_visible_devices_str',
    dcgm_fields.DCGM_FI_DEV_POWER_USAGE: 'power_usage',
    dcgm_fields.DCGM_FI_DEV_FB_USED: 'fb_used',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP64_ACTIVE: 'fp64_active',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP32_ACTIVE: 'fp32_active',
    dcgm_fields.DCGM_FI_PROF_PIPE_FP16_ACTIVE: 'fp16_active',
    dcgm_fields.DCGM_FI_PROF_SM_ACTIVE: 'sm_active',
    dcgm_fields.DCGM_FI_PROF_SM_OCCUPANCY: 'sm_occupancy',
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: 'tensor_active',
    dcgm_fields.DCGM_FI_PROF_DRAM_ACTIVE: 'dram_active',
    dcgm_fields.DCGM_FI_PROF_PCIE_TX_BYTES: 'pcie_tx_bytes',
    dcgm_fields.DCGM_FI_PROF_PCIE_RX_BYTES: 'pcie_rx_bytes',
    dcgm_fields.DCGM_FI_PROF_NVLINK_TX_BYTES: 'nvlink_tx_bytes',
    dcgm_fields.DCGM_FI_PROF_NVLINK_RX_BYTES: 'nvlink_rx_bytes',
}
field_group = pydcgm.DcgmFieldGroup(handle, name="GDCM-1033-fg", fieldIds=list(fieldIds_dict.keys()))

### Add group.samples.WatchFields(...) as per NVIDIA recommendation
# (updateFreq=500000 µs, maxKeepAge=2 s, maxKeepSamples=1)
group.samples.WatchFields(field_group, 500000, 2, 1)

#  def GetLatestGpuValuesAsDict(self):
gpus = {}
data = group.samples.GetLatest_v2(field_group).values
for k in data.keys():
    for v in data[k].keys():
        data_dict = {}
        for metric_id in data[k][v].keys():
            data_dict[fieldIds_dict[metric_id]] = data[k][v][metric_id].values[0].value
        gpus[data_dict['uuid']] = data_dict

The result is now correct:

>>> print(gpus)
{'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6', 
 'cuda_visible_devices_str': 'GPU-6e19195c-c2c2-16a4-18da-f1502460eac6', 
 'power_usage': 572.191, 
 'fb_used': 32429, 
 'fp64_active': 1.7514509299667577e-09, 
 'fp32_active': 0.054007117442023134, 
 'fp16_active': 0.0, 
 'sm_active': 0.9125968658836753, 
 'sm_occupancy': 0.24450015849762033, 
 'tensor_active': 0.2793637431083633, 
 'dram_active': 0.21368455905985453, 
 'pcie_tx_bytes': 67358889, 
 'pcie_rx_bytes': 873509672, 
 'nvlink_tx_bytes': 4991556833, 
 'nvlink_rx_bytes': 4991558873}, 
 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa', 
 'cuda_visible_devices_str': 'GPU-e7aa527a-300f-5834-0a07-99407bb3d5fa', 
 'power_usage': 601.753, 
 'fb_used': 28617, 
 'fp64_active': 0.0, 
 'fp32_active': 0.05187467892504175, 
 'fp16_active': 0.0, 
 'sm_active': 0.9322035653753363, 
 'sm_occupancy': 0.24773293430776908, 
 'tensor_active': 0.2830590854513563, 
 'dram_active': 0.1826625855083514, 
 'pcie_tx_bytes': 26973102, 
 'pcie_rx_bytes': 304185631, 
 'nvlink_tx_bytes': 5062558183, 
 'nvlink_rx_bytes': 0}, 
 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb', 
 'cuda_visible_devices_str': 'GPU-17641ab7-f460-8287-8589-5ef81faf9dfb', 
 'power_usage': 600.739, 
 'fb_used': 28617, 
 'fp64_active': 2.1090803869619573e-09, 
 'fp32_active': 0.050695636413476734, 
 'fp16_active': 0.0, 
 'sm_active': 0.9342393368507337, 
 'sm_occupancy': 0.24635864440821617, 
 'tensor_active': 0.28626773465598904, 
 'dram_active': 0.16014441719578104, 
 'pcie_tx_bytes': 5224345, 
 'pcie_rx_bytes': 891117843, 
 'nvlink_tx_bytes': 5083159753, 
 'nvlink_rx_bytes': 5083161830}, 
 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5', 
 'cuda_visible_devices_str': 'GPU-41002eac-1cc7-2066-2006-d3de6c1676c5', 
 'power_usage': 568.164, 
 'fb_used': 32235, 
 'fp64_active': 2.1065882116852233e-09, 
 'fp32_active': 0.04971897050177449, 
 'fp16_active': 0.0, 
 'sm_active': 0.934624085992115, 
 'sm_occupancy': 0.22844803593917912, 
 'tensor_active': 0.2861345835375202, 
 'dram_active': 0.18574656829769384, 
 'pcie_tx_bytes': 17259837, 
 'pcie_rx_bytes': 174485390, 
 'nvlink_tx_bytes': 5085317445, 
 'nvlink_rx_bytes': 5085320562}, 
 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0', 
 'cuda_visible_devices_str': 'GPU-806dd04e-c6fb-a259-6514-5312805a27a0', 
 'power_usage': 162.685, 
 'fb_used': 63881, 
 'fp64_active': 0.0, 
 'fp32_active': 0.004500572554696292, 
 'fp16_active': 0.0, 
 'sm_active': 0.09186566209177509, 
 'sm_occupancy': 0.06431802102778272, 
 'tensor_active': 0.00419006513502011, 
 'dram_active': 0.049247060744584174, 
 'pcie_tx_bytes': 239610456, 
 'pcie_rx_bytes': 3945254569, 
 'nvlink_tx_bytes': 0, 
 'nvlink_rx_bytes': 0}, 
 'GPU-51162a0f-d600-913b-c90f-38411bb9145d': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-51162a0f-d600-913b-c90f-38411bb9145d', 
 'cuda_visible_devices_str': 'GPU-51162a0f-d600-913b-c90f-38411bb9145d', 
 'power_usage': 265.443, 
 'fb_used': 63931, 
 'fp64_active': 0.0, 
 'fp32_active': 0.01520669847811188, 
 'fp16_active': 0.0, 
 'sm_active': 0.2909458752707715, 
 'sm_occupancy': 0.24126590739614598, 
 'tensor_active': 0.016874855558779784, 
 'dram_active': 0.1493764348303913, 
 'pcie_tx_bytes': 204987026, 
 'pcie_rx_bytes': 3313134855, 
 'nvlink_tx_bytes': 0, 
 'nvlink_rx_bytes': 0}, 
 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542', 
 'cuda_visible_devices_str': 'GPU-f5e273b9-0661-7ac5-7d6f-2c006ba43542', 
 'power_usage': 199.216, 
 'fb_used': 63929, 
 'fp64_active': 0.0, 
 'fp32_active': 0.011083024234991155, 
 'fp16_active': 0.0, 
 'sm_active': 0.17074735183194517, 
 'sm_occupancy': 0.11907241625423479, 
 'tensor_active': 0.011165112325956574, 
 'dram_active': 0.07503498668885991, 
 'pcie_tx_bytes': 205568524, 
 'pcie_rx_bytes': 4262779163, 
 'nvlink_tx_bytes': 0, 
 'nvlink_rx_bytes': 0}, 
 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc': {'name': 'NVIDIA H100 80GB HBM3', 
 'uuid': 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc', 
 'cuda_visible_devices_str': 'GPU-9a89a81b-6a19-f806-21bd-15dfd4bd7edc', 
 'power_usage': 138.305, 
 'fb_used': 63689, 
 'fp64_active': 0.0, 
 'fp32_active': 0.002582050984307507, 
 'fp16_active': 0.0, 
 'sm_active': 0.05673520518248468, 
 'sm_occupancy': 0.03611623630139204, 
 'tensor_active': 0.0020414085757989773, 
 'dram_active': 0.027607152196311702, 
 'pcie_tx_bytes': 196231163, 
 'pcie_rx_bytes': 3207094184, 
 'nvlink_tx_bytes': 0, 
 'nvlink_rx_bytes': 0}}
@btravouillon
Contributor Author

Interestingly enough, the exporter now reports correct values on the node where I ran the patched reproducer.

Moreover, the values are updated over time.

@btravouillon
Contributor Author

Well, it looks like our reproducer is not representative: in the exporter code, group.samples.WatchFields() is already called.

self.field_group = pydcgm.DcgmFieldGroup(self.handle, name="slurm-job-exporter-fg", fieldIds=list(self.fieldIds_dict.keys()))
# updateFreq in µs, maxKeepAge in s, maxKeepSamples=0 (no limit)
self.group.samples.WatchFields(self.field_group, dcgm_update_interval * 1000 * 1000, dcgm_update_interval * 2.0, 0)
self.handle.GetSystem().UpdateAllFields(True)

However, it seems like running the reproducer unblocks something in the DCGM instance used by the exporter. 🤔

@btravouillon
Contributor Author

I can't reproduce with this patch:

diff --git a/slurm-job-exporter.py b/slurm-job-exporter.py
index 8c3ee3f..ec4d69a 100644
--- a/slurm-job-exporter.py
+++ b/slurm-job-exporter.py
@@ -169,7 +169,7 @@ class SlurmJobCollector(object):
                             break
 
                     self.field_group = pydcgm.DcgmFieldGroup(self.handle, name="slurm-job-exporter-fg", fieldIds=list(self.fieldIds_dict.keys()))
-                    self.group.samples.WatchFields(self.field_group, dcgm_update_interval * 1000 * 1000, dcgm_update_interval * 2.0, 0)
+                    self.group.samples.WatchFields(self.field_group, dcgm_update_interval * 1000 * 1000, dcgm_update_interval * 2.0, 5)
                     self.handle.GetSystem().UpdateAllFields(True)
 
                     print('Monitoring GPUs with DCGM with an update interval of {} seconds'.format(dcgm_update_interval))

Restart the services:

$ sudo systemctl stop slurm-job-exporter.service; sudo systemctl restart nvidia-dcgm.service; sudo systemctl start slurm-job-exporter.service;

Check the output:

slurm_job_utilization_gpu{account="acct",gpu="2",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5878962",user="michel"} 80.35592921744527
slurm_job_utilization_gpu{account="acct",gpu="3",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5878962",user="michel"} 80.00630602548209
slurm_job_utilization_gpu{account="acct",gpu="1",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5878962",user="michel"} 80.43721226057505
slurm_job_utilization_gpu{account="acct",gpu="0",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5878962",user="michel"} 80.35148053466808
slurm_job_utilization_gpu{account="acct",gpu="1",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5872724",user="benjamin"} 21.774675399315907
slurm_job_utilization_gpu{account="acct",gpu="3",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5872724",user="benjamin"} 5.693759693805377
slurm_job_utilization_gpu{account="acct",gpu="2",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5872724",user="benjamin"} 9.079627934720886
slurm_job_utilization_gpu{account="acct",gpu="0",gpu_type="NVIDIA H100 80GB HBM3",slurmjobid="5872724",user="benjamin"} 5.185860545867245

The bug seems to occur only when maxKeepSamples is unlimited (0).
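
For clarity, here is the patched call with each argument spelled out (my reading of pydcgm's DcgmGroupSamples.WatchFields(fieldGroup, updateFreq, maxKeepAge, maxKeepSamples) signature; treat the parameter names as assumptions):

self.group.samples.WatchFields(
    self.field_group,
    dcgm_update_interval * 1000 * 1000,  # updateFreq: sampling interval, in microseconds
    dcgm_update_interval * 2.0,          # maxKeepAge: how many seconds of samples to keep cached
    5)                                   # maxKeepSamples: bound the cache; 0 (no limit) is what triggers the blanks

With the sample cache bounded, the exporter keeps reporting real values on our H100 nodes.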
