Wrong metrics reported with DCGM and H100 GPUs #30
Well, it looks like our reproducer is not representative of the exporter code in slurm-job-exporter/slurm-job-exporter.py (lines 171 to 173 at 3cf80d1). However, it seems like running the reproducer unblocks something in the DCGM instance used by the exporter. 🤔
I can't reproduce with this patch:
Restart the services:
Check the output:
The bug seems to exist only when …
On compute nodes with 8 H100 GPUs, the values reported by the exporter are incorrect. This seems to be related to the DCGM Python bindings. We first observed this issue with DCGM 3.3.7, but it is reproducible with 3.3.8 and 3.3.9 as well.
We have been able to reproduce the issue with this code (running PYTHONPATH=/usr/local/dcgm/bindings/python3/ python3):

Result:
Value 140737488355328 is DCGM_FP64_BLANK and 9223372036854775792 is DCGM_INT64_BLANK.
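For reference, a minimal sketch of how one might test for these sentinel values, assuming the stock dcgmvalue module that ships with the DCGM Python bindings (the is_blank helper below is ours, not part of the bindings):

```python
# Sketch: recognizing DCGM "blank" sentinels in sampled values.
# Assumes the DCGM Python bindings are on PYTHONPATH
# (e.g. /usr/local/dcgm/bindings/python3/).
import dcgmvalue

def is_blank(value):
    """Return True if a sampled value is one of DCGM's blank sentinels."""
    if isinstance(value, float):
        return dcgmvalue.DCGM_FP64_IS_BLANK(value)
    if isinstance(value, int):
        return dcgmvalue.DCGM_INT64_IS_BLANK(value)
    return False

print(is_blank(140737488355328.0))    # True: DCGM_FP64_BLANK
print(is_blank(9223372036854775792))  # True: DCGM_INT64_BLANK
print(is_blank(42.0))                 # False: an ordinary sample
```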
We discussed this with NVIDIA; their recommendation is to add a call to group.samples.WatchFields(...) immediately after the creation of field_group. With that change, the result is correct.
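A minimal sketch of that recommended pattern, assuming the stock pydcgm bindings and a locally running nv-hostengine; the group and field-group names, the power-usage field, and the sleep-based wait are illustrative assumptions, not taken from the issue:

```python
# Sketch of the recommended pattern: create the field group, then immediately
# call WatchFields on it before reading any samples.
# Run with: PYTHONPATH=/usr/local/dcgm/bindings/python3/ python3 repro.py
import time

import pydcgm
import dcgm_fields
import dcgm_structs

handle = pydcgm.DcgmHandle(ipAddress="127.0.0.1",
                           opMode=dcgm_structs.DCGM_OPERATION_MODE_AUTO)
group = pydcgm.DcgmGroup(handle, groupName="repro-group",
                         groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
field_group = pydcgm.DcgmFieldGroup(handle, name="repro-fields",
                                    fieldIds=[dcgm_fields.DCGM_FI_DEV_POWER_USAGE])

# NVIDIA's recommendation: start watching right after the field group exists.
# Arguments: field group, update frequency (microseconds), max sample age
# (seconds), max samples to keep (0 = unlimited).
group.samples.WatchFields(field_group, 1000000, 3600.0, 0)

# Give the hostengine time to collect at least one sample before reading,
# otherwise the first read can still come back blank.
time.sleep(2)

# values is a dict keyed by GPU id, then field id, holding a time series.
latest = group.samples.GetLatest(field_group).values
for gpu_id, fields in latest.items():
    for field_id, series in fields.items():
        print("GPU %d field %d -> %s" % (gpu_id, field_id, series[0].value))
```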