Goal of this issue
Preserve the emails, messages, and ideas about this topic, possibly leading to a point of view or a concrete goal; at a minimum, make sure it is not forgotten and keep it open for discussion.
Starting point
https://github.com/IBM/autopilot
Starting questions
Are there any GPU-related tools that could be useful for baselining in the work we are involved in?
Besides NVIDIA tools (which are not open source), are there alternative GPU-related tools, particularly open-source ones, that we can use?
Status on 2024/10/08
Summary:
Not really open source: we still need NVIDIA software underneath.
Limited to getting status before and after a job (for now / in the current version).
Errors during a job cannot be detected or reacted to.
Works with Prometheus/Grafana.
Baselining:
Autopilot can only run its checks before and after a job, not while it is running (this feature is planned for a future release).
If that is enough for now, we can use it.
Open source:
Yes, but it still relies on NVIDIA tools such as nvidia-smi and dcgmi.
So it is not really a full alternative.
Limitations:
There is no checking during jobs for now.
Errors cannot be detected in real time.
Integration:
It works fine with Prometheus and Grafana,
so it fits into our existing setup (see the sketch after this list).
Setup:
It needs to run on every GPU node,
and the NVIDIA tools must be installed there.
Results can then be gathered centrally.
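Since the checks surface as the autopilot_health_checks Prometheus metric (see Test Install below), one way to use the before/after model is to snapshot that metric around a job and diff the two snapshots. The following is a minimal sketch, not part of autopilot: the Prometheus URL is a placeholder, and no particular label names on the metric are assumed.

```python
"""Sketch: snapshot autopilot_health_checks from Prometheus before and
after a job and report which series changed. Not autopilot code."""
import requests  # assumes the 'requests' package is installed

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder, adjust to the cluster


def snapshot_health_checks() -> dict:
    """Return {frozenset(label pairs): value} for every autopilot_health_checks series."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "autopilot_health_checks"},
        timeout=10,
    )
    resp.raise_for_status()
    series_list = resp.json()["data"]["result"]
    return {
        frozenset(series["metric"].items()): float(series["value"][1])
        for series in series_list
    }


def diff_snapshots(before: dict, after: dict) -> list:
    """List every series whose value appeared or changed between the two snapshots."""
    changed = []
    for labels, new_value in after.items():
        old_value = before.get(labels)
        if old_value is None or old_value != new_value:
            changed.append((dict(labels), old_value, new_value))
    return changed


if __name__ == "__main__":
    before = snapshot_health_checks()
    # ... run the GPU job here ...
    after = snapshot_health_checks()
    for labels, old, new in diff_snapshots(before, after):
        print(f"health check changed: {labels}: {old} -> {new}")
```

This keeps us entirely on the Prometheus/Grafana side, so it works regardless of which labels autopilot attaches to the metric on a given cluster.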
Test Install
The autopilot project exposes one metric (autopilot_health_checks) that reports results for several health concerns; so far I can see pciebw (GPU PCIe link bandwidth), power-slowdown (GPU power slowdown), remapped (remapped rows), and dcgm (DCGM level 3).
The docs also list additional checks (a small CLI sketch follows the list):
GPU PCIe Link Bandwidth: the NVIDIA PCIe bandwidth test to check the host-to-device connection on each node
GPU Memory: GPU remapped-rows evaluation through nvidia-smi
GPU Memory Bandwidth Performance: GPU memory bandwidth evaluation through DAXPY and DGEMM
GPU Diagnostics: NVIDIA DCGM (Data Center GPU Manager) diagnostics through dcgmi diag
GPU Power Slowdown: verify whether power throttling is active through nvidia-smi
Network Reachability: ping to evaluate host reachability
Network Bandwidth: iperf3 to evaluate network bandwidth and host connectivity
PVC Create/Delete: given a storage class, test the ability to successfully provision a Persistent Volume Claim
DCGM level 3: deep diagnostics through the NVIDIA DCGM tool; this test runs as a separate Job that reserves all the GPUs in the node if they are free
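For orientation, the sketch below shows the kind of underlying NVIDIA CLI calls two of these documented checks rely on: the power-slowdown check (throttle reasons via nvidia-smi) and the DCGM level 3 diagnostics (dcgmi diag -r 3). It is not autopilot's own code; it only illustrates why nvidia-smi and dcgmi must be present on every GPU node.

```python
"""Sketch: NVIDIA CLI calls of the kind the power-slowdown and DCGM
checks rely on. Illustration only, not autopilot's implementation."""
import subprocess


def power_slowdown_active() -> bool:
    """True if any GPU reports an active software power cap or hardware slowdown."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown",
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for line in out.strip().splitlines():
        # each line: "<gpu index>, <sw_power_cap state>, <hw_slowdown state>"
        fields = [f.strip() for f in line.split(",")]
        if any(state == "Active" for state in fields[1:]):
            return True
    return False


def run_dcgm_diag(level: int = 3) -> str:
    """Run DCGM diagnostics; level 3 is the deep run autopilot schedules as a separate Job."""
    result = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        check=True,  # raises if dcgmi exits non-zero
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print("power slowdown active:", power_slowdown_active())
    print(run_dcgm_diag(3))
```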
Feedback Heidi
The evaluation shows some downsides but also potential for targeted performance evaluations.
There is a concern about needing the tool on every node, but adding/removing it selectively could be useful for performance evaluations or debugging.
The current checks are based on NVIDIA tooling; while that is useful for testing, open-source alternatives should be considered.
The current project and the baselining work also rely on NVIDIA tools.
Staying in touch with S and I is important as the baselining work continues this quarter.
There is potential to use selected baselining tools for R's system on InstructLab for OpenShift AI.
PCIe bandwidth issues were noted during performance tests on H100s, which might be relevant when monitoring the test cluster during ML workloads (trying to drive all the hardware, the H100s, as fast as possible).
/CC @schwesig @computate @hpdempsey