possible goals/pov: IBM autopilot tools #66

schwesig · 2024-10-09T15:57:23Z

Goal of this issue
Preserve the emails, messages, and ideas about this topic.
They may lead to a pov or a goal. At a minimum, this keeps the topic from being forgotten and open for discussion:

Starting point:
https://github.com/IBM/autopilot

Starting Questions

Are there any GPU-related tools that could be useful for baselining in the work we are involved in?
Besides NVIDIA tools (which are not open source), are there alternative GPU-related tools, particularly open-source ones, that we can use?

status on 2024/10/08

Summary:

Not really open source: we still need NVIDIA software.
In the current version, it is limited to checking status before and after a job.
- Errors during the job cannot be detected or reacted to.

Works with Prometheus/Grafana

Baselining:

Autopilot can only run checks before and after a job, not while the job is running (this feature is coming in the future).
If this is enough for now, we can use it.
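A minimal sketch of how before/after baselining could be used, assuming the check results are collected into one snapshot before the job and one after it. The snapshot shape (node and check name mapped to pass/fail) and the function name `diff_snapshots` are hypothetical and not part of Autopilot.

```python
# Hypothetical snapshot shape: {(node, check_name): passed}
Snapshot = dict[tuple[str, str], bool]


def diff_snapshots(before: Snapshot, after: Snapshot) -> list[tuple[str, str, str]]:
    """Return (node, check, change) for every check whose result differs."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        b = before.get(key)
        a = after.get(key)
        if b == a:
            continue  # same result before and after the job
        if b is None:
            change = "new"        # only present in the post-job snapshot
        elif a is None:
            change = "missing"    # only present in the pre-job snapshot
        elif b and not a:
            change = "regressed"  # passed before the job, failed after
        else:
            change = "recovered"  # failed before the job, passed after
        changes.append((key[0], key[1], change))
    return changes


if __name__ == "__main__":
    before = {("node-a", "pciebw"): True, ("node-a", "remapped"): True}
    after = {("node-a", "pciebw"): False, ("node-a", "remapped"): True}
    for node, check, change in diff_snapshots(before, after):
        print(f"{node} {check}: {change}")
```

This only shows that something changed between the two snapshots. It cannot say when during the job the change happened, which is the limitation noted above.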

Open Source:

Yes, but it still relies on NVIDIA tools such as nvidia-smi and dcgmi.
So it is not really a full alternative.

Limitations:

There is no checking during the jobs for now.
Errors cannot be detected in real-time.

Integration:

It works fine with Prometheus and Grafana,
so it fits into our system.
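A rough sketch of pulling the Autopilot metric out of Prometheus through its HTTP query API (`/api/v1/query`). The host name `prometheus.example` and the label names in the canned response are assumptions for illustration only. The sketch builds the URL and parses an instant-vector response, and does not assume what a given sample value means.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> list[tuple[dict, float]]:
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with the real one.
    print(build_query_url("http://prometheus.example:9090", "autopilot_health_checks"))

    # Canned response with made-up labels and values, standing in for a real HTTP call.
    canned = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"__name__": "autopilot_health_checks",
                            "health": "pciebw", "node": "node-a"},
                 "value": [1728400000, "1"]},
            ],
        },
    })
    for labels, value in parse_instant_vector(canned):
        print(labels, value)
```

The actual request could be made with `urllib.request.urlopen` on the built URL. It is left out here so the sketch runs without a Prometheus server.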

Setup:

It needs to run on every GPU node,
and the NVIDIA tools must be installed there.
With that in place, results can be gathered.

Test Install

The autopilot project exposes one metric, autopilot_health_checks, which reports data about different health concerns. The ones I can see so far are pciebw (GPU PCIe Link Bandwidth), power-slowdown (GPU Power Slowdown), remapped, and dcgm (DCGM level 3).
The docs also list additional statuses:
- GPU PCIe Link Bandwidth: the NVIDIA PCIe bandwidth test, which checks the host-to-device connection on each node
- GPU Memory: evaluation of GPU remapped rows through nvidia-smi
- GPU Memory Bandwidth Performance: evaluation of GPU memory bandwidth through DAXPY and DGEMM
- GPU Diagnostics: NVIDIA DCGM (Data Center GPU Manager) diagnostics through dcgmi diag
- GPU Power Slowdown: verifies whether power throttling is active through nvidia-smi
- Network Reachability: ping to evaluate host reachability
- Network Bandwidth: iperf3 to evaluate network bandwidth and host connectivity
- PVC Create/Delete: given a storageclass, tests the ability to successfully provision a Persistent Volume Claim
- DCGM level 3: deep diagnostics through the NVIDIA DCGM tool. This test runs as a separate Job that reserves all the GPUs in the node if they are free
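To see which of these checks actually show up in `autopilot_health_checks`, the metric can be read in Prometheus text exposition format and grouped by check. The sketch below assumes the check name is carried in a label called `health` and uses made-up sample lines and values. The label names should be verified against a real scrape.

```python
import re
from collections import defaultdict

# Matches one sample line of the metric: name{labels} value
# Simplification: assumes label values contain no "}" character.
SAMPLE_RE = re.compile(r'^autopilot_health_checks\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_health_samples(text: str) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs for autopilot_health_checks samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # a different metric
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((labels, float(match.group("value"))))
    return samples


def group_by_check(samples: list[tuple[dict, float]], label: str = "health") -> dict:
    """Group sample values by the (assumed) label that names the check."""
    grouped = defaultdict(list)
    for labels, value in samples:
        grouped[labels.get(label, "unknown")].append(value)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up exposition text; labels and values are illustrative only.
    text = """\
# TYPE autopilot_health_checks gauge
autopilot_health_checks{health="pciebw",node="node-a",deviceid="0"} 12.5
autopilot_health_checks{health="pciebw",node="node-a",deviceid="1"} 12.0
autopilot_health_checks{health="remapped",node="node-a"} 0
other_metric{a="b"} 3
"""
    for check, values in group_by_check(parse_health_samples(text)).items():
        print(check, values)
```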

Feedback Heidi

Evaluation shows some downsides but potential for targeted performance evaluations.
There is a concern about needing the tool on every node, but adding and removing it selectively could be useful for performance evaluations or debugging.
Current tools are based on NVIDIA software, and while useful for testing, open-source alternatives should be considered.
The current project and baselining work also rely on NVIDIA tools.
Staying in touch with S and I is important as the baselining work continues this quarter.
Potential to use selected baselining tools for R’s system on InstructLab for OpenShift AI.
PCIe bandwidth issues were noted during performance tests on H100s, where the aim was to drive all the hardware as fast as possible. This might be relevant to monitoring the test cluster during ML workloads.

/CC @schwesig @computate @hpdempsey

computate · 2024-10-09T18:35:48Z

As a next step, I will integrate the provided Autopilot metrics and Grafana dashboards into NERC observability, and then do a demo.

schwesig · 2024-10-16T14:11:18Z

schwesig added the documentation Improvements or additions to documentation label Oct 9, 2024

schwesig assigned schwesig, computate and hpdempsey Oct 9, 2024
