possible goals/pov: IBM autopilot tools #66

Open
schwesig opened this issue Oct 9, 2024 · 2 comments

Labels
documentation Improvements or additions to documentation

Comments

@schwesig
Contributor

schwesig commented Oct 9, 2024

Goal of this issue
Preserve the emails, messages, and ideas on this topic so they are not forgotten and stay open for discussion.
This may lead to a PoV or a goal.

Starting point:
https://github.com/IBM/autopilot

Starting Questions

  • Are there any GPU-related tools that could be useful for baselining in the work we are involved in?
  • Besides NVIDIA tools (which are not open source), are there alternative GPU-related tools, particularly open-source ones, that we can use?

status on 2024/10/08

Summary:

  • Not a fully open-source stack: Autopilot still needs Nvidia software.
  • The current version only reports status before and after a job.
    • Errors during the job cannot be detected or reacted to.
  • Works with Prometheus/Grafana.

Baselining:

Autopilot can only run checks before and after a job, not while the job is running (that feature is planned for a future version).
If that is enough for now, we can use it.
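
Since checks only run before and after a job, one way to use them for baselining is to compare the two sets of results. The following is a minimal sketch under stated assumptions: the snapshot shape (a mapping from node and check name to pass/fail), the node names, and the `diff_snapshots` helper are all hypothetical and not part of Autopilot.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical snapshot shape (an assumption, not an Autopilot format):
# {(node_name, check_name): passed}
Snapshot = Dict[Tuple[str, str], bool]


def diff_snapshots(before: Snapshot, after: Snapshot) -> List[Tuple[str, str, str]]:
    """Return (node, check, change) for every check whose result changed."""
    changes: List[Tuple[str, str, str]] = []
    for key in sorted(set(before) | set(after)):
        old: Optional[bool] = before.get(key)
        new: Optional[bool] = after.get(key)
        if old == new:
            continue  # same result before and after the job
        if old is None:
            change = "new result"
        elif new is None:
            change = "missing after job"
        elif new:
            change = "recovered"
        else:
            change = "regressed"
        changes.append((key[0], key[1], change))
    return changes


if __name__ == "__main__":
    # Made-up example data for illustration only.
    before = {("node-a", "pciebw"): True, ("node-a", "remapped"): True}
    after = {("node-a", "pciebw"): False, ("node-a", "remapped"): True}
    for node, check, change in diff_snapshots(before, after):
        print(f"{node} {check}: {change}")
```

A diff like this can only say that something changed across the job, not when it happened, which matches the limitation described above.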

Open Source:

Yes, but it still uses Nvidia tools such as nvidia-smi and dcgmi.
It is not really a full alternative.

Limitations:

For now, there is no checking while jobs are running.
Errors cannot be detected in real time.

Integration:

It works fine with Prometheus and Grafana,
so it fits into our system.
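
As an illustration of the Prometheus side, the sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and parses the JSON response into rows. The metric name `autopilot_health_checks` comes from this issue; the label names `health`, `node`, and `deviceid`, the base URL, and the sample response are assumptions and should be checked against the actual metric in our Prometheus.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, check: str) -> str:
    """Build an instant-query URL for one health check.

    The label name "health" is an assumption; verify it in Prometheus.
    """
    promql = f'autopilot_health_checks{{health="{check}"}}'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> List[Tuple[str, str, float]]:
    """Turn a Prometheus instant-vector response into (node, deviceid, value) rows."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        rows.append((labels.get("node", ""), labels.get("deviceid", ""), float(value)))
    return rows


if __name__ == "__main__":
    # Hypothetical Prometheus address, for illustration only.
    print(build_query_url("http://prometheus.example:9090", "pciebw"))

    # Made-up response body in the standard Prometheus response format.
    sample_body = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"node": "node-a", "deviceid": "0"}, "value": [1728400000, "1"]},
            ],
        },
    })
    print(parse_instant_vector(sample_body))
```

The URL can be fetched with any HTTP client; the same PromQL expression can also be pasted into a Grafana panel.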

Setup:

It needs to run on every GPU node,
and the Nvidia tools must be installed there.
The results from the nodes can then be gathered.

Test Install

  • The autopilot project has one metric (autopilot_health_checks) that reports data about different health concerns. The ones I can see so far are pciebw (GPU PCIe Link Bandwidth), power-slowdown (GPU Power Slowdown), remapped, and dcgm (DCGM level 3).
  • The docs also list additional statuses:
    • GPU PCIe Link Bandwidth: The PCIe NVidia bandwidth test to check host-to-device connection on each node
    • GPU Memory: GPUs remapped rows evaluation through nvidia-smi
    • GPU Memory Bandwidth Performance: GPUs memory bandwidth evaluation through DAXPY and DGEMM
    • GPU Diagnostics: NVidia DCGM (Data Center GPU Manager) diagnostics through dcgmi diag
    • GPU Power Slowdown: verify if power throttle is active through nvidia-smi
    • Network Reachability: ping to evaluate hosts reachability
    • Network Bandwidth: iperf3 to evaluate network bandwidth and hosts connectivity
    • PVC Create/Delete: given a storageclass, test the ability to successfully provision a Persistent Volume Claim
    • DCGM level 3: deep diagnostics through NVidia DCGM tool. This test runs as a separate Job that reserves all the GPUs in the node if they are free
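
To make the nvidia-smi based checks more concrete, here is a minimal sketch of how a power-slowdown check could interpret CSV output from nvidia-smi. It is not Autopilot's implementation. It assumes output from a query such as `nvidia-smi --query-gpu=index,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown --format=csv,noheader`; the exact field names vary by driver version, and the sample text and the `power_slowdown_by_gpu` helper are hypothetical.

```python
import csv
import io
from typing import Dict


def power_slowdown_by_gpu(csv_text: str) -> Dict[int, bool]:
    """Map GPU index -> True if any listed throttle reason reads "Active".

    Expects one line per GPU: index first, then one column per throttle
    reason (assumed values: "Active" or "Not Active").
    """
    result: Dict[int, bool] = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        index, *reasons = [cell.strip() for cell in row]
        result[int(index)] = any(reason == "Active" for reason in reasons)
    return result


if __name__ == "__main__":
    # Made-up sample output for two GPUs, for illustration only.
    sample = "0, Not Active, Not Active\n1, Active, Not Active\n"
    for gpu, throttled in power_slowdown_by_gpu(sample).items():
        print(f"GPU {gpu}: {'power slowdown' if throttled else 'ok'}")
```

Because the parser takes text rather than calling nvidia-smi itself, it can be tried on a machine without a GPU by pasting in captured output.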

Feedback from Heidi

  • Evaluation shows some downsides but potential for targeted performance evaluations.
  • Concern about needing the tool on every node, but adding/removing it selectively could be useful for performance evals or debugging.
  • Current tools are based on Nvidia, and while useful for testing, open-source alternatives should be considered.
  • Current project and baselining work are also relying on Nvidia tools.
  • Staying in touch with S and I is important as the baselining work continues this quarter.
  • Potential to use selected baselining tools for R’s system on InstructLab for OpenShift AI.
  • Noted PCIe bandwidth issues during performance tests on H100s, where the goal was to drive all the hardware as fast as possible. This might be relevant to monitoring the test cluster during ML workloads.

/CC @schwesig @computate @hpdempsey

@schwesig schwesig added the documentation Improvements or additions to documentation label Oct 9, 2024
@computate
Member

As a next step, I will integrate the provided Autopilot metrics and Grafana dashboards into NERC observability and then do a demo.

@schwesig
Contributor Author

Image
