How should GPU energy use be estimated? #37

Open
Tracked by #38 ...
adrianco opened this issue Apr 9, 2024 · 5 comments
@adrianco
Contributor

adrianco commented Apr 9, 2024

Outline Action Item Details

We have a reasonable handle on CPU energy use: take CPU utilization and map it to an energy curve driven by the Thermal Design Power (TDP) of the package, which is sometimes the only public data available. GPUs are becoming more common and have a higher TDP than CPUs, but we don't have an easy or standard way to measure GPU utilization in a system. I propose reaching out to contacts at NVIDIA to see if we can find some answers, and encouraging them to join GSF.
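
For illustration, the CPU approach described above can be sketched as follows. This is a minimal sketch, not a reference implementation: the curve points and the function name `estimate_power_watts` are hypothetical, and a real curve would come from published or measured data for the specific package.

```python
from bisect import bisect_right

# Hypothetical power curve: (utilization fraction, fraction of TDP drawn).
# These points are illustrative assumptions, not measured values.
CURVE = [(0.0, 0.1), (0.5, 0.7), (1.0, 1.0)]


def estimate_power_watts(utilization, tdp_watts, curve=CURVE):
    """Estimate power by linearly interpolating the curve and scaling by TDP."""
    # Clamp utilization to the range covered by the curve.
    u = min(max(utilization, 0.0), 1.0)
    xs = [x for x, _ in curve]
    # Index of the right-hand point of the segment containing u.
    i = min(bisect_right(xs, u), len(curve) - 1)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    fraction_of_tdp = y0 + (y1 - y0) * (u - x0) / (x1 - x0)
    return fraction_of_tdp * tdp_watts
```

With these assumed points, a 200 W package at 25% utilization is estimated at about 80 W. Multiplying by the length of the interval gives energy.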

Issue dependency with other WGs Groups

No response

@adrianco adrianco self-assigned this Apr 9, 2024
@adrianco
Contributor Author

Scaphandre has some discussion and a TODO for GPU measurement
hubblo-org/scaphandre#24

NVIDIA data is available for later-model and datacenter-class GPUs, but not for some desktop models. This data source is reported as available on NVIDIA-based cloud instances on AWS.
https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g7ef7dff0ff14238d08a19ad7fb23fc87

The value is an integer number of milliwatts, averaged over a one-second interval.
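
A rough sketch of turning those readings into energy, assuming each reading represents the average power over the sampling interval. The helper names `integrate_energy_joules` and `nvml_power_reader` are hypothetical. The NVML calls go through the `pynvml` Python bindings, which need an NVIDIA driver, so they are imported lazily and only run if you call `nvml_power_reader`.

```python
import time


def integrate_energy_joules(read_power_mw, duration_s, interval_s=1.0, sleep=time.sleep):
    """Sum power samples (milliwatts) taken every interval_s into joules."""
    energy_j = 0.0
    samples = max(1, round(duration_s / interval_s))
    for _ in range(samples):
        # mW -> W, then W * s = J for this interval.
        energy_j += read_power_mw() / 1000.0 * interval_s
        sleep(interval_s)
    return energy_j


def nvml_power_reader(gpu_index=0):
    """Return a zero-argument reader backed by NVML (assumes pynvml is installed)."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # nvmlDeviceGetPowerUsage returns the power draw in milliwatts as an integer.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle)
```

On a machine with a supported GPU, `integrate_energy_joules(nvml_power_reader(0), duration_s=60)` would give an approximate energy figure for one minute. Divide joules by 3600 to get watt-hours.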

@rootfs

rootfs commented May 2, 2024

Kepler currently supports NVIDIA GPUs (through both NVML and DCGM) and is also working on Intel Gaudi GPU support.

We have a recent tutorial on using Kepler to measure LLM energy consumption and to evaluate sustainability in terms of tokens per watt.
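
As a note on units: tokens per second divided by watts is the same as tokens per joule, so the metric can be computed from a token count and a measured energy total. The sketch below assumes that reading of the metric; the tutorial's exact definition is not given here, and `tokens_per_joule` is a hypothetical name.

```python
def tokens_per_joule(tokens_generated, energy_joules):
    """Tokens produced per joule of measured energy.

    Equivalent to (tokens / second) / watts, since 1 W = 1 J/s.
    """
    if energy_joules <= 0:
        raise ValueError("energy_joules must be positive")
    return tokens_generated / energy_joules
```

For example, 1200 tokens generated while the GPU consumed 6000 J works out to 0.2 tokens per joule.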

@marceloamaral

As @rootfs mentioned, in Kepler we use the NVML library to collect both the per-process GPU utilization and the total GPU power consumption. We then distribute the total GPU power consumption among all processes using the GPU, in proportion to their utilization.
In Multi-Instance GPU (MIG) scenarios, the calculation method varies a little. Kepler uses the DCGM metrics to determine MIG slice utilization and distributes the total GPU power among the MIG slices accordingly.
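
The proportional split described above could look roughly like this. It is a sketch under assumptions, not Kepler's actual code: `attribute_gpu_power` is a hypothetical name, and how idle power is handled is not stated in this thread, so the sketch simply attributes nothing when no consumer reports any utilization.

```python
def attribute_gpu_power(total_power_watts, utilization_by_consumer):
    """Split total GPU power across consumers in proportion to utilization.

    utilization_by_consumer maps an identifier (a process ID, or a MIG slice
    ID in the MIG case) to its utilization in any consistent unit.
    """
    total_utilization = sum(utilization_by_consumer.values())
    if total_utilization == 0:
        # Assumption: with no reported activity, no power is attributed.
        return {key: 0.0 for key in utilization_by_consumer}
    return {
        key: total_power_watts * utilization / total_utilization
        for key, utilization in utilization_by_consumer.items()
    }
```

For example, with 300 W total and two processes at 60% and 20% utilization, the split is 225 W and 75 W. The same function applies to MIG slices if the keys are slice identifiers.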

@TheElectronWill

TheElectronWill commented May 15, 2024

Hello! I've stumbled on this issue from the Scaphandre repository.

After trying to extend Scaphandre to support GPUs, I eventually started "from scratch" and designed a new measurement tool, Alumet (though Alumet is not "just" a tool for measuring energy consumption). As the Kepler team mentioned, NVML can report the energy consumption of most NVIDIA GPUs, as well as the GPU utilization of individual processes, and it works quite well. Measuring is better than relying on TDP-based estimates anyway. IMO that should be enough to start building some models :)

@adrianco
Contributor Author

adrianco commented Jul 2, 2024

A link to Alumet has been added to the Miro board. It appears that NVIDIA power monitoring is well understood. The next step is to figure out the interfaces for Intel, AMD, Google TPU, AWS Inferentia, etc.
