py3nvml measures reserved and not used memory #31

fxmarty · 2023-08-18T11:13:45Z

Checking with nvidia-smi -q shows that what py3nvml reports as used memory is not actually the used memory.

This can skew the measurement, since memory can also be reserved by other processes; see e.g. https://forums.developer.nvidia.com/t/freeing-up-some-of-the-reserved-memory/257814

An alternative is to use

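import subprocess
command = "nvidia-smi --query-gpu=memory.used --format=csv --id=0"
output = subprocess.check_output(command.split()).decode('ascii')
gpu_mem_mib = int(output.split('\n')[1].split()[0])  # line 0 is the CSV header, line 1 looks like "1234 MiB"
gpu_mem_mb = gpu_mem_mib * 1.048576  # nvidia-smi reports MiB; convert to MB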
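
For reuse, the snippet above could be wrapped roughly as follows. This is only a sketch: the function names `parse_used_memory_mb` and `get_used_memory_mb` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and prints the usual two-line CSV (a header, then a value such as `1234 MiB`). Parsing is split from the subprocess call so it can be checked without a GPU.

```python
import subprocess

MIB_TO_MB = 1.048576  # 1 MiB = 1,048,576 bytes = 1.048576 MB


def parse_used_memory_mb(csv_output: str) -> float:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output into MB."""
    lines = [line.strip() for line in csv_output.splitlines() if line.strip()]
    # lines[0] is the CSV header; lines[1] is assumed to look like "1234 MiB".
    value_mib = int(lines[1].split()[0])
    return value_mib * MIB_TO_MB


def get_used_memory_mb(device_id: int = 0) -> float:
    """Query nvidia-smi for the used memory of one GPU (hypothetical helper)."""
    command = [
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv",
        f"--id={device_id}",
    ]
    output = subprocess.check_output(command).decode("ascii")
    return parse_used_memory_mb(output)
```

Note that this still reports device-wide usage, so memory held by other processes on the same GPU is included.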


IlyasMoutawwakil · 2023-08-18T11:54:04Z

Nice catch!
The nvidia-smi method might be better tbh since it will also allow for better peak memory capturing.

fxmarty · 2023-08-18T12:18:23Z

It is a bit tricky really, because I guess (not sure) that what matters for OOM is the reserved memory. But I did not find a straightforward way to get the memory reserved by a given PID.

Interesting related metrics could be the "maximum usable batch size" or "maximum usable sequence length", if that makes sense. These would require us to try/catch OOM errors.
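
A rough sketch of how a "maximum usable batch size" could be found by catching OOM errors is below. It is not something the project implements as far as this thread shows: `find_max_batch_size` and `run_with_batch_size` are hypothetical names, and the binary search assumes that if a batch size fits, every smaller one fits too. The exception type is a parameter; with PyTorch one would presumably pass `torch.cuda.OutOfMemoryError` and call `torch.cuda.empty_cache()` between attempts.

```python
from typing import Callable, Tuple, Type


def find_max_batch_size(
    run_with_batch_size: Callable[[int], None],
    upper_bound: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> int:
    """Return the largest batch size in [1, upper_bound] that runs, or 0 if none does.

    Assumes monotonicity: if a batch size fits, all smaller ones fit as well.
    """
    low, high = 0, upper_bound  # `low` is known to fit (0 trivially); sizes above `high` fail
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        try:
            run_with_batch_size(mid)
            low = mid  # mid fits, search larger sizes
        except oom_errors:
            high = mid - 1  # mid does not fit, search smaller sizes
    return low
```

The same search would work for "maximum usable sequence length" by letting the callable interpret its argument as a length instead.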

IlyasMoutawwakil · 2023-11-03T11:18:58Z

We are now reporting allocated (torch), reserved (torch) and "used" (pynvml) memory.
I also kept peak_memory in the results file with a deprecation error.
Also, I should probably investigate / switch to the official bindings: https://pypi.org/project/nvidia-ml-py/
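
For illustration, collecting the three numbers could look roughly like the sketch below. It is not the project's actual code: `collect_memory_stats`, `bytes_to_mb` and the dictionary keys are invented here, and it assumes `torch` with CUDA plus a package providing the `pynvml` module (such as nvidia-ml-py) are installed.

```python
def bytes_to_mb(num_bytes: int) -> float:
    """Convert bytes to megabytes (1 MB = 10**6 bytes)."""
    return num_bytes / 1e6


def collect_memory_stats(device_index: int = 0) -> dict:
    """Report allocated/reserved (PyTorch) and used (NVML) memory in MB."""
    # Imported lazily so this module can be loaded on machines without a GPU stack.
    import pynvml
    import torch

    # Memory held by live tensors vs. memory held by PyTorch's caching allocator.
    allocated = torch.cuda.memory_allocated(device_index)
    reserved = torch.cuda.memory_reserved(device_index)

    # Device-wide usage as seen by the driver; includes other processes.
    # Caveat: NVML indices can differ from torch indices if CUDA_VISIBLE_DEVICES is set.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        pynvml.nvmlShutdown()

    return {
        "allocated_mb": bytes_to_mb(allocated),  # hypothetical key names
        "reserved_mb": bytes_to_mb(reserved),
        "used_mb": bytes_to_mb(used),
    }
```

Keeping the three values separate makes the gap between what tensors occupy, what the allocator holds, and what the driver sees visible in the results.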

IlyasMoutawwakil · 2023-11-29T05:35:37Z

solved in #81 by reporting all three memory types: allocated (pytorch), reserved (pytorch) and used memory (pynvml).

IlyasMoutawwakil closed this as completed Nov 29, 2023
