py3nvml measures reserved and not used memory #31
Nice catch!
It is a bit tricky, really: I suspect (though I'm not sure) that what matters for OOM is the reserved memory, but I did not find a straightforward way to get the reserved memory for a given PID. Interesting related metrics could be the "maximum usable batch size" or "maximum usable sequence length", if that makes sense — which would require us to catch OOM errors in a try/except.
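The OOM-probing idea above could be sketched as a binary search over batch sizes. This is a minimal, GPU-free sketch: `run_forward` is a hypothetical stand-in for a real forward pass (on a real setup it would run the model and raise `torch.cuda.OutOfMemoryError` past some batch size); here a `SIMULATED_LIMIT` fakes that behavior so the search logic can be shown on its own.

```python
# Hedged sketch: probing the "maximum usable batch size" by catching OOM errors.
# `run_forward` and SIMULATED_LIMIT are illustrative assumptions, not real APIs.

SIMULATED_LIMIT = 48  # assumption: batches above this size "OOM" in this sketch

def run_forward(batch_size: int) -> None:
    """Pretend forward pass: raises MemoryError beyond the simulated limit."""
    if batch_size > SIMULATED_LIMIT:
        raise MemoryError(f"simulated CUDA OOM at batch size {batch_size}")

def max_usable_batch_size(upper: int) -> int:
    """Binary-search the largest batch size whose forward pass does not OOM."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            run_forward(mid)
            lo = mid          # mid fits; search for something larger
        except MemoryError:
            hi = mid - 1      # mid OOMs; search below it
    return lo

print(max_usable_batch_size(1024))  # → 48
```

On a real GPU the probe would also need to free cached memory between attempts (e.g. `torch.cuda.empty_cache()`), since a failed allocation can leave the caching allocator in a fragmented state.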
We are now reporting allocated (torch), reserved (torch), and "used" (pynvml) memory.
Solved in #81 by reporting all three memory types: allocated (PyTorch), reserved (PyTorch), and used memory (pynvml).
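For reference, the three readings typically come from `torch.cuda.memory_allocated()`, `torch.cuda.memory_reserved()`, and pynvml's `nvmlDeviceGetMemoryInfo(handle).used`, and usually satisfy allocated ≤ reserved ≤ used, since the driver's "used" figure includes the CUDA context and any non-PyTorch allocations on top of the caching allocator's reservation. A GPU-free sketch of that relationship, with made-up byte counts (not real measurements):

```python
# Hedged sketch of the three-value report, with illustrative (made-up) numbers.
# On a real GPU, `allocated` would come from torch.cuda.memory_allocated(),
# `reserved` from torch.cuda.memory_reserved(), and `used` from pynvml's
# nvmlDeviceGetMemoryInfo(handle).used.

def memory_report(allocated: int, reserved: int, used: int) -> dict:
    """Bundle the three readings and sanity-check the usual ordering."""
    # The caching allocator reserves at least what it hands out, and the
    # driver-level "used" adds context overhead on top of the reservation.
    assert allocated <= reserved <= used
    return {"allocated": allocated, "reserved": reserved, "used": used}

# Illustrative values only.
report = memory_report(allocated=1_200_000_000,
                       reserved=1_500_000_000,
                       used=2_100_000_000)
print(report["used"] - report["reserved"])  # non-PyTorch overhead in this sketch
```

The gap between "used" and "reserved" is exactly the part that pynvml sees but PyTorch does not — which is why reporting all three is more informative than any single number.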
We can check with
nvidia-smi -q
that what py3nvml claims to be the used memory actually is not. See fbcotter/py3nvml#25
This can skew the measurement, since memory can be reserved by other processes; see e.g. https://forums.developer.nvidia.com/t/freeing-up-some-of-the-reserved-memory/257814
An alternative is to use