Add a flag to disable the caching for the allocators #205
Conversation
(force-pushed from 6110cf4 to b8334a2)
Something doesn't look right if I disable the allocator:
Enabling some debug messages, I get:
Right, because the deallocation doesn't work that simply (whoops). I mean, with the caching enabled, one can destruct …
OK, if I follow what you mean, I think I would prefer the allocator not to "hide" this...
@fwyzard So you'd want the "freed" device memory to be available for everyone immediately after the free?
No, what I mean is that I would like the use of the allocator not to change the semantics. That is, once a block is returned to the allocator, it should be considered "available" for reuse, and if some (GPU?) code uses the memory after the (CPU?) code has returned it, bad things can happen. Of course, as long as there is enough free memory, the allocator can hold on to the block, and try to give it back only to the same CUDA stream that used it last (or whatever caching we do). But the user code should not be allowed to rely on this behaviour for correctness, only for speed optimisations. Does it make sense?
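A sketch of the hazard being described (`my_alloc`/`my_free` are hypothetical stand-ins for the allocator interface; with caching disabled they reduce to direct `cudaMalloc`/`cudaFree`):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-ins for the allocator interface; with caching
// disabled they reduce to direct cudaMalloc / cudaFree calls.
void* my_alloc(std::size_t bytes, cudaStream_t /*stream*/) {
  void* p = nullptr;
  cudaMalloc(&p, bytes);
  return p;
}
void my_free(void* p, cudaStream_t /*stream*/) { cudaFree(p); }

__global__ void produce(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 42.f;
}

void example(cudaStream_t stream) {
  constexpr int n = 1 << 20;
  float* buf = static_cast<float*>(my_alloc(n * sizeof(float), stream));
  produce<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
  // The launch is asynchronous: the kernel may still be running (or not
  // yet have started) when the buffer is returned here. With the caching
  // allocator this happens to work, because the block is held in the
  // cache; relying on that for correctness is exactly what is being
  // argued against above.
  my_free(buf, stream);
}
```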
Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at the code.
For large memory chunks (bigger than the largest bin):
For small memory chunks (up to the size of the largest bin):
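For reference, a sketch of bin arithmetic along the lines of `cub::CachingDeviceAllocator` (the parameters below are cub's documented defaults; the values actually configured here may differ):

```cpp
#include <cstddef>

// Requests are rounded up to geometric bins (cub defaults: growth 8,
// bins from 8^3 = 512 B up to 8^7 = 2 MB). Requests larger than the
// largest bin bypass the cache entirely.
std::size_t binSize(std::size_t bytes, bool& cached) {
  std::size_t size = 512;                   // smallest bin, 8^3
  constexpr std::size_t maxSize = 2097152;  // largest bin, 8^7
  while (size < bytes && size < maxSize) {
    size *= 8;
  }
  cached = (bytes <= size);
  return cached ? size : bytes;  // uncached requests keep their exact size
}
```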
Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):
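The original snippet is not reproduced here; below is a hypothetical reconstruction of the pattern being described (an event is recorded on the stream at deallocation time, and a block is handed out again only once its event has completed), with all size bookkeeping omitted for brevity:

```cpp
#include <cuda_runtime.h>
#include <deque>
#include <utility>

struct BlockCache {
  std::deque<std::pair<void*, cudaEvent_t>> freed;

  // Returning a block only records an event on the stream that used it;
  // the memory is not considered reusable yet.
  void deallocate(void* ptr, cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, stream);
    freed.emplace_back(ptr, ev);
  }

  // A block can be handed out again once its event has completed, i.e.
  // once all the work queued on the stream before the deallocation is done.
  void* tryReuse() {
    for (auto it = freed.begin(); it != freed.end(); ++it) {
      if (cudaEventQuery(it->second) == cudaSuccess) {
        void* ptr = it->first;
        cudaEventDestroy(it->second);
        freed.erase(it);
        return ptr;
      }
    }
    return nullptr;  // nothing safely reusable yet
  }
};
```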
Here the deallocation only records an event on the stream, and the block becomes reusable once that event has completed.
If the allocator is replaced by direct calls to cudaMalloc and cudaFree, this deferral is no longer available.
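A sketch of what the direct path then looks like; the host has to synchronise before every free, and cudaFree itself acts as an implicit synchronisation point, serialising otherwise concurrent streams:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Without the caching layer, the host must make sure the queued work has
// completed before freeing the buffer.
void processAndFree(cudaStream_t stream, std::size_t n) {
  float* buf = nullptr;
  cudaMalloc(&buf, n * sizeof(float));
  // ... launch kernels on `stream` using `buf` ...
  cudaStreamSynchronize(stream);  // a synchronisation on every deallocation
  cudaFree(buf);
}
```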
I think we should settle on the semantics we want, and then update the allocator to be consistent with it. I can think of three options:
I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.
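For what it's worth, semantics along the lines of option 2 are what CUDA's stream-ordered allocator (`cudaMallocAsync`/`cudaFreeAsync`, available since CUDA 11.2) provides natively; a minimal sketch:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// cudaFreeAsync queues the deallocation on the stream, so the memory
// becomes reusable in stream order, without the host synchronising on
// every free.
void processEvent(cudaStream_t stream, std::size_t n) {
  float* buf = nullptr;
  cudaMallocAsync(reinterpret_cast<void**>(&buf), n * sizeof(float), stream);
  // ... launch kernels on `stream` using `buf` ...
  cudaFreeAsync(buf, stream);  // deferred until the queued work completes
}
```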
A few more things to consider:
@fwyzard I agree with your assessment in #205 (comment). On your questions:
I was thinking of adding a caching managed allocator and trying that out. (Should we move the parts of the discussion that are not exactly about a "simple way to disable caching for testing" back to #138?)
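If "managed" here means unified memory, a minimal sketch of the underlying allocation that such an allocator would cache (the helper name is hypothetical):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Managed (unified) memory: one pointer valid on both host and device.
// A caching managed allocator would recycle these blocks the same way
// the device allocator recycles cudaMalloc'ed ones.
float* allocateManaged(std::size_t n) {
  float* buf = nullptr;
  cudaMallocManaged(&buf, n * sizeof(float));
  return buf;
}
```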
Sure... I'll copy my comments there.
(force-pushed from 1d2607f to ace925f)
Rebased. With a handful of tests (with multiple threads, EDM streams, CUDA devices) I have not been able to reproduce #205 (comment).
Indeed, I have not run into any problems testing today, either on a pair of P100s or a pair of V100s, with either the "profile" workflow on data or the "step3" workflow on TTbar.
On a pair of P100:
On a pair of V100:
Following the discussion in #172, this PR adds a flag to `CUDAService` to disable the caching for the allocators (to demonstrate how bad event-by-event `cudaMalloc` + `cudaFree` would be). The default behaviour is unchanged; the caching can be disabled with `process.CUDAService.allocator.enabled = False`.