
Add a flag to disable the caching for the allocators #205

Merged

Conversation

makortel

Following the discussion in #172, this PR adds a flag to CUDAService to disable the caching for the allocators (to demonstrate how bad event-by-event cudaMalloc+cudaFree would be). The default behavior is unchanged; the caching can be disabled with process.CUDAService.allocator.enabled = False.
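
For illustration, a minimal sketch of the two allocation paths the flag conceptually switches between; the names allocateDevice and cachingAllocator are hypothetical and not the actual CUDAService interface:

// Hypothetical sketch of what the flag toggles: cached reuse vs. a fresh cudaMalloc per request.
void* allocateDevice(size_t bytes, cudaStream_t stream, bool cachingEnabled) {
  void* ptr = nullptr;
  if (cachingEnabled) {
    // reuse a previously "freed" chunk from the pool, associated with this CUDA stream
    ptr = cachingAllocator().allocate(bytes, stream);
  } else {
    // pay the full cost of a device allocation (and later a cudaFree) on every call
    cudaMalloc(&ptr, bytes);
  }
  return ptr;
}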

@fwyzard fwyzard changed the base branch from CMSSW_10_4_X_Patatrack to CMSSW_10_4_X November 29, 2018 12:59
@fwyzard fwyzard changed the base branch from CMSSW_10_4_X to CMSSW_10_4_X_Patatrack November 29, 2018 12:59
@fwyzard

fwyzard commented Nov 29, 2018

Something doesn't look right if I disable the allocator:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 5 at 29-Nov-2018 14:06:41.870 CET
Begin processing the 101st record. Run 321177, Event 187660569, LumiSection 142 on stream 2 at 29-Nov-2018 14:06:49.634 CET
Begin processing the 201st record. Run 321177, Event 188192711, LumiSection 142 on stream 6 at 29-Nov-2018 14:06:50.127 CET
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHits.cu, line 159: cudaErrorInvalidConfiguration: invalid configuration argument


A fatal system signal has occurred: abort signal
...

Enabling some debug messages, I get:

...
hitbuilder: 24 clusters in module 1701. will write at 5547
hitbuilder: 8 clusters in module 101. will write at 561
launching getHits kernel for 801 blocks
hitbuilder: 9 clusters in module 1601. will write at 4851
hitbuilder: 4 clusters in module 1001. will write at 1121
hitbuilder: 16 clusters in module 1201. will write at 1426
hitbuilder: 9 clusters in module 1301. will write at 2118
hitbuilder: 15 clusters in module 1801. will write at 6666
hitbuilder: 6 clusters in module 1501. will write at 3888
hitbuilder: 8 clusters in module 1701. will write at 5787
hitbuilder: 9 clusters in module 1401. will write at 2960
launching getHits kernel for 295 blocks
launching getHits kernel for 0 blocks
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHits.cu, line 159: cudaErrorInvalidConfiguration: invalid configuration argument
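
Aside: a kernel launch with a zero grid dimension, like the "0 blocks" getHits launch above, is rejected by the runtime with exactly this error. A minimal standalone reproduction, independent of this PR's code:

// Launching a kernel with gridDim.x == 0 yields cudaErrorInvalidConfiguration.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}

int main() {
  noop<<<0, 32>>>();                             // zero blocks: invalid launch configuration
  cudaError_t err = cudaGetLastError();          // the error is reported at launch time
  std::printf("%s\n", cudaGetErrorString(err));  // prints "invalid configuration argument"
  return 0;
}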

@makortel
Author

Right, because the deallocation doesn't work that simply (whoops). I mean, with the caching enabled, one can destruct a device::unique_ptr even if the device memory is still being used (it won't be re-used until a CUDA event, recorded on the CUDA stream at the point of the unique_ptr destructor, has occurred). I need to think a bit more...
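
A minimal sketch of the mechanism described above, i.e. recording a CUDA event at the point of "free" and handing the chunk out again only once that event has completed; the struct and function names are illustrative, not the actual allocator code:

// Illustrative: defer real reuse of a "freed" chunk until the GPU work queued on its stream has drained.
struct CachedBlock {
  void* ptr = nullptr;
  size_t bytes = 0;
  cudaEvent_t readyEvent{};  // recorded on the stream at the point of "free"
};

void cachedFree(CachedBlock& block, cudaStream_t stream) {
  // Work already queued on the stream may still touch block.ptr;
  // record an event so we know when that work has completed.
  cudaEventCreateWithFlags(&block.readyEvent, cudaEventDisableTiming);
  cudaEventRecord(block.readyEvent, stream);
  // ... return the block to the cache instead of calling cudaFree ...
}

bool blockIsReusable(const CachedBlock& block) {
  // cudaSuccess means everything captured by the event has finished on the device.
  return cudaEventQuery(block.readyEvent) == cudaSuccess;
}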

@fwyzard

fwyzard commented Nov 29, 2018

OK, if I follow what you mean, I think I would prefer the allocator not to "hide" this...

@makortel
Author

@fwyzard So you'd want the "freed" device memory to be available for everyone immediately after the free?

@fwyzard

fwyzard commented Nov 30, 2018

No, what I mean is that I would like the use of the allocator not to change the malloc/free semantics.

That is, once a block is returned to the allocator, it should be considered "available" for reuse, and if some (GPU ?) code uses the memory after the (CPU ?) code has returned it, bad things can happen.

Of course, as long as there is enough free memory, the allocator can hold on to the block, and try to give it back only to the same CUDA stream that used it last (or whatever caching we do).

But, the user code should not be allowed to rely on this behaviour for correctness, only for speed optimisations.

Does it make sense ?

@fwyzard

fwyzard commented Nov 30, 2018

Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at the code for the cub::CachingDeviceAllocator.

For large memory chunks (bigger than the largest bin):

  • allocations are synchronous, using cudaMalloc
  • deallocations are synchronous, using cudaFree

For small memory chunks (up to the size of the largest bin):

  • allocations can be synchronous, using cudaMalloc or memory returned to the pool from an idle CUDA stream, or asynchronous, reusing memory "freed" in the current CUDA stream
  • deallocations are asynchronous, marking the memory as reusable by the current CUDA stream

Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):

dev_mem = allocate(1024, stream);
kernel<<<1,1, 0, stream>>>(dev_mem);
cudaMemcpyAsync(host_mem, dev_mem, stream);
free(dev_mem);
dev_mem2 = allocate(1024, stream);
kernel2<<<1,1, 0, stream>>>(dev_mem2);

Here free(dev_mem) marks the chunk used by dev_mem as "available within the stream stream".
dev_mem2 is likely to receive a pointer to the same chunk.
Launching kernel2 using dev_mem2 is assumed to be fine, because the operations associated with the CUDA stream stream are serialised:

  • execute kernel that possibly writes to the chunk via dev_mem;
  • read the memory via dev_mem;
  • execute kernel2 that possibly writes to the chunk via dev_mem2.

If the allocator is replaced by direct calls to cudaMalloc/cudaFree, the behaviour changes, and the assumption is no longer valid.
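
For concreteness, a standalone version of the pseudocode above, written directly against cub::CachingDeviceAllocator (a sketch for illustration, not code from this PR):

// Stream-ordered reuse with cub::CachingDeviceAllocator: DeviceFree does not call cudaFree,
// it marks the chunk as reusable, so the next DeviceAllocate on the same stream can recycle it.
#include <cub/util_allocator.cuh>
#include <cuda_runtime.h>

__global__ void kernel(char* buf)  { buf[0] = 1; }
__global__ void kernel2(char* buf) { buf[0] = 2; }

int main() {
  constexpr size_t bytes = 1024;
  cub::CachingDeviceAllocator allocator;

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  char* host_mem;
  cudaMallocHost(&host_mem, bytes);  // pinned host memory, needed for a truly asynchronous copy

  void* dev_mem;
  allocator.DeviceAllocate(&dev_mem, bytes, stream);
  kernel<<<1, 1, 0, stream>>>(static_cast<char*>(dev_mem));
  cudaMemcpyAsync(host_mem, dev_mem, bytes, cudaMemcpyDeviceToHost, stream);
  allocator.DeviceFree(dev_mem);  // chunk becomes "available within stream", no cudaFree here

  void* dev_mem2;
  allocator.DeviceAllocate(&dev_mem2, bytes, stream);          // likely returns the same chunk
  kernel2<<<1, 1, 0, stream>>>(static_cast<char*>(dev_mem2));  // safe: the stream serialises the work
  allocator.DeviceFree(dev_mem2);

  cudaStreamSynchronize(stream);
  cudaFreeHost(host_mem);
  cudaStreamDestroy(stream);
  return 0;
}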


@fwyzard

fwyzard commented Dec 2, 2018

I think we should settle on the semantics we want, and then update the allocator to make it consistent with them. I can think of three options:

  1. synchronous: memory operations on the device are synchronous with the host: memory is allocated and deallocated immediately (à la cudaMalloc()/cudaFree())
  2. asynchronous: memory operations on the device are asynchronous from the host, and synchronous with a CUDA stream (or other constructs like CUDA graphs)
  3. lazy: memory allocations on the device are synchronous with the host (cudaMalloc()), while memory deallocations are asynchronous from the host, and synchronous with a CUDA stream.

The cub::CachingDeviceAllocator is doing a mixture of all three: 1. for large memory chunks, 2. for small memory chunks recycled from the pool, 3. for small memory chunks when the allocation pool is grown.

I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.
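
For reference, option 2 is essentially what later CUDA releases (11.2 and newer, so not available at the time of this PR) expose natively as stream-ordered memory allocation; a minimal sketch of that semantic:

// Option 2 as a native API: allocation and deallocation are ordered with respect to the stream.
#include <cuda_runtime.h>

__global__ void fill(float* buf) { buf[threadIdx.x] = 1.f; }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float* d_buf;
  cudaMallocAsync(&d_buf, 256 * sizeof(float), stream);  // ordered after prior work in the stream
  fill<<<1, 256, 0, stream>>>(d_buf);
  cudaFreeAsync(d_buf, stream);  // the memory is released only once the kernel has completed

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}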

@fwyzard

fwyzard commented Dec 2, 2018

A few more things to consider:

  • do we want to be able to allocate and deallocate memory from device code ?
  • do we want to be able to allocate memory on the host, and deallocate it from the device (or the opposite) ?
  • do we want to support managed memory ?

@makortel
Author

makortel commented Dec 3, 2018

@fwyzard I agree with your assessment in #205 (comment). On your questions

  • do we want to support managed memory ?

I was thinking of adding a caching managed allocator and trying that out.
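
For context, the non-caching baseline such a managed allocator would wrap looks roughly like the following (a sketch, not the proposed implementation):

// Managed (unified) memory baseline: one pointer usable from both host and device.
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.f;
}

int main() {
  constexpr int n = 1024;
  float* data;
  cudaMallocManaged(&data, n * sizeof(float));  // accessible from host and device code
  for (int i = 0; i < n; ++i) data[i] = float(i);

  scale<<<(n + 255) / 256, 256>>>(data, n);
  cudaDeviceSynchronize();  // make the device writes visible to the host again

  cudaFree(data);  // managed memory is released with cudaFree
  return 0;
}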

(Should we move the parts of the discussion that are not strictly about a "simple way to disable caching for testing" back to #138 ?)

@fwyzard

fwyzard commented Dec 3, 2018

Sure... I'll copy my comments there.

@makortel
Author

makortel commented Jan 2, 2019

Rebased on top of 10_4_0_pre3_Patatrack.

With a handful of tests (with multiple threads, EDM streams, and CUDA devices) I have not been able to reproduce #205 (comment).

@fwyzard fwyzard added this to the CMSSW_10_4_X_Patatrack milestone Jan 8, 2019
@fwyzard

fwyzard commented Jan 9, 2019

Indeed, I have not run into any problems testing today, neither on a pair of P100s nor on a pair of V100s, and neither with the "profile" workflow on data nor with the "step3" workflow on TTbar.

@fwyzard

fwyzard commented Jan 9, 2019

On a pair of P100:

$ multirun.py profile.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
  1904.3 ±   1.5 ev/s (4000 events, 98.9% overlap)
  1903.8 ±   1.3 ev/s (4000 events, 98.6% overlap)
  1904.5 ±   1.7 ev/s (4000 events, 98.8% overlap)
  1904.8 ±   1.7 ev/s (4000 events, 98.8% overlap)

$ multirun.py profile-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
   508.9 ±   0.2 ev/s (4000 events, 99.8% overlap)
   510.3 ±   0.2 ev/s (4000 events, 99.6% overlap)
   510.2 ±   0.2 ev/s (4000 events, 99.9% overlap)
   507.6 ±   0.2 ev/s (4000 events, 99.9% overlap)

$ multirun.py step3.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.3 ±   0.0 ev/s (300 events, 99.2% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.9% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.6% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.9% overlap)

$ multirun.py step3-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.8% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)

On a pair of V100:

$ multirun.py profile.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
  2814.7 ±   2.0 ev/s (4000 events, 99.4% overlap)
  2818.8 ±   2.4 ev/s (4000 events, 99.6% overlap)
  2814.9 ±   3.1 ev/s (4000 events, 99.5% overlap)
  2814.1 ±   3.0 ev/s (4000 events, 99.1% overlap)

$ multirun.py profile-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
   518.5 ±   0.2 ev/s (4000 events, 99.3% overlap)
   516.0 ±   0.2 ev/s (4000 events, 99.0% overlap)
   518.7 ±   0.2 ev/s (4000 events, 99.7% overlap)
   508.9 ±   0.1 ev/s (4000 events, 98.8% overlap)

$ multirun.py step3.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.8 ±   0.0 ev/s (300 events, 99.6% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.6% overlap)
     3.9 ±   0.0 ev/s (300 events, 98.3% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.4% overlap)

$ multirun.py step3-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.8 ±   0.0 ev/s (300 events, 99.8% overlap)
     3.9 ±   0.0 ev/s (300 events, 98.7% overlap)
     3.8 ±   0.0 ev/s (300 events, 98.7% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.7% overlap)

@fwyzard fwyzard merged commit 59fe318 into cms-patatrack:CMSSW_10_4_X_Patatrack Jan 9, 2019
@fwyzard fwyzard added fixed and removed later labels Jan 9, 2019