[cudadev] Improve caching allocator performance #218

Open
1 of 4 tasks
makortel opened this issue Sep 3, 2021 · 5 comments

makortel commented Sep 3, 2021

The generalization of the caching allocator in #216 makes it easier to implement various improvements to the caching allocator. #211 (comment) shows a measurement suggesting that the mutex in the caching allocator is the bottleneck (my studies ~2 years ago pointed more to the mutex in CUDA, but things seem to have evolved). This issue is to discuss improvement ideas, with an (ordered) plan shown below.


fwyzard commented Sep 4, 2021

> * Improve the interaction of `ScopedContext` and the caching allocator by having the `ScopedContext` pass `SharedEventPtr` to the caching allocator (evolution of ideas in

IIUC this will make the caching allocator very tightly bound to the CMS-specific "CUDA framework", right?


makortel commented Sep 5, 2021

> > * Improve the interaction of `ScopedContext` and the caching allocator by having the `ScopedContext` pass `SharedEventPtr` to the caching allocator (evolution of ideas in
>
> IIUC this will make the caching allocator very tightly bound to the CMS-specific "CUDA framework", right?

The caching allocator would get a technical dependence on cms::cuda::SharedEventPtr, and a "behavioral dependence" in that it would be up to the caller of the allocator to provide the SharedEventPtr. (In principle it would be possible to also have an API without the SharedEventPtr, but that would add a dependence on cms::cuda::EventCache.)

The make_device_unique and make_host_unique would get a dependence on some more abstract version of the Context classes (in CMSSW this would still reside in HeterogeneousCore/CUDAUtilities).

Together these would allow:

  • substantially reducing the number of CUDA events being used (down to 1-2 per EDModule)
  • making the host allocator's behavior towards CUDA events saner (no need to explicitly destroy and re-create the events)
  • letting both the device and the host allocator make the release of a memory block depend on multiple CUDA events
    • this would be very beneficial especially for the host allocator, in order to improve the ESProducer API and retain support for multiple devices
  • (in CMSSW) moving the ownership of the caching allocator objects (and of StreamCache and EventCache as well) from global objects (back) to CUDAService
    • this would make it easier to destruct them in a specific order
    • this would also allow the parameters of the caching allocators to be specified (again) in the configuration
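
A minimal sketch of how such an event-aware allocation interface could look (illustrative names and signature, not the actual cudadev code; `SharedEventPtr` is assumed here to be a shared_ptr owning a `cudaEvent_t`, as in CMSSW):

```cpp
// Sketch only: an allocator interface where the caller supplies the CUDA event
// that decides when a block may return to the global pool, instead of the
// allocator creating and recording one event per block internally.
#include <cstddef>
#include <memory>
#include <type_traits>

#include <cuda_runtime.h>

namespace cms::cuda {
  // Assumed to mirror CMSSW's SharedEventPtr: shared ownership of a cudaEvent_t.
  using SharedEventPtr = std::shared_ptr<std::remove_pointer_t<cudaEvent_t>>;

  class CachingDeviceAllocator {
  public:
    // The returned block may be reused immediately on `stream`; it moves to the
    // global ("free for all") pool only once `event`, recorded by the caller
    // (e.g. by the ScopedContext at the end of produce()), has completed.
    void* allocate(int device, std::size_t bytes, cudaStream_t stream, SharedEventPtr event);

    // Free is host-side only; when the block becomes reusable is driven by `event`.
    void free(void* ptr);
  };
}  // namespace cms::cuda
```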


ericcano commented Sep 7, 2021

Like @fwyzard, I am worried about extending the scope of the allocator, and would prefer a hierarchy of concerns, with base classes solving limited problems and higher-level classes building on top of them (although this is not always achievable).

There are a few costly features right now in the allocator:

  • Everything is protected by a big, fat lock at the very top
  • the free operation is done by address (+ device id) and requires a full search among the live blocks to find the matching one
  • the free operation is a two-step process: free for the thread, then global free. The transition from free-for-thread to free-for-all is asynchronous and device driven.
    • a partial (stream-local) free requires a call to cudaEventRecord
    • the complete free can then be checked by polling the candidate blocks with cudaEventQuery; the number of calls to cudaEventQuery is hard to control
    • the current allocator does not preferentially reuse blocks freed on the local stream (re-allocating locally would be better, as such a block is not yet available to other threads)
  • As @makortel and @fwyzard mentioned, we can cut the problem into multiple hierarchical containers with local locks (possibly read/write, or even upgradable like Boost's upgrade_mutex); see the sketch below.
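
To make the last point concrete, here is a minimal sketch (illustrative names and structure, not the current allocator) of splitting the single big lock into per-size-bin locks, so that only requests hitting the same bin contend:

```cpp
// Sketch only: replacing the single allocator-wide mutex with one mutex per
// power-of-two size bin, so threads allocating different sizes do not contend.
// A real allocator would additionally keep per-device and per-stream state.
#include <cstddef>
#include <mutex>
#include <vector>

struct CachedBlock {
  void* ptr;
  std::size_t bytes;
};

class BinnedFreeList {
public:
  // Try to reuse a cached block from `bin`; only callers asking for the same
  // bin contend on this mutex.
  bool tryReuse(std::size_t bin, CachedBlock& out) {
    Bin& b = bins_[bin];
    std::lock_guard<std::mutex> lock(b.mtx);
    if (b.blocks.empty())
      return false;
    out = b.blocks.back();
    b.blocks.pop_back();
    return true;
  }

  // Return a block to its bin (e.g. at the "global free" transition).
  void release(std::size_t bin, CachedBlock block) {
    Bin& b = bins_[bin];
    std::lock_guard<std::mutex> lock(b.mtx);
    b.blocks.push_back(block);
  }

private:
  struct Bin {
    std::mutex mtx;
    std::vector<CachedBlock> blocks;
  };
  static constexpr std::size_t kNumBins = 40;  // bin i holds blocks of roughly 2^i bytes
  Bin bins_[kNumBins];
};
```

Finer-grained variants (per-device or per-stream containers, read/write locks for lookup-heavy paths) would follow the same pattern.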

A run of cudadev (before applying SoA, with 10k events, 16 threads, 24 EDM streams) gives:

 Time(%)  Total Time (ns)  Num Calls    Average     Minimum    Maximum       StdDev                    Name
 -------  ---------------  ---------  ------------  -------  -----------  ------------  ----------------------------------
    38.3   12,226,942,261    480,004      25,472.6      456    8,463,443     135,263.0  cudaEventRecord
    29.7    9,473,463,780    410,000      23,106.0    3,110    1,602,429      71,967.4  cudaLaunchKernel
    14.6    4,645,436,630     70,000      66,363.4    1,201   10,120,278     422,502.6  cudaEventSynchronize
     6.0    1,929,961,064    100,010      19,297.7    2,618    1,930,767      60,743.5  cudaMemcpyAsync
     4.4    1,411,087,349    542,793       2,599.7      418      558,245       6,032.6  cudaEventQuery
     2.4      776,820,900     10,000      77,682.1    5,175    6,636,408     261,967.3  cudaStreamAddCallback
     1.4      453,200,745     20,000      22,660.0    3,591      876,043      55,909.3  cudaMemsetAsync
     1.4      440,901,767        687     641,778.4    1,885   23,329,175   1,797,406.9  cudaMalloc
     0.7      213,362,178         48   4,445,045.4    5,243   31,634,080   7,169,447.1  cudaStreamCreateWithFlags
     0.5      154,312,866          3  51,437,622.0   12,211  154,271,346  89,056,617.8  cudaMemGetInfo
     0.2       78,315,362        781     100,275.8      437   12,439,093     623,922.2  cudaEventCreateWithFlags
     0.2       58,059,859        122     475,900.5    3,967    8,005,573   1,002,263.7  cudaHostAlloc
     0.1       42,505,965        687      61,871.9    1,794      823,937     130,922.3  cudaFree
     0.0       12,568,727        122     103,022.4    4,178    1,054,201     198,971.7  cudaFreeHost
     0.0        2,558,112        781       3,275.4      433    1,899,666      67,951.0  cudaEventDestroy
[...]

We can see from this that the caching allocator does a good job of caching: only about 687 device + 122 host real allocations (from the cudaMalloc and cudaHostAlloc counts; the ~781 cudaEventCreateWithFlags calls are in the same ballpark) for about 410k frees (inferred from the cudaEventRecord count).

We could imagine replacing cudaEventRecord + cudaEventQuery (+ cudaEventCreateWithFlags + cudaEventDestroy) with a single cudaStreamAddCallback for the stream-local to global free transition. That would enable a more powerful change: the callback could move the memory block to a different container, allowing a local-then-global allocation strategy for blocks, and the data structure would be cleaner. Nevertheless, the performance gain is very uncertain, as the per-call cost of cudaStreamAddCallback in this example exceeds that of cudaEventRecord + cudaEventQuery by a factor of about 3.
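
A rough sketch of what that replacement could look like, assuming a hypothetical `moveToGlobalFree()` hook on the allocator (note that stream callbacks must not call CUDA API functions themselves):

```cpp
// Sketch only: the stream-local to global free transition done with a single
// stream callback instead of cudaEventRecord + repeated cudaEventQuery polling.
// CachingAllocator::moveToGlobalFree() is a hypothetical hook, not existing code.
#include <cuda_runtime.h>

struct CachingAllocator {
  void moveToGlobalFree(void* block);  // hypothetical: hand the block to the global pool
};

struct CallbackPayload {
  CachingAllocator* allocator;
  void* block;
};

// Invoked by the CUDA runtime once all work queued on the stream before the
// callback has completed, i.e. once the block is safe to hand to other streams.
// Stream callbacks must not make CUDA API calls themselves.
static void CUDART_CB onBlockIdle(cudaStream_t /*stream*/, cudaError_t status, void* userData) {
  auto* payload = static_cast<CallbackPayload*>(userData);
  if (status == cudaSuccess) {
    payload->allocator->moveToGlobalFree(payload->block);
  }
  delete payload;
}

// Called at "thread free" time: schedules the global free without any event.
void scheduleGlobalFree(CachingAllocator& allocator, void* block, cudaStream_t stream) {
  cudaStreamAddCallback(stream, onBlockIdle, new CallbackPayload{&allocator, block}, 0);
}
```

Whether this pays off depends on the per-call cost noted above.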

(Regarding the events, I extrapolated that we have 7 CUDA events per CMS event due to the EDM (70,000 cudaEventSynchronize calls over 10k events) and 41 allocations (frees, to be precise) per CMS event (the remaining ~410,000 cudaEventRecord calls over 10k events).)


makortel commented Sep 7, 2021

> We could imagine replacing cudaEventRecord + cudaEventQuery (+ cudaEventCreateWithFlags + cudaEventDestroy) with a single cudaStreamAddCallback for the stream-local to global free transition. That would enable a more powerful change: the callback could move the memory block to a different container, allowing a local-then-global allocation strategy for blocks, and the data structure would be cleaner. Nevertheless, the performance gain is very uncertain, as the per-call cost of cudaStreamAddCallback in this example exceeds that of cudaEventRecord + cudaEventQuery by a factor of about 3.

I'm concerned, as you noted as well, that doing cudaStreamAddCallback() for each memory allocation would be (very) costly.

At the time of cms-patatrack/cmssw#412 I actually tried to use cudaLaunchHostFunc() instead of the CUDA events, but IIRC I got various runtime errors that I did not have much time to debug (and that I, unfortunately, did not document in that PR).

My motivation for having the CUDA event passed in the allocation call stems from the following:

  • The ProduceScopedContext already records a CUDA event at the end of produce()
    • Adding a cudaEventRecord() call to acquire() would still be at most as expensive as the cudaEventRecord() calls in the allocator, and would be cheaper if more than one allocation is made in acquire()
  • Memory blocks used in event data products don't benefit from the "global free" feature, because the framework guarantees that all asynchronous work accessing them has finished by the time of their host-side destruction ("thread free")
  • The EventSetup side needs a complete overhaul for many reasons. In order to utilize the caching allocator there (e.g. the delayed "global free" would be handy), (at least) the CachingHostAllocator would need to be extended to support multiple CUDA events per memory block, one for each device (see the sketch after this list)
    • Also in the ES case one CUDA event could easily be shared between many memory blocks
    • If we ever get to improve early deletion of event data products (Improve the delete early ability of the Framework cms-sw/cmssw#16481, which could well be "never"), the caching allocator would also need to track CUDA events for all the consumers of the (relevant) event products. Doing such tracking at the level of event products (using e.g. the CUDA event address as a key) sounds much easier than doing it at the level of memory blocks.
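
As an illustration of the multi-event idea, here is a minimal sketch (hypothetical structure, not the current CachingHostAllocator) of a host-memory block guarded by one CUDA event per device:

```cpp
// Sketch only: a cached host-memory block whose return to the global pool
// depends on multiple CUDA events, one per device that may still access it.
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>

struct HostBlock {
  void* ptr = nullptr;
  std::size_t bytes = 0;
  // One event per device on which asynchronous work touching this block was
  // queued; recorded by the caller (e.g. the ScopedContext or ESProducer code).
  std::vector<cudaEvent_t> guards;

  // The block can be handed back to the global pool only when every guarding
  // event has completed on its device.
  bool readyForReuse() const {
    for (cudaEvent_t ev : guards) {
      if (cudaEventQuery(ev) == cudaErrorNotReady)
        return false;
    }
    return true;
  }
};
```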


ericcano commented Sep 8, 2021

OK, I see: we explicitly create the event and all users (here the allocations) ride the same event, pushing the optimization beyond memory allocation.

I now have a few SoAs in place, and the memory allocation is consolidated, so we should get a similar effect (I targeted first the places where most consolidation can be obtained). I am running current master (d78674b) and the currently rebased https://github.com/ericcano/pixeltrack-standalone/tree/macroBasedSoA (ericcano@b1fd2d0).

I still get similar statistics in the reference:

 Time(%)  Total Time (ns)  Num Calls    Average     Minimum    Maximum       StdDev                    Name
 -------  ---------------  ---------  ------------  -------  -----------  ------------  ----------------------------------
    39.9   10,981,591,635    480,004      22,878.1      454    6,463,016     115,452.7  cudaEventRecord
    32.1    8,832,374,557    410,000      21,542.4    2,816    2,493,569      68,125.8  cudaLaunchKernel
     9.9    2,735,516,299     70,000      39,078.8      962    8,047,639     300,851.8  cudaEventSynchronize
     6.4    1,762,799,580    100,010      17,626.2    2,354    2,469,649      57,116.4  cudaMemcpyAsync
     4.7    1,296,957,178    541,851       2,393.6      418    3,086,495       7,168.5  cudaEventQuery
     2.4      668,500,639     10,000      66,850.1    4,717    3,810,799     198,101.6  cudaStreamAddCallback
     1.5      407,345,863     20,000      20,367.3    3,513      829,759      49,436.2  cudaMemsetAsync
     1.2      341,704,242        675     506,228.5    1,969    7,941,491   1,044,261.9  cudaMalloc
     0.5      149,914,455          3  49,971,485.0   14,510  149,870,551  86,515,129.3  cudaMemGetInfo
     0.5      125,778,473         48   2,620,384.9    4,655   18,017,534   4,386,477.4  cudaStreamCreateWithFlags
     0.3       90,361,589        770     117,352.7      439   14,779,844     870,213.3  cudaEventCreateWithFlags
     0.3       75,826,065        124     611,500.5    4,226    6,528,841   1,071,075.8  cudaHostAlloc
     0.1       39,951,238        675      59,187.0    1,821      819,582     126,802.3  cudaFree
     0.0       11,512,147        124      92,839.9    4,229      855,437     175,178.9  cudaFreeHost

With the SoA in, the number of calls to cudaEventRecord drops significantly:

 Time(%)  Total Time (ns)  Num Calls    Average     Minimum    Maximum       StdDev                    Name
 -------  ---------------  ---------  ------------  -------  -----------  ------------  ----------------------------------
    40.4   12,607,667,032    350,007      36,021.2      446    6,709,955     145,819.5  cudaEventRecord
    33.1   10,313,239,658    410,000      25,154.2    3,108    2,019,457      73,718.7  cudaLaunchKernel
     8.3    2,574,231,252     70,000      36,774.7    1,013    8,551,951     291,658.7  cudaEventSynchronize
     6.4    1,982,934,092     80,010      24,783.6    2,526    2,151,922      71,624.0  cudaMemcpyAsync
     3.9    1,214,230,623    398,472       3,047.2      426      582,961       6,526.3  cudaEventQuery
     2.8      873,035,733     10,000      87,303.6    5,283    6,948,089     256,767.5  cudaStreamAddCallback
     1.8      552,948,400     20,000      27,647.4    3,696    2,036,265      63,851.8  cudaMemsetAsync
     1.6      484,866,449        427   1,135,518.6    2,309   10,954,448   1,876,766.5  cudaMalloc
     0.6      172,115,971         48   3,585,749.4    4,704   17,267,533   4,859,098.5  cudaStreamCreateWithFlags
     0.5      151,382,954          3  50,460,984.7   28,434  151,318,036  87,344,768.7  cudaMemGetInfo
     0.4      121,128,985        522     232,047.9      457   17,542,393   1,220,848.7  cudaEventCreateWithFlags
     0.2       73,806,234        118     625,476.6    4,504    9,961,388   1,384,755.7  cudaHostAlloc
     0.1       37,799,096        427      88,522.5    1,905      818,331     163,629.7  cudaFree
     0.0       14,169,408        118     120,079.7    3,978    1,787,986     248,001.3  cudaFreeHost

Yet the global performance is the same (16 threads, 24 streams, 10k events, no transfers, no validation). The reference gets 887 events/s, while SoA gets 891 events/s.

The global cost of cudaEventRecord seems to stay the same or similar when extra calls are cut from a streak of calls, maybe due to internal caching in CUDA. On top of that, I found an example in NSight where many short cudaEventRecord calls happened back to back without spending time in pthread_mutex_lock, as if a rapid unlock-lock sequence allowed the thread to re-acquire the lock. I had a look at the glibc implementation and did not find hints of this kind of optimization (it could be in the kernel-side futex implementation, but I did not look that far).

All the green blocks in the following plot are cudaEventRecord:

[NSight timeline screenshot]

We can see the neighboring thread remaining blocked during this streak:

[NSight timeline screenshot]

Tooltips confirm that both threads are working on device memory and hence contending for the same mutex.

Finally, looking at it from the system-call perspective, we see relatively few pthread_mutex_lock calls in both cases, although they are still the leading cause of waiting:
Before SoA:

 Time(%)  Total Time (ns)  Num Calls    Average    Minimum    Maximum      StdDev              Name
 -------  ---------------  ---------  -----------  -------  -----------  -----------  ----------------------
    82.7  244,023,111,016     83,747  2,913,813.2    1,000   35,575,851  2,547,263.8  pthread_mutex_lock
     5.9   17,533,523,642     22,525    778,402.8    1,152  100,388,901  1,793,027.7  poll
     5.8   17,207,836,507      8,734  1,970,212.6    9,898   17,748,541  1,759,383.3  sem_wait
     3.6   10,533,160,773     46,374    227,135.0    1,000   17,979,282    377,309.6  pthread_rwlock_wrlock
     1.9    5,567,019,409     31,396    177,316.2    1,001    1,540,021    129,730.1  pthread_rwlock_rdlock

With SoA:

 Time(%)  Total Time (ns)  Num Calls    Average    Minimum    Maximum      StdDev              Name
 -------  ---------------  ---------  -----------  -------  -----------  -----------  ----------------------
    81.9  252,045,204,557     81,204  3,103,852.1    1,000   31,571,413  2,722,404.5  pthread_mutex_lock
     5.9   18,233,605,288     23,598    772,675.9    1,021  100,391,773  1,814,575.5  poll
     5.8   17,890,932,693      8,911  2,007,735.7    7,531   17,843,847  1,775,470.6  sem_wait
     4.1   12,501,952,523     53,618    233,167.1    1,005   17,403,770    416,252.1  pthread_rwlock_wrlock
     2.2    6,708,830,058     38,577    173,907.5    1,002      919,867    129,839.1  pthread_rwlock_rdlock

I would interpret this low number as the system calls being short enough to fly under nsys's sampling radar. An strace -cf of both runs confirms a more reasonable ~1M calls to futex for the baseline and ~750k for the SoA version.

As a conclusion, it seems that CUDA's (and Linux's) internal optimizations already work around our repetitive calls to cudaEventRecord and pthread_mutex_lock, so reducing their number will yield a limited performance improvement; reducing the size of the contention domains in the allocator is therefore the best approach, as it allows better parallelism.
