
Fix a race for CUDA event in CachingHostAllocator #237

Merged

Conversation

makortel

Currently the mutex in the allocator is unlocked after the memory block has been inserted into the free list (cached_bytes), but before cudaEventRecord() is called. This means there is a short window during which the memory block is already in the free list while the CUDA event status is not cudaErrorNotReady, so a HostAllocate() called from another thread may consider the memory block to be free in

if(cudaEventQuery(block_itr->ready_event) != cudaErrorNotReady) {

If the memory block was previously associated with a CUDA stream on a different device, the CUDA event gets destroyed and re-created on the current device. In the original (freeing) thread, the cudaEventRecord() in
if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

then operates on a CUDA event that is suddenly on a different device, leading to the segfaults reported in #197 and #216.

This PR fixes the race condition by unlocking the mutex only after the cudaEventRecord() is called in HostFree().
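
For reference, here is a minimal sketch of the shape of the fix; it is not the actual CachingHostAllocator code, and the type and member names (BlockDescriptor, mutex_, cached_blocks_, HostAllocatorSketch) are simplified stand-ins for the CUB-style allocator internals.

```cpp
#include <cuda_runtime.h>

#include <mutex>
#include <set>

struct BlockDescriptor {
  void*        d_ptr             = nullptr;
  cudaEvent_t  ready_event       = nullptr;  // signals when the block's last work has finished
  cudaStream_t associated_stream = nullptr;
};

struct HostAllocatorSketch {
  std::mutex mutex_;
  std::multiset<BlockDescriptor*> cached_blocks_;  // stand-in for the real free list

  cudaError_t HostFree(BlockDescriptor* block) {
    std::unique_lock<std::mutex> lock(mutex_);

    // Insert the block into the free list ...
    cached_blocks_.insert(block);

    // ... and record the ready event *before* releasing the mutex. Previously the
    // mutex was unlocked here, so a concurrent HostAllocate() could find the block
    // in the free list while cudaEventQuery(ready_event) did not yet return
    // cudaErrorNotReady, and hand it out (possibly re-creating the event on another
    // device) while this thread was still about to record it.
    cudaError_t error = cudaEventRecord(block->ready_event, block->associated_stream);

    lock.unlock();  // only now is it safe for other threads to inspect ready_event
    return error;
  }
};
```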

@fwyzard @VinInn

@makortel makortel mentioned this pull request Dec 20, 2018
@makortel
Author

I believe the device allocator has a similar race condition (though it is much harder to detect). My PR to upstream is https://github.com/NVlabs/cub/pull/156.

@fwyzard

fwyzard commented Jan 7, 2019

I have repeated the test pointed out by @VinInn (32 threads, 16 streams, 4 GPUs) with these changes, and observed no crashes in 60+ trials.

@fwyzard

fwyzard commented Jan 7, 2019

Validation summary

  • Reference release CMSSW_10_4_0_pre4 at d74dd18
  • Development branch CMSSW_10_4_X_Patatrack at 6f55d70
  • Testing PRs:

makeTrackValidationPlots.py plots

  • /RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW
  • /RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

  • /RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW
  • /RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/f38a85495aeb0d46280400368288ef95c8fd1d38/log .

@fwyzard

fwyzard commented Jan 7, 2019

> I believe the device allocator has a similar race condition (though it is much harder to detect). My PR to upstream is NVlabs/cub#156.

@makortel, do you think we need to take any action to handle the possible race in the device allocator while waiting for it to be addressed upstream?
A couple of options:

  • patch cub/util_allocator.cuh in our external distribution of CUB
  • make a copy of cub/util_allocator.cuh as HeterogeneousCore/CUDAUtilities/interface/CUDADeviceAllocator.h and patch it locally

@fwyzard

fwyzard commented Jan 7, 2019

No changes in physics performance, as expected (see the summaries above).
Impact on the throughput, if present, is negligible (<0.1%).

Before

Warming up
Running 8 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1735.2 ±   1.9 ev/s (4000 events)
  1732.3 ±   2.3 ev/s (4000 events)
  1731.3 ±   2.0 ev/s (4000 events)
  1732.6 ±   2.1 ev/s (4000 events)
  1730.1 ±   2.2 ev/s (4000 events)
  1730.5 ±   1.3 ev/s (4000 events)
  1733.9 ±   2.3 ev/s (4000 events)
  1730.1 ±   1.9 ev/s (4000 events)

After

Warming up
Running 8 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1735.0 ±   2.0 ev/s (4000 events)
  1731.0 ±   1.8 ev/s (4000 events)
  1730.2 ±   1.9 ev/s (4000 events)
  1722.5 ±   1.7 ev/s (4000 events)
  1731.8 ±   1.8 ev/s (4000 events)
  1737.1 ±   1.7 ev/s (4000 events)
  1733.0 ±   1.5 ev/s (4000 events)
  1726.0 ±   2.7 ev/s (4000 events)

@fwyzard fwyzard merged commit ebc280c into cms-patatrack:CMSSW_10_4_X_Patatrack Jan 7, 2019
@makortel
Author

makortel commented Jan 7, 2019

@fwyzard I think it would be good to patch the device allocator ourselves. I don't have a clear opinion between the two options, though. I suspect we'll eventually want to craft our own caching (device) allocator, so maybe that gives a slight preference for our own copy? (Even if forking our own distribution of CUB and patching that would be cleaner.)

@fwyzard

fwyzard commented Jan 7, 2019

I also prefer to make a local copy; the situation with cub is already complicated enough, with one version coming in from CUDA via Thrust (.../cmssw/slc7_amd64_gcc700/external/cuda/10.0.130/include/thrust/system/cuda/detail/cub/cub.cuh) and one from its own external (...cmssw/slc7_amd64_gcc700/external/cub/1.8.0-ikaegh/include/cub/cub.cuh).

It also makes it easier to add more debug statements if needed...

@makortel
Author

makortel commented Jan 7, 2019

Looks like we have a decision then. Do you want to do it or shall I?

@fwyzard

fwyzard commented Jan 7, 2019

I'll have time for it tomorrow, if you don't do it sooner.

@makortel
Author

makortel commented Jan 7, 2019

Ok, I'll do it today.

@makortel
Author

makortel commented Jan 7, 2019

Done in #240.

@fwyzard fwyzard added this to the CMSSW_10_4_X_Patatrack milestone Jan 8, 2019