
Fix a race for CUDA event in CachingHostAllocator #237

Merged

Conversation

makortel

Currently the mutex in the allocator is unlocked after the memory block has been inserted into the free list (cached_bytes), but before cudaEventRecord() is called. This means there is a short window during which the memory block is already in the free list while the CUDA event status is not cudaErrorNotReady, so a HostAllocate() called from another thread may consider the memory block to be free in

if(cudaEventQuery(block_itr->ready_event) != cudaErrorNotReady) {

If the memory block was previously associated with a CUDA stream on a different device, the CUDA event gets destroyed and re-created on the current device. In the original (freeing) thread, the cudaEventRecord() in
if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

then operates on a CUDA event that is suddenly on a different device, leading to the segfaults reported in #197 and #216.

This PR fixes the race condition by unlocking the mutex only after the cudaEventRecord() is called in HostFree().
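
For reference, here is a minimal sketch of the shape of the fix; it is not the actual CachingHostAllocator code, and the type and member names (BlockDescriptor, mutex_, cached_blocks_, HostAllocatorSketch) are simplified stand-ins for the CUB-style allocator internals.

```cpp
#include <cuda_runtime.h>

#include <mutex>
#include <set>

struct BlockDescriptor {
  void*        d_ptr             = nullptr;
  cudaEvent_t  ready_event       = nullptr;  // signals when the block's last work has finished
  cudaStream_t associated_stream = nullptr;
};

struct HostAllocatorSketch {
  std::mutex mutex_;
  std::multiset<BlockDescriptor*> cached_blocks_;  // stand-in for the real free list

  cudaError_t HostFree(BlockDescriptor* block) {
    std::unique_lock<std::mutex> lock(mutex_);

    // Insert the block into the free list ...
    cached_blocks_.insert(block);

    // ... and record the ready event *before* releasing the mutex. Previously the
    // mutex was unlocked here, so a concurrent HostAllocate() could find the block
    // in the free list while cudaEventQuery(ready_event) did not yet return
    // cudaErrorNotReady, and hand it out (possibly re-creating the event on another
    // device) while this thread was still about to record it.
    cudaError_t error = cudaEventRecord(block->ready_event, block->associated_stream);

    lock.unlock();  // only now is it safe for other threads to inspect ready_event
    return error;
  }
};
```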

@fwyzard @VinInn

@makortel makortel mentioned this pull request Dec 20, 2018
@makortel
Author

I believe the device allocator has a similar race condition (though it is much harder to detect). My PR to upstream is https://github.com/NVlabs/cub/pull/156.

@fwyzard

fwyzard commented Jan 7, 2019

I have repeated the test pointed out by @VinInn (32 threads, 16 streams, 4 GPUs) with these changes, and observed no crashes in 60+ trials.

@fwyzard

fwyzard commented Jan 7, 2019

Validation summary

  • Reference release CMSSW_10_4_0_pre4 at d74dd18
  • Development branch CMSSW_10_4_X_Patatrack at 6f55d70
  • Testing PRs:

makeTrackValidationPlots.py plots

  • /RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW
  • /RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

  • /RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW
  • /RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/f38a85495aeb0d46280400368288ef95c8fd1d38/log .

@fwyzard

fwyzard commented Jan 7, 2019

> I believe the device allocator has a similar race condition (though it is much harder to detect). My PR to upstream is NVlabs/cub#156.

@makortel, do you think we need to take any action to handle the possible race in the device allocator while waiting for it to be addressed upstream?
A couple of options:

  • patch cub/util_allocator.cuh in our external distribution of CUB
  • make a copy of cub/util_allocator.cuh as HeterogeneousCore/CUDAUtilities/interface/CUDADeviceAllocator.h and patch it locally

@fwyzard

fwyzard commented Jan 7, 2019

No changes in physics performance, as expected (see the summaries above).
Impact on the throughput, if present, is negligible (<0.1%).

Before

Warming up
Running 8 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1735.2 ±   1.9 ev/s (4000 events)
  1732.3 ±   2.3 ev/s (4000 events)
  1731.3 ±   2.0 ev/s (4000 events)
  1732.6 ±   2.1 ev/s (4000 events)
  1730.1 ±   2.2 ev/s (4000 events)
  1730.5 ±   1.3 ev/s (4000 events)
  1733.9 ±   2.3 ev/s (4000 events)
  1730.1 ±   1.9 ev/s (4000 events)

After

Warming up
Running 8 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1735.0 ±   2.0 ev/s (4000 events)
  1731.0 ±   1.8 ev/s (4000 events)
  1730.2 ±   1.9 ev/s (4000 events)
  1722.5 ±   1.7 ev/s (4000 events)
  1731.8 ±   1.8 ev/s (4000 events)
  1737.1 ±   1.7 ev/s (4000 events)
  1733.0 ±   1.5 ev/s (4000 events)
  1726.0 ±   2.7 ev/s (4000 events)

@fwyzard fwyzard merged commit ebc280c into cms-patatrack:CMSSW_10_4_X_Patatrack Jan 7, 2019
@makortel
Author

makortel commented Jan 7, 2019

@fwyzard I think it would be good to patch the device allocator ourselves. I don't have a clear opinion between the two options, though. I suspect we'll eventually want to craft our own caching (device) allocator, so maybe that gives a slight preference for our own copy? (Even if forking our own distribution of CUB and patching that would be cleaner.)

@fwyzard

fwyzard commented Jan 7, 2019

I also prefer to make a local copy; the situation with cub is already complicated enough, with one version coming in from CUDA via Thrust (.../cmssw/slc7_amd64_gcc700/external/cuda/10.0.130/include/thrust/system/cuda/detail/cub/cub.cuh) and one from its own external (...cmssw/slc7_amd64_gcc700/external/cub/1.8.0-ikaegh/include/cub/cub.cuh).

It also makes it easier to add more debug statements if needed...

@makortel
Author

makortel commented Jan 7, 2019

Looks like we have a decision then. Do you want to do it or shall I?

@fwyzard

fwyzard commented Jan 7, 2019

I'll have time for it tomorrow, if you don't do it sooner.

@makortel
Author

makortel commented Jan 7, 2019

Ok, I'll do it today.

@makortel
Author

makortel commented Jan 7, 2019

Done in #240.

@fwyzard fwyzard added this to the CMSSW_10_4_X_Patatrack milestone Jan 8, 2019