[RFC] Reduce calls to cudaEventRecord() via the caching allocators #412
base: CMSSW_11_0_X_Patatrack
Conversation
We could try to see if the "new" callback mechanism is faster. In principle we would NOT be able to call cudaFree anymore: do we really need that?
Do you mean using a callback instead of an event to signal from device to host that the device releases the ownership? I'd expect the events to be more performant than callbacks. But yeah, it could be tested.
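For concreteness, a minimal sketch of the two signalling mechanisms being compared here; the names `markDone`, `blockDone`, `signalWithCallback`, and `signalWithEvent` are hypothetical illustrations, not code from this PR:

```cpp
#include <cuda_runtime.h>
#include <atomic>

std::atomic<bool> blockDone{false};

// Callback route: the runtime invokes this on a CUDA-internal thread once
// all work queued before it in the stream has completed. CUDA API calls
// (including cudaFree) are not allowed inside the callback.
void CUDART_CB markDone(void* userData) {
  static_cast<std::atomic<bool>*>(userData)->store(true);
}

void signalWithCallback(cudaStream_t stream) {
  cudaLaunchHostFunc(stream, markDone, &blockDone);
}

// Event route: record an event in the stream and let the host poll it.
// cudaEventQuery() returns cudaSuccess once the preceding work is done.
void signalWithEvent(cudaEvent_t event, cudaStream_t stream) {
  cudaEventRecord(event, stream);
}

bool isDone(cudaEvent_t event) {
  return cudaEventQuery(event) == cudaSuccess;  // non-blocking check
}
```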
Indeed, my understanding is that it is just a sort of "CPU kernel". OK for the cudaFree, understood.
Ah, you refer specifically to the …
Indeed.
Right.
which is a guarantee we need to avoid stalling the framework. OK, it can be worked around if/when we need to go there (I suppose the best idea so far is to have a separate CPU thread checking the health of the CUDA runtime, e.g. once a second). I had an e-mail thread with Andreas Hehn, @fwyzard, @felicepantaleo during the hackathon in July.
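A rough sketch of the watchdog idea mentioned above, assuming a plain std::thread that polls the runtime about once a second; `cudaWatchdog` and the error handling are made up for illustration:

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical health-check thread: a cheap runtime call such as
// cudaGetDevice() should keep returning cudaSuccess while the CUDA
// runtime is healthy.
void cudaWatchdog(std::atomic<bool>& stop) {
  while (!stop.load()) {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
      // Here one would signal the framework about the unhealthy runtime
      // instead of just breaking out of the loop.
      break;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

// Usage: launch alongside the framework, set `stop` at shutdown.
// std::atomic<bool> stop{false};
// std::thread watchdog(cudaWatchdog, std::ref(stop));
```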
Validation summary
Reference release CMSSW_11_0_0_pre11 at 5b0a828

Validation plots:
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

Throughput plots:
- /EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
@makortel low priority, could you fix the conflicts?
Force-pushed from 9704d00 to 6197223.
Rebased on top of …
Force-pushed from 7ec0a22 to e41560b.
In that case the memory will be fully freed at the unique_ptr destruction time.
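Roughly, the ownership pattern under discussion; `freeDevice`, `DeviceDeleter`, and `device_unique_ptr` below are simplified stand-ins for the CMSSW helpers, not their actual definitions:

```cpp
#include <cuda_runtime.h>
#include <memory>

// Simplified stand-in for the caching allocator's deallocation hook; the
// real one returns the block to a pool instead of freeing it outright.
inline void freeDevice(void* ptr) { cudaFree(ptr); }

struct DeviceDeleter {
  void operator()(void* ptr) const { freeDevice(ptr); }
};

template <typename T>
using device_unique_ptr = std::unique_ptr<T, DeviceDeleter>;

// With the stream-less overloads of this PR, the deallocation hook can
// mark the block reusable immediately; with the stream-aware path it
// would first call cudaEventRecord() so the block is handed out again
// only after the device has finished with it. Either way, the memory
// goes back to the allocator exactly at device_unique_ptr destruction.
```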
Force-pushed from 6197223 to c7a5d59.
Rebased on top of CMSSW_11_0_0_Patatrack.
Needs to be rebased after #449 even if technically there are no merge conflicts.
To reduce the confusion between …
Better ideas are still welcome.
PR description:
This PR adds overloads for the caching allocators to allocate memory without device-side ownership. These overloads do not take the CUDA stream as an argument. Deallocating such memory blocks from the host frees them immediately, without calling `cudaEventRecord()` (according to VTune, `cudaEventRecord()` had the second-highest total waiting time for locking the mutex in the CUDA API).

I changed all `unique_ptr`'s in the CUDADataFormats (that are used in the pixel tracking workflow) that are owned by the data format class to use these overloads. This works because the data format objects are destructed only after all relevant work in their CUDA streams has finished.

This work was done during the NERSC-9 GPU hackathon at the Cray offices. On a Cori GPU node (V100) I got a 14% (20% on 2 GPUs) increase in throughput for 2018D JetHT data.
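As a hypothetical illustration of the two allocation paths (the function names below are made up for this sketch; they are not the actual CMSSW interfaces):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Stream-aware path: deallocation records an event in `stream`, and the
// caching allocator hands the block out again only once that event has
// completed, because the device may still be using the memory.
void* allocate(std::size_t bytes, cudaStream_t stream);
void deallocate(void* ptr, cudaStream_t stream);  // calls cudaEventRecord()

// Stream-less overloads in the spirit of this PR: no device-side
// ownership, so deallocation returns the block to the pool immediately,
// with no cudaEventRecord() (and thus no extra contention on the CUDA
// API mutex). Safe only if the caller guarantees the block is destructed
// after all relevant device work has finished.
void* allocate(std::size_t bytes);
void deallocate(void* ptr);
```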
The PR is RFC for two reasons: …
PR validation:
Profiling workflow runs, unit tests run.