[RFC] Reduce calls to cudaEventRecord() via the caching allocators #412
base: CMSSW_11_0_X_Patatrack
Conversation
We could try to see if the "new" callback mechanism is faster. In principle we would NOT be able to call cudaFree anymore: do we really need that?
Do you mean using a callback instead of an event to signal from device to host that the device releases the ownership? I'd expect the events to be more performant than callbacks. But yeah, it could be tested.
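For concreteness, a minimal sketch of the two signalling mechanisms being compared here; the names `markDone`, `blockDone`, `signalWithCallback`, and `signalWithEvent` are hypothetical illustrations, not code from this PR:

```cpp
#include <cuda_runtime.h>
#include <atomic>

std::atomic<bool> blockDone{false};

// Callback route: the runtime invokes this on a CUDA-internal thread once
// all work queued before it in the stream has completed. CUDA API calls
// (including cudaFree) are not allowed inside the callback.
void CUDART_CB markDone(void* userData) {
  static_cast<std::atomic<bool>*>(userData)->store(true);
}

void signalWithCallback(cudaStream_t stream) {
  cudaLaunchHostFunc(stream, markDone, &blockDone);
}

// Event route: record an event in the stream and let the host poll it.
// cudaEventQuery() returns cudaSuccess once the preceding work is done.
void signalWithEvent(cudaEvent_t event, cudaStream_t stream) {
  cudaEventRecord(event, stream);
}

bool isDone(cudaEvent_t event) {
  return cudaEventQuery(event) == cudaSuccess;  // non-blocking check
}
```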
Indeed, my understanding is that it is just a sort of "CPU kernel". OK for the cudaFree, understood.
Ah, you refer specifically to the …
Indeed.
Right.
which is a guarantee we need to avoid stalling the framework. OK, it can be worked around if/when we need to go there (I suppose the best idea so far is to have a separate CPU thread checking the health of the CUDA runtime, e.g. once a second). I had an e-mail thread with Andreas Hehn, @fwyzard, @felicepantaleo during the hackathon in July.
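A rough sketch of the watchdog idea mentioned above, assuming a plain std::thread that polls the runtime about once a second; `cudaWatchdog` and the error handling are made up for illustration:

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical health-check thread: a cheap runtime call such as
// cudaGetDevice() should keep returning cudaSuccess while the CUDA
// runtime is healthy.
void cudaWatchdog(std::atomic<bool>& stop) {
  while (!stop.load()) {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
      // Here one would signal the framework about the unhealthy runtime
      // instead of just breaking out of the loop.
      break;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

// Usage: launch alongside the framework, set `stop` at shutdown.
// std::atomic<bool> stop{false};
// std::thread watchdog(cudaWatchdog, std::ref(stop));
```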
Validation summary
Reference release CMSSW_11_0_0_pre11 at 5b0a828

Validation plots:
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

Throughput plots:
- /EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
@makortel low priority, could you fix the conflicts?
Force-pushed from 9704d00 to 6197223.
Rebased on top of …
Force-pushed from 7ec0a22 to e41560b.
In that case the memory will be fully freed at the unique_ptr destruction time.
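Roughly, the ownership pattern under discussion; `freeDevice`, `DeviceDeleter`, and `device_unique_ptr` below are simplified stand-ins for the CMSSW helpers, not their actual definitions:

```cpp
#include <cuda_runtime.h>
#include <memory>

// Simplified stand-in for the caching allocator's deallocation hook; the
// real one returns the block to a pool instead of freeing it outright.
inline void freeDevice(void* ptr) { cudaFree(ptr); }

struct DeviceDeleter {
  void operator()(void* ptr) const { freeDevice(ptr); }
};

template <typename T>
using device_unique_ptr = std::unique_ptr<T, DeviceDeleter>;

// With the stream-less overloads of this PR, the deallocation hook can
// mark the block reusable immediately; with the stream-aware path it
// would first call cudaEventRecord() so the block is handed out again
// only after the device has finished with it. Either way, the memory
// goes back to the allocator exactly at device_unique_ptr destruction.
```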
Force-pushed from 6197223 to c7a5d59.
Rebased on top of CMSSW_11_0_0_Patatrack.
Needs to be rebased after #449 even if technically there are no merge conflicts.
To reduce the confusion between …
Better ideas are still welcome.
PR description:
This PR adds overloads for the caching allocators to allocate memory without device-side ownership. These overloads do not take the CUDA stream as an argument. Deallocating such memory blocks from the host frees them immediately, without calling `cudaEventRecord()` (according to VTune, `cudaEventRecord()` had the second-highest total waiting time for locking the mutex in the CUDA API).

I changed all `unique_ptr`'s in the CUDADataFormats (that are used in the pixel tracking workflow) that are owned by the data format class to use these overloads. This works because the data format objects are destructed only after all relevant work in their CUDA streams has finished.

This work was done during the NERSC-9 GPU hackathon at the Cray offices. On a Cori GPU node (V100) I got a 14% (20% on 2 GPUs) increase in throughput for 2018D JetHT data.
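As a hypothetical illustration of the two allocation paths (the function names below are made up for this sketch; they are not the actual CMSSW interfaces):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Stream-aware path: deallocation records an event in `stream`, and the
// caching allocator hands the block out again only once that event has
// completed, because the device may still be using the memory.
void* allocate(std::size_t bytes, cudaStream_t stream);
void deallocate(void* ptr, cudaStream_t stream);  // calls cudaEventRecord()

// Stream-less overloads in the spirit of this PR: no device-side
// ownership, so deallocation returns the block to the pool immediately,
// with no cudaEventRecord() (and thus no extra contention on the CUDA
// API mutex). Safe only if the caller guarantees the block is destructed
// after all relevant device work has finished.
void* allocate(std::size_t bytes);
void deallocate(void* ptr);
```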
The PR is RFC for two reasons: …
PR validation:
Profiling workflow runs, unit tests run.