
Remove all remaining calls to cudaStreamSynchronize() #109

Merged: 5 commits merged into cms-patatrack:CMSSW_10_2_X_Patatrack on Jul 31, 2018

Conversation

makortel

This PR removes all of the remaining calls to cudaStreamSynchronize():

  • Replaced thrust::inclusive_scan with cub::DeviceScan::InclusiveSum in rechits (though it moved to raw2cluster as part of the last bullet); see the sketch after this list
    • This requires allocating a temporary buffer in device memory for the algorithm, but that is done only once per EDProducer instance
    • By contrast, thrust::inclusive_scan appears to use the cub algorithm internally, and in addition to the implicit cudaStreamSynchronize() it also allocates and frees the temporary buffer on each call (!)
      • this is clearly visible in the profiles
  • In raw2cluster, transfer the errors for the maximum number of modules
    • This is a relatively small number, so it doesn't make much of a difference
  • In rechits, replaced the calculation of the total number of hits with a calculation of the total number of clusters done in raw2cluster
    • The total number of hits is used for both the host-side memory allocation and the memory transfers, so with @felicepantaleo we felt it would be better to try to rearrange the calculations
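
For concreteness, here is a minimal sketch of the two-phase cub call pattern this refers to (the names, types and scratch-buffer handling are illustrative, not the actual code of this PR): the first InclusiveSum call with a null temporary-storage pointer only queries the required size, so the device buffer can be allocated once up front and then reused for every event on the producer's CUDA stream.

```cpp
// Sketch of the cub::DeviceScan::InclusiveSum pattern described above.
// Buffer and function names are illustrative.
#include <cstdint>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

struct ScanScratch {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
};

// Done once per EDProducer instance: query the required temporary storage
// size (null pointer triggers the size query) and allocate it in device memory.
void initScanScratch(ScanScratch& scratch, uint32_t const* d_in, uint32_t* d_out, int maxItems) {
  cub::DeviceScan::InclusiveSum(nullptr, scratch.temp_storage_bytes, d_in, d_out, maxItems);
  cudaMalloc(&scratch.d_temp_storage, scratch.temp_storage_bytes);
}

// Done per event: the scan is queued asynchronously on the given stream, with
// no implicit synchronization and no per-call allocation (unlike thrust::inclusive_scan).
void runInclusiveSum(ScanScratch& scratch, uint32_t const* d_in, uint32_t* d_out,
                     int nItems, cudaStream_t stream) {
  cub::DeviceScan::InclusiveSum(scratch.d_temp_storage, scratch.temp_storage_bytes,
                                d_in, d_out, nItems, stream);
}
```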

In addition, I noticed that rechits was accessing host memory after a cudaMemcpyAsync into it without any synchronization. First I added the synchronization, and then (following a comment) moved the number shuffling to the GPU. It feels a bit silly, though, to transfer the 11 elements of a constexpr uint32_t array to device memory, but apparently something like that is needed.
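
As an illustration of that last point, here is a sketch of what copying such a constexpr array to the device amounts to; the namespace follows the phase1PixelTopology::layerStart name used in the squashed commit message, while the array values and the helper function are placeholders.

```cpp
// Sketch only: copy a small constexpr host array to device memory so kernels
// can index it. The values below are placeholders, not the real topology.
#include <cstdint>
#include <cuda_runtime.h>

namespace phase1PixelTopology {
  constexpr uint32_t layerStart[11] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  // placeholder values
}

// Returns a device buffer holding the 11 elements; the caller owns it
// (cudaFree when no longer needed).
uint32_t* copyLayerStartToDevice(cudaStream_t stream) {
  uint32_t* d_layerStart = nullptr;
  cudaMalloc(&d_layerStart, sizeof(phase1PixelTopology::layerStart));
  // Copy the constexpr values to the device buffer on the given stream.
  cudaMemcpyAsync(d_layerStart, phase1PixelTopology::layerStart,
                  sizeof(phase1PixelTopology::layerStart),
                  cudaMemcpyHostToDevice, stream);
  return d_layerStart;
}
```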

Resolves #79.

No changes in the output are expected.

@fwyzard @felicepantaleo @VinInn

@fwyzard

fwyzard commented Jul 31, 2018

Hi @makortel, sorry, could you fix the conflicts that arose from merging #105?

@makortel
Author

Certainly.

@makortel
Author

Rebased.

@fwyzard

fwyzard commented Jul 31, 2018

Validation summary

Reference release CMSSW_10_2_0_pre6 at a674e1f
Development branch CMSSW_10_2_X_Patatrack at a8d41ef
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/6af96ae37b4a3e442238437f2b4dde4078644747/log .

@fwyzard

fwyzard commented Jul 31, 2018

Workflows 10824.5 were already broken before this PR, probably due to #105.
Workflows 10824.8 show the usual small level of irreproducibility.

fwyzard merged commit d2ca336 into cms-patatrack:CMSSW_10_2_X_Patatrack on Jul 31, 2018
@makortel
Author

makortel commented Aug 1, 2018

Hmm, there is still something synchronizing the CUDA streams... Screenshots from the profile of the testing workflow 10824:

Raw2cluster
[profile screenshot]

Rechits
[profile screenshot]

In both, the acquire phase includes all of the queued GPU activity.

In contrast, here is the raw2cluster picture from the development branch
[profile screenshot]

where the acquire clearly ends before the GPU activity ends.

@makortel
Author

makortel commented Aug 1, 2018

And the reason is likely here (still to be verified):
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

Asynchronous

  1. For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

So all cudaMemcpyAsync calls in the GPU->CPU direction had better target memory allocated with cudaMallocHost (pinned host memory).
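
A minimal sketch of the difference (buffer names are illustrative): the same cudaMemcpyAsync call only stays asynchronous when the host destination is pinned memory.

```cpp
// Sketch only: device-to-host cudaMemcpyAsync behaviour depends on the
// destination memory. Buffer names are illustrative.
#include <cstdint>
#include <cuda_runtime.h>
#include <vector>

void transferBack(uint32_t const* d_data, size_t n, cudaStream_t stream) {
  // Pageable destination (plain heap memory, e.g. a normal class member):
  // per the documentation quoted above, the "async" copy returns only once
  // the copy has completed, i.e. it effectively blocks the host thread.
  std::vector<uint32_t> h_pageable(n);
  cudaMemcpyAsync(h_pageable.data(), d_data, n * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);

  // Pinned destination: the copy is merely queued on the stream and the call
  // returns immediately; the data is valid only after the stream work has
  // completed (synchronization, event, or callback).
  uint32_t* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(uint32_t));
  cudaMemcpyAsync(h_pinned, d_data, n * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);
  // ... use h_pinned after completion, then cudaFreeHost(h_pinned) ...
}
```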

@fwyzard

fwyzard commented Aug 1, 2018

Do we have copies from the GPU to the stack on the CPU, or only to the heap?

@makortel
Author

makortel commented Aug 1, 2018

I'm going through them; so far only to the heap (e.g. normal class member variables). I think we cleaned up the stack ones in the last hackathon.

@fwyzard

fwyzard commented Aug 1, 2018

Can you check if #103 (sorry, not #111) has an impact?
The documentation is not clear on whether we should set the cudaDeviceMapHost flag to support pinned memory in the CUDA runtime.
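
For reference only, setting the flag would look roughly like the sketch below; whether it is actually needed for plain pinned transfers (rather than for mapped, zero-copy host memory) is exactly the open question here.

```cpp
// Sketch only: cudaDeviceMapHost enables mapped (zero-copy) host allocations.
// Device flags are typically set before the first runtime call that
// initializes the CUDA context on the device.
#include <cuda_runtime.h>

void configureDevice(int device) {
  cudaSetDevice(device);
  cudaSetDeviceFlags(cudaDeviceMapHost);
}
```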

@makortel
Author

makortel commented Aug 1, 2018

Did you mean #103? (I can't think how #111 could have an effect.)

At least in raw2cluster, just changing the class member variables to ones allocated with cudaMallocHost did the trick.
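
As an illustration, a change like that can be expressed with a small RAII holder for the pinned buffer used as a data member; the class and member names below are made up for the sketch and do not claim to match the actual Patatrack code.

```cpp
// Sketch only: a minimal RAII wrapper for a cudaMallocHost'ed buffer that can
// replace a plain (pageable) class member. Names are illustrative.
#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class PinnedHostBuffer {
public:
  explicit PinnedHostBuffer(size_t n) : size_(n) {
    cudaMallocHost(&data_, n * sizeof(T));  // pinned, so async D2H copies stay asynchronous
  }
  ~PinnedHostBuffer() { cudaFreeHost(data_); }
  PinnedHostBuffer(PinnedHostBuffer const&) = delete;
  PinnedHostBuffer& operator=(PinnedHostBuffer const&) = delete;

  T* data() { return data_; }
  size_t size() const { return size_; }

private:
  T* data_ = nullptr;
  size_t size_ = 0;
};

// e.g. as a producer data member, sized once for the maximum number of modules:
// PinnedHostBuffer<uint32_t> errorWords_{maxNumberOfModules};
```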

@fwyzard

fwyzard commented Aug 1, 2018

Yes, of course - sorry for the confusion.

@makortel
Author

makortel commented Aug 1, 2018

#103 alone does not fix the issue.

@makortel
Author

makortel commented Aug 1, 2018

Fix is in #112.

fwyzard pushed a commit that referenced this pull request Dec 14, 2018
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
Remove all of the remaining calls to cudaStreamSynchronize() from the pixel "raw to cluster" workflow.

Replace thrust::inclusive_scan with cub::DeviceScan::InclusiveSum to avoid implicit cudaStreamSynchronize and per-event buffer allocations

Avoid a data dependency on the number of hits:
  - in raw2cluster, always transfer the errors for the maximum number of modules.
  - in rechits, replace the calculation of the total number of hits with the total number of clusters

Copy the phase1PixelTopology::layerStart array to the GPU to avoid an extra copy back and forth from the CPU.
fwyzard pushed further commits that referenced this pull request, each with the same squashed commit message as above, on Oct 19, Oct 20, Oct 23, Nov 6, Nov 16 and Dec 25, 2020, and twice on Dec 29, 2020.