Remove all remaining calls to cudaStreamSynchronize() #109
Certainly.
…avoid implicit cudaStreamSynchronize
…Synchronize() We need the number of hits/clusters in RecHit acquire(), but fortunately we can do the necessary calculation already in Raw2Cluster.
Rebased.
Validation summary
Reference release CMSSW_10_2_0_pre6 at a674e1f
Workflow 10824.5 is broken even without this PR, probably due to #105.
And the reason is likely here (still to be verified)
So all
Do we have copies from the GPU to the stack on the CPU, or only to the heap?
I'm going through them; so far only heap (e.g. a normal class member variable). I think we cleaned up the stack ones in the last hackathon.
Yes, of course - sorry for the confusion.
#103 alone does not fix the issue.
Fix is in #112.
Remove all of the remaining calls to cudaStreamSynchronize() from the pixel "raw to cluster" workflow.
Replace thrust::inclusive_scan with cub::DeviceScan::InclusiveSum to avoid implicit cudaStreamSynchronize and per-event buffer allocations.
Avoid a data dependency on the number of hits:
- in raw2cluster, always transfer the errors for the maximum number of modules.
- in rechits, replace the calculation of the total number of hits with the total number of clusters.
Copy the phase1PixelTopology::layerStart array to the GPU to avoid an extra copy back and forth from the CPU.
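For reference, a minimal sketch of the thrust-to-cub replacement described in the commit message above. It is not the code from this PR: `ScanWorkspace`, `makeWorkspace` and `scanOnStream` are hypothetical names, error checking is omitted, and the final `cudaStreamSynchronize()` is there only so the demo can return the result to the caller. The point is that `cub::DeviceScan::InclusiveSum` takes the scratch buffer and the stream explicitly, so the buffer can be allocated once and the scan is simply enqueued.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper: owns all device buffers so nothing is allocated per event.
struct ScanWorkspace {
  uint32_t* d_in = nullptr;
  uint32_t* d_out = nullptr;
  void* d_temp = nullptr;
  size_t tempBytes = 0;
  int capacity = 0;
};

// One-time setup (error checking omitted for brevity).
ScanWorkspace makeWorkspace(int maxItems) {
  ScanWorkspace ws;
  ws.capacity = maxItems;
  cudaMalloc(&ws.d_in, maxItems * sizeof(uint32_t));
  cudaMalloc(&ws.d_out, maxItems * sizeof(uint32_t));
  // With a null scratch pointer, CUB only reports how many bytes it needs.
  cub::DeviceScan::InclusiveSum(nullptr, ws.tempBytes, ws.d_in, ws.d_out, maxItems);
  cudaMalloc(&ws.d_temp, ws.tempBytes);
  return ws;
}

// Per-event work: everything is enqueued on the given stream.
std::vector<uint32_t> scanOnStream(ScanWorkspace& ws,
                                   const std::vector<uint32_t>& in,
                                   cudaStream_t stream) {
  const int n = static_cast<int>(in.size());
  std::vector<uint32_t> out(n);
  if (n == 0 || n > ws.capacity) return out;  // sketch: no resizing logic

  cudaMemcpyAsync(ws.d_in, in.data(), n * sizeof(uint32_t),
                  cudaMemcpyHostToDevice, stream);
  // Reuses the preallocated scratch buffer; no allocation happens here.
  cub::DeviceScan::InclusiveSum(ws.d_temp, ws.tempBytes, ws.d_in, ws.d_out, n, stream);
  cudaMemcpyAsync(out.data(), ws.d_out, n * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);

  // Demo only: wait so that `out` is safe to read on the host.
  cudaStreamSynchronize(stream);
  return out;
}
```

Calling `scanOnStream` with the input `{3, 0, 2, 5}` should give `{3, 3, 5, 10}`. In a real producer the result would stay on the device and the trailing synchronization would be dropped.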
This PR removes all of the remaining calls to cudaStreamSynchronize():
- Replace thrust::inclusive_scan with cub::DeviceScan::InclusiveSum in rechits (though moved to raw2cluster within the last bullet)
  - thrust::inclusive_scan seems to use the cub algorithm internally, and in addition to the implicit cudaStreamSynchronize() it also allocates+frees the buffer on each call (!)

In addition, I noticed that rechits was accessing host memory after a cudaMemcpyAsync to it without synchronization. First I added the synchronization, and then (following a comment) moved the number shuffling to the GPU. It feels a bit dumb though to transfer the 11 elements of a constexpr uint32_t array to device memory, but apparently something like that is needed.

Resolves #79.
No changes expected.
@fwyzard @felicepantaleo @VinInn
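
To illustrate the "number shuffling on the GPU" point from the description, here is a rough sketch under stated assumptions. It assumes the shuffling is a gather of the form `out[i] = moduleStart[layerStart[i]]`, which the description does not spell out. `kLayerStart` holds placeholder values, not the real contents of `phase1PixelTopology::layerStart`, and `gatherLayerStart` and `layerStartsOnDevice` are hypothetical names. The lookup table is copied to device memory explicitly, presumably because device code cannot index a host-side constexpr array at run time.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Placeholder table: 11 entries, values invented for this sketch.
constexpr int kNumLayerStarts = 11;
constexpr uint32_t kLayerStart[kNumLayerStarts] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20};

// Assumed form of the shuffling: pick the prefix-sum entry at each layer boundary.
__global__ void gatherLayerStart(const uint32_t* moduleStart,
                                 const uint32_t* layerStart,
                                 uint32_t* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = moduleStart[layerStart[i]];
}

// moduleStart must have more than 20 entries for the placeholder table above.
// Error checking and buffer reuse are omitted to keep the sketch short.
std::vector<uint32_t> layerStartsOnDevice(const std::vector<uint32_t>& moduleStart,
                                          cudaStream_t stream) {
  uint32_t* d_layerStart = nullptr;
  uint32_t* d_moduleStart = nullptr;
  uint32_t* d_out = nullptr;
  cudaMalloc(&d_layerStart, sizeof(kLayerStart));
  cudaMalloc(&d_moduleStart, moduleStart.size() * sizeof(uint32_t));
  cudaMalloc(&d_out, kNumLayerStarts * sizeof(uint32_t));

  // In a real module the table would be uploaded once, not per call.
  cudaMemcpyAsync(d_layerStart, kLayerStart, sizeof(kLayerStart),
                  cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(d_moduleStart, moduleStart.data(),
                  moduleStart.size() * sizeof(uint32_t),
                  cudaMemcpyHostToDevice, stream);

  // The gather runs on the same stream, so no host-side read of
  // half-transferred data is needed in between.
  gatherLayerStart<<<1, 32, 0, stream>>>(d_moduleStart, d_layerStart, d_out,
                                         kNumLayerStarts);

  std::vector<uint32_t> out(kNumLayerStarts);
  cudaMemcpyAsync(out.data(), d_out, kNumLayerStarts * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // demo only: make `out` readable on the host

  cudaFree(d_layerStart);
  cudaFree(d_moduleStart);
  cudaFree(d_out);
  return out;
}
```

The design choice here is to keep the intermediate values on the device: the host never touches the destination of an asynchronous copy until the stream has completed.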