
Remove all remaining calls to cudaStreamSynchronize() #109

Merged: 5 commits merged into cms-patatrack:CMSSW_10_2_X_Patatrack on Jul 31, 2018

Conversation

makortel

This PR removes all of the remaining calls to cudaStreamSynchronize():

  • Replaced thrust::inclusive_scan with cub::DeviceScan::InclusiveSum in rechits (though it moved to raw2cluster as part of the last bullet); see the sketch after this list
    • This requires allocating a temporary buffer in device memory for the algorithm, but that is done only once per EDProducer instance
    • By contrast, thrust::inclusive_scan appears to use the cub algorithm internally, and in addition to the implicit cudaStreamSynchronize() it also allocates and frees the temporary buffer on each call (!)
      • this is clearly visible in the profiles
  • In raw2cluster, transfer the errors for the maximum number of modules
    • This is a relatively small number, so it doesn't make much of a difference
  • In rechits, replaced the calculation of the total number of hits with a calculation of the total number of clusters done in raw2cluster
    • The total number of hits is used for both the host-side memory allocation and the memory transfers, so with @felicepantaleo we felt it would be better to try to rearrange the calculations
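
For concreteness, here is a minimal sketch of the two-phase cub call pattern this refers to (the names, types and scratch-buffer handling are illustrative, not the actual code of this PR): the first InclusiveSum call with a null temporary-storage pointer only queries the required size, so the device buffer can be allocated once up front and then reused for every event on the producer's CUDA stream.

```cpp
// Sketch of the cub::DeviceScan::InclusiveSum pattern described above.
// Buffer and function names are illustrative.
#include <cstdint>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

struct ScanScratch {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
};

// Done once per EDProducer instance: query the required temporary storage
// size (null pointer triggers the size query) and allocate it in device memory.
void initScanScratch(ScanScratch& scratch, uint32_t const* d_in, uint32_t* d_out, int maxItems) {
  cub::DeviceScan::InclusiveSum(nullptr, scratch.temp_storage_bytes, d_in, d_out, maxItems);
  cudaMalloc(&scratch.d_temp_storage, scratch.temp_storage_bytes);
}

// Done per event: the scan is queued asynchronously on the given stream, with
// no implicit synchronization and no per-call allocation (unlike thrust::inclusive_scan).
void runInclusiveSum(ScanScratch& scratch, uint32_t const* d_in, uint32_t* d_out,
                     int nItems, cudaStream_t stream) {
  cub::DeviceScan::InclusiveSum(scratch.d_temp_storage, scratch.temp_storage_bytes,
                                d_in, d_out, nItems, stream);
}
```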

In addition, I noticed that rechits was accessing host memory after a cudaMemcpyAsync into it without any synchronization. First I added the synchronization, and then (following a comment) moved the number shuffling to the GPU. It feels a bit silly, though, to transfer the 11 elements of a constexpr uint32_t array to device memory, but apparently something like that is needed.
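
As an illustration of that last point, here is a sketch of what copying such a constexpr array to the device amounts to; the namespace follows the phase1PixelTopology::layerStart name used in the squashed commit message, while the array values and the helper function are placeholders.

```cpp
// Sketch only: copy a small constexpr host array to device memory so kernels
// can index it. The values below are placeholders, not the real topology.
#include <cstdint>
#include <cuda_runtime.h>

namespace phase1PixelTopology {
  constexpr uint32_t layerStart[11] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  // placeholder values
}

// Returns a device buffer holding the 11 elements; the caller owns it
// (cudaFree when no longer needed).
uint32_t* copyLayerStartToDevice(cudaStream_t stream) {
  uint32_t* d_layerStart = nullptr;
  cudaMalloc(&d_layerStart, sizeof(phase1PixelTopology::layerStart));
  // Copy the constexpr values to the device buffer on the given stream.
  cudaMemcpyAsync(d_layerStart, phase1PixelTopology::layerStart,
                  sizeof(phase1PixelTopology::layerStart),
                  cudaMemcpyHostToDevice, stream);
  return d_layerStart;
}
```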

Resolves #79.

No changes in the output are expected.

@fwyzard @felicepantaleo @VinInn

@fwyzard

fwyzard commented Jul 31, 2018

Hi @makortel, sorry, could you fix the conflicts that arose from merging #105?

@makortel
Author

Certainly.

@makortel
Author

Rebased.

@fwyzard

fwyzard commented Jul 31, 2018

Validation summary

Reference release CMSSW_10_2_0_pre6 at a674e1f
Development branch CMSSW_10_2_X_Patatrack at a8d41ef
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_0_pre6-PU25ns_102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre6-102X_upgrade2018_realistic_v7-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/6af96ae37b4a3e442238437f2b4dde4078644747/log .

@fwyzard

fwyzard commented Jul 31, 2018

Workflows 10824.5 were already broken before this PR, probably due to #105.
Workflows 10824.8 show the usual small level of irreproducibility.

fwyzard merged commit d2ca336 into cms-patatrack:CMSSW_10_2_X_Patatrack on Jul 31, 2018
@makortel
Author

makortel commented Aug 1, 2018

Hmm, there is still something synchronizing the CUDA streams... Screenshots from the profile of the testing workflow 10824:

Raw2cluster
[profile screenshot]

Rechits
[profile screenshot]

In both, the acquire phase includes all of the queued GPU activity.

In contrast, here is the raw2cluster picture from the development branch
[profile screenshot]

where the acquire clearly ends before the GPU activity ends.

@makortel
Author

makortel commented Aug 1, 2018

And the reason is likely here (still to be verified):
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

Asynchronous

  1. For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

So all cudaMemcpyAsync calls in the GPU->CPU direction had better target memory allocated with cudaMallocHost (pinned host memory).
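
A minimal sketch of the difference (buffer names are illustrative): the same cudaMemcpyAsync call only stays asynchronous when the host destination is pinned memory.

```cpp
// Sketch only: device-to-host cudaMemcpyAsync behaviour depends on the
// destination memory. Buffer names are illustrative.
#include <cstdint>
#include <cuda_runtime.h>
#include <vector>

void transferBack(uint32_t const* d_data, size_t n, cudaStream_t stream) {
  // Pageable destination (plain heap memory, e.g. a normal class member):
  // per the documentation quoted above, the "async" copy returns only once
  // the copy has completed, i.e. it effectively blocks the host thread.
  std::vector<uint32_t> h_pageable(n);
  cudaMemcpyAsync(h_pageable.data(), d_data, n * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);

  // Pinned destination: the copy is merely queued on the stream and the call
  // returns immediately; the data is valid only after the stream work has
  // completed (synchronization, event, or callback).
  uint32_t* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(uint32_t));
  cudaMemcpyAsync(h_pinned, d_data, n * sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);
  // ... use h_pinned after completion, then cudaFreeHost(h_pinned) ...
}
```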

@fwyzard

fwyzard commented Aug 1, 2018

Do we have copies from the GPU to the stack on the CPU, or only to the heap?

@makortel
Author

makortel commented Aug 1, 2018

I'm going through them; so far only to the heap (e.g. normal class member variables). I think we cleaned up the stack ones in the last hackathon.

@fwyzard

fwyzard commented Aug 1, 2018

Can you check if #103 (sorry, not #111) has an impact?
The documentation is not clear on whether we should set the cudaDeviceMapHost flag to support pinned memory in the CUDA runtime.
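
For reference only, setting the flag would look roughly like the sketch below; whether it is actually needed for plain pinned transfers (rather than for mapped, zero-copy host memory) is exactly the open question here.

```cpp
// Sketch only: cudaDeviceMapHost enables mapped (zero-copy) host allocations.
// Device flags are typically set before the first runtime call that
// initializes the CUDA context on the device.
#include <cuda_runtime.h>

void configureDevice(int device) {
  cudaSetDevice(device);
  cudaSetDeviceFlags(cudaDeviceMapHost);
}
```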

@makortel
Author

makortel commented Aug 1, 2018

Did you mean #103? (I can't think how #111 could have an effect.)

At least in raw2cluster, just changing the class member variables to ones allocated with cudaMallocHost did the trick.
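
As an illustration, a change like that can be expressed with a small RAII holder for the pinned buffer used as a data member; the class and member names below are made up for the sketch and do not claim to match the actual Patatrack code.

```cpp
// Sketch only: a minimal RAII wrapper for a cudaMallocHost'ed buffer that can
// replace a plain (pageable) class member. Names are illustrative.
#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class PinnedHostBuffer {
public:
  explicit PinnedHostBuffer(size_t n) : size_(n) {
    cudaMallocHost(&data_, n * sizeof(T));  // pinned, so async D2H copies stay asynchronous
  }
  ~PinnedHostBuffer() { cudaFreeHost(data_); }
  PinnedHostBuffer(PinnedHostBuffer const&) = delete;
  PinnedHostBuffer& operator=(PinnedHostBuffer const&) = delete;

  T* data() { return data_; }
  size_t size() const { return size_; }

private:
  T* data_ = nullptr;
  size_t size_ = 0;
};

// e.g. as a producer data member, sized once for the maximum number of modules:
// PinnedHostBuffer<uint32_t> errorWords_{maxNumberOfModules};
```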

@fwyzard

fwyzard commented Aug 1, 2018

Yes, of course - sorry for the confusion.

@makortel
Author

makortel commented Aug 1, 2018

#103 alone does not fix the issue.

@makortel
Author

makortel commented Aug 1, 2018

Fix is in #112.

fwyzard pushed a commit that referenced this pull request Dec 14, 2018
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
Remove all of the remaining calls to cudaStreamSynchronize() from the pixel "raw to cluster" workflow.

Replace thrust::inclusive_scan with cub::DeviceScan::InclusiveSum to avoid implicit cudaStreamSynchronize and per-event buffer allocations

Avoid a data dependency on the number of hits:
  - in raw2cluster, always transfer the errors for the maximum number of modules.
  - in rechits, replace the calculation of the total number of hits with the total number of clusters

Copy the phase1PixelTopology::layerStart array to the GPU to avoid an extra copy back and forth from the CPU.
fwyzard pushed further commits that referenced this pull request, each with the same squashed commit message as above, on Oct 19, Oct 20, Oct 23, Nov 6, Nov 16 and Dec 25, 2020, and twice on Dec 29, 2020.