Add infrastructure around cub CachingDeviceAllocator, and use it in SiPixelRawToCluster #172

Merged on Nov 27, 2018 (19 commits)

makortel commented Sep 24, 2018

This PR experiments with the cub CachingDeviceAllocator (following the discussion in #138):

  • The CachingDeviceAllocator gets called via CUDAService, and the interface returns a unique_ptr
    • The allocator parameters can be tuned via CUDAService configuration parameters
  • As an experiment, I tested the approach in Raw2Cluster for both the temporary working space and for event data
    • For the event data I took a first step for better placement of the CUDA data formats (under CUDADataFormats/<same sub-package as in DataFormats>) by moving the digi and cluster "products" there (they are still aggregated to a single "GPUProduct" though)
  • As a further experiment, a CachingHostAllocator is added (based on CachingDeviceAllocator) for pinned host memory, and it is used in Raw2Cluster
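
As a rough illustration of the interface described above (a caching allocator whose allocations come back as a `unique_ptr`), here is a minimal host-only sketch. It is not the code in this PR: `CachingAllocator`, `Deleter`, and `cacheHits` are hypothetical names, `std::malloc` stands in for the CUDA allocation calls so the example runs without a GPU, and blocks are matched by exact size, whereas cub groups requests into size bins and also tracks CUDA streams.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical host-only stand-in for a caching allocator: released blocks
// are kept in a size-indexed cache and handed out again on the next request.
class CachingAllocator {
public:
  // Deleter that returns the block to the cache instead of freeing it.
  struct Deleter {
    CachingAllocator* owner = nullptr;
    std::size_t bytes = 0;
    void operator()(void* ptr) const {
      if (owner != nullptr) owner->release(ptr, bytes);
    }
  };
  using Pointer = std::unique_ptr<void, Deleter>;

  ~CachingAllocator() {
    // Really free whatever is still cached.
    for (auto& entry : cache_) std::free(entry.second);
  }

  Pointer allocate(std::size_t bytes) {
    assert(bytes > 0);
    void* ptr = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = cache_.find(bytes);  // exact-size match only (assumption)
      if (it != cache_.end()) {
        ptr = it->second;
        cache_.erase(it);
        ++hits_;
      }
    }
    // Cache miss: a real implementation would call the CUDA allocator here.
    if (ptr == nullptr) ptr = std::malloc(bytes);
    return Pointer(ptr, Deleter{this, bytes});
  }

  std::size_t cacheHits() const { return hits_; }

private:
  void release(void* ptr, std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_.emplace(bytes, ptr);
  }

  std::mutex mutex_;
  std::multimap<std::size_t, void*> cache_;
  std::size_t hits_ = 0;
};
```

The allocator has to outlive every pointer it hands out, since the deleter calls back into it.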

There are many details that can (and maybe should) be discussed.

No changes expected.

@VinInn @fwyzard @felicepantaleo @rovere

cmsbot commented Sep 24, 2018

A new Pull Request was created by @makortel (Matti Kortelainen) for CMSSW_10_2_X_Patatrack.

It involves the following packages:

CUDADataFormats/Common
CUDADataFormats/SiPixelCluster
CUDADataFormats/SiPixelDigi
HeterogeneousCore/CUDAServices
RecoLocalTracker/SiPixelClusterizer
RecoLocalTracker/SiPixelRecHits
SimTracker/TrackerHitAssociation

The following packages do not have a category, yet:

CUDADataFormats/Common
CUDADataFormats/SiPixelCluster
CUDADataFormats/SiPixelDigi
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbot, @fwyzard can you please review it and eventually sign? Thanks.

cms-bot commands are listed here

makortel commented

Then some random thoughts based on the prototype

  • For the event data I feel that a caching allocator for pinned host memory (cudaMallocHost/cudaHostAlloc) could be useful as well
    • Otherwise (given the experience of #125, "Speed up CPU side of GPU rechits") the pinned memory has to be owned by the EDModule
    • AFAICT cub does not have one (although it is trivial to copy-paste the device allocator)
  • A good SoA abstraction would be useful to reduce copy-paste (eventually for both device and pinned host memory)

makortel commented

The last commit should fix the leaks (by really releasing the cached memory).

CUB's tendency to "ignore" CUDA errors (or rather, to break out of a loop silently unless recompiled with -DCUB_STDERR) didn't really help debugging...

fwyzard commented Sep 28, 2018

Tested with various configurations, running over 4000 real data events.
No changes in performance observed.

fwyzard commented Sep 30, 2018

Validation summary

Reference release CMSSW_10_2_5 at a8a031d
Development branch CMSSW_10_2_X_Patatrack at 58a5ecb
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_2-PU25ns_102X_upgrade2018_realistic_v11-v2/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_2-102X_upgrade2018_realistic_v11-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_2-PU25ns_102X_upgrade2018_realistic_v11-v2/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_2-102X_upgrade2018_realistic_v11-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_2-PU25ns_102X_upgrade2018_realistic_v11-v2/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_2-102X_upgrade2018_realistic_v11-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/d3aa4432e3a66fac98d59095492a81fe27dcd608/log .

fwyzard commented Sep 30, 2018

From the validation point of view, the PR is ready to go in.
@VinInn @felicepantaleo @rovere @makortel how do we want to proceed ?

fwyzard commented Dec 3, 2018

Unfortunately, looks like this PR introduced a large tracking inefficiency:

| | 10.4.0-pre2 running 10824.5 | #201 running 10824.8 | #172 running 10824.8 |
|---|---|---|---|
| Efficiency | 0.4852 | 0.4841 | 0.2187 |
| Number of TrackingParticles (after cuts) | 5666 | 5666 | 5666 |
| Number of matched TrackingParticles | 2749 | 2743 | 1239 |
| Fake rate | 0.0537 | 0.0359 | 0.0347 |
| Duplicate rate | 0.0151 | 0.0153 | 0.0130 |
| Number of tracks | 32390 | 31928 | 14567 |
| Number of true tracks | 30652 | 30782 | 14061 |
| Number of fake tracks | 1738 | 1146 | 506 |
| Number of pileup tracks | 26878 | 26985 | 12377 |
| Number of duplicate tracks | 488 | 490 | 189 |

Lesson learned: never merge without re-running the validation on the latest commits...

@makortel , do you have some suggestions where to look ?

fwyzard commented Dec 3, 2018

[image]

makortel commented Dec 3, 2018

> do you have some suggestions where to look ?

Not really, I'll take a look (as well).

fwyzard commented Dec 3, 2018

Mhm, here is the result of zeroing all memory in the allocator before returning it to the requestors:

| | reference | pre-#172 | #172 | #172 with zeroing |
|---|---|---|---|---|
| Efficiency | 0.4852 | 0.4841 | 0.2187 | 0.4841 |
| Number of TrackingParticles (after cuts) | 5666 | 5666 | 5666 | 5666 |
| Number of matched TrackingParticles | 2749 | 2743 | 1239 | 2743 |
| Fake rate | 0.0537 | 0.0359 | 0.0347 | 0.0358 |
| Duplicate rate | 0.0151 | 0.0153 | 0.0130 | 0.0155 |
| Number of tracks | 32390 | 31928 | 14567 | 31928 |
| Number of true tracks | 30652 | 30782 | 14061 | 30786 |
| Number of fake tracks | 1738 | 1146 | 506 | 1142 |
| Number of pileup tracks | 26878 | 26985 | 12377 | 26988 |
| Number of duplicate tracks | 488 | 490 | 189 | 495 |

Looks like some kernel is not properly initialising its memory?
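
One way such a symptom can arise with a caching allocator, sketched below under stated assumptions: a block that comes back from the cache still holds whatever its previous user wrote, so a consumer that relies on a buffer starting at zero only misbehaves once blocks are reused. `BlockCache`, `firstEvent`, and `secondEvent` are hypothetical names, host memory stands in for device memory, and this is not a claim about which kernel was at fault here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical single-size block cache (host memory only).
class BlockCache {
public:
  BlockCache(std::size_t bytes, bool zeroOnAllocate)
      : bytes_(bytes), zero_(zeroOnAllocate) {}

  ~BlockCache() {
    for (void* p : free_) std::free(p);
  }

  void* allocate() {
    void* p = nullptr;
    if (!free_.empty()) {
      p = free_.back();  // reuse a cached block, stale contents included
      free_.pop_back();
    } else {
      p = std::malloc(bytes_);
    }
    assert(p != nullptr);
    if (zero_) std::memset(p, 0, bytes_);  // the "with zeroing" variant
    return p;
  }

  // Keep the block for reuse instead of freeing it.
  void release(void* p) { free_.push_back(p); }

private:
  std::size_t bytes_;
  bool zero_;
  std::vector<void*> free_;
};

// First user of the block: writes a value, then hands the block back.
void firstEvent(BlockCache& cache) {
  auto* counter = static_cast<unsigned int*>(cache.allocate());
  *counter = 42;
  cache.release(counter);
}

// Second user: increments without initialising, i.e. it assumes the buffer
// starts at zero. Call it only after firstEvent() so the block read here is
// a reused one with well-defined (stale) contents.
unsigned int secondEvent(BlockCache& cache) {
  auto* counter = static_cast<unsigned int*>(cache.allocate());
  *counter += 1;
  unsigned int result = *counter;
  cache.release(counter);
  return result;
}
```

Without zeroing, `secondEvent` sees the stale 42 and returns 43; with zeroing it returns 1. Zeroing in the allocator hides the problem, while the actual fix is for the consumer to initialise its own buffer.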

makortel commented Dec 3, 2018

The cause lies in commit 15c15ab (which had also shown somewhat mysterious behaviour earlier). A test run on the commit before it gives ~30k tracks; with it, ~15k.

makortel commented Dec 3, 2018

The fix is in #208.

fwyzard commented Dec 4, 2018

An alternative fix is #209.

fwyzard commented Dec 4, 2018

The fallback solution is reverting #172.

fwyzard pushed a commit that referenced this pull request Oct 8, 2020
…iPixelRawToCluster (#172)

Add infrastructure around cub CachingDeviceAllocator for device
memory allocations, and CachingHostAllocator for pinned (or managed)
host memory.

CUDAService uses the CachingHostAllocator to allocate requested
GPU->CPU/CPU->GPU buffers and data products.
Configuration options can be used to request:
  - to print all memory (re)allocations and frees;
  - to preallocate device and host buffers.

SiPixelRawToCluster uses the CachingDeviceAllocator for temporary
buffers and data products.

Fix a memory problem with SiPixelFedCablingMapGPUWrapper::ModulesToUnpack.