
Fix modulesToUnpack in raw2digi #208

Merged

Conversation

makortel commented Dec 3, 2018

This PR fixes the problem of uninitialized memory reported in #172 (comment) and #172 (comment).

Basically the problem was that the "modules to unpack" buffer was left uninitialized for all but the first event.

@fwyzard @VinInn

fwyzard commented Dec 3, 2018

@makortel vinavx2 seems to be down, so I cannot run the automated validation.
In the meantime, can you elaborate on how this fix, well, fixes things?

makortel commented Dec 3, 2018

gpuModulesToUnpack holds a bool (well, unsigned char) for each module to say whether that module should be unpacked or not.

Before #172, when the memory was allocated only once per job (per EDM stream), in the case where all modules were to be unpacked (regions_ == nullptr), gpuModulesToUnpack was filled only when the FED cabling map was updated.

After #172, the gpuModulesToUnpack was, by mistake, still filled only when the FED cabling map was updated (in practice the first event, at least for MC). Since the memory is now allocated (taken from the cache) for every event, from the second event onwards the memory was left uninitialized. In principle that should lead to random results. Doing a memset to 0 has the same effect as enabling all modules to be unpacked (if I read the code correctly).

This PR is a minimal fix to fill the memory on every event.

Taking another look at the code now, it could be optimized a bit by constructing the host-side array only once (per cabling-map update) and transferring it to the device on every event (which still feels a bit stupid, though).
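
To make this concrete, here is a minimal sketch of the buggy and fixed patterns. The names (kNModules, the produce* functions) are hypothetical stand-ins, not the actual code in SiPixelRawToClusterGPUKernel.cu:

```c++
// Minimal sketch only; names are hypothetical stand-ins for the real code.
#include <cuda_runtime.h>

constexpr unsigned int kNModules = 2000;  // hypothetical module count

// Pattern after #172 (buggy): the device buffer is taken from the caching
// allocator on every event, but filled only when the cabling map changes,
// so from the second event onwards it holds uninitialized memory.
void produceBuggy(bool cablingMapUpdated,
                  const unsigned char* hostModulesToUnpack,
                  unsigned char* gpuModulesToUnpack,
                  cudaStream_t stream) {
  if (cablingMapUpdated) {
    cudaMemcpyAsync(gpuModulesToUnpack, hostModulesToUnpack, kNModules,
                    cudaMemcpyHostToDevice, stream);
  }
  // ... launch the unpacking kernel, which reads gpuModulesToUnpack ...
}

// Pattern in this PR (minimal fix): fill the buffer unconditionally on
// every event, so the freshly cached allocation is always valid.
void produceFixed(const unsigned char* hostModulesToUnpack,
                  unsigned char* gpuModulesToUnpack,
                  cudaStream_t stream) {
  cudaMemcpyAsync(gpuModulesToUnpack, hostModulesToUnpack, kNModules,
                  cudaMemcpyHostToDevice, stream);
  // ... launch the unpacking kernel, which reads gpuModulesToUnpack ...
}
```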

fwyzard commented Dec 3, 2018

OK, this PR does significantly impact the performance.

Before:

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1837.6 ±   2.2 ev/s (4000 events)
  1850.8 ±   2.7 ev/s (4000 events)
  1852.8 ±   2.1 ev/s (4000 events)
  1815.9 ±   3.0 ev/s (4000 events)

After:

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1402.8 ±   0.9 ev/s (4000 events)
  1408.3 ±   0.7 ev/s (4000 events)
  1407.9 ±   0.6 ev/s (4000 events)
  1397.5 ±   1.1 ev/s (4000 events)

fwyzard commented Dec 3, 2018

In this case, I would say that we do want to allocate and fill gpuModulesToUnpack only once...

makortel commented Dec 3, 2018

Yeah, I guess that was the reason why things got faster with 15c15ab in #172.

In this case, I would say that we do want to allocate and fill gpuModulesToUnpack only once...

I can take a look once vinavx2 gets back.

fwyzard commented Dec 4, 2018

vinavx2 is in a sorry state :-(

In the meantime, I attempted a different fix (#209), but it fails miserably when running with multiple streams:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 0 at 04-Dec-2018 12:39:58.318 CET
acquireGPUCuda for CUDA stream 0x2aaaf4595940
produceGPUCuda for CUDA stream 0x2aaaf4595940
acquireGPUCuda for CUDA stream 0x2aaaf45957b8
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu, line 542: cudaErrorInvalidValue: invalid argument

CachingHostAllocator.h, line 542 is:

            // Insert the ready event in the associated stream (must have current device set properly)
            if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

Can you think of any reason why the event or stream would be invalid?
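
For reference, one way this error can arise (an assumption, not a confirmed diagnosis of this crash): a CUDA event is bound to the device that was current when it was created, and cudaEventRecord fails with "invalid resource handle" if the stream belongs to a different device. A caching allocator that reuses a block's ready_event across devices can hit exactly this. A standalone reproducer:

```c++
// Reproduces the "invalid resource handle" failure mode on a multi-GPU
// machine: recording an event created on device 0 into a stream created
// on device 1. This is an illustration, not the confirmed cause here.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int nDevices = 0;
  cudaGetDeviceCount(&nDevices);
  if (nDevices < 2) {
    std::puts("needs at least two GPUs");
    return 0;
  }

  cudaEvent_t event;
  cudaStream_t stream;

  cudaSetDevice(0);
  cudaEventCreate(&event);    // the event is bound to device 0

  cudaSetDevice(1);
  cudaStreamCreate(&stream);  // the stream is bound to device 1

  // Event and stream belong to different devices:
  cudaError_t err = cudaEventRecord(event, stream);
  std::printf("cudaEventRecord: %s\n", cudaGetErrorString(err));
  return 0;
}
```

If that were the mechanism, it would be consistent with the observation below that the problem disappears when running on a single GPU.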

fwyzard commented Dec 4, 2018

In fact, I get the very same error using this PR (compiled with -DCUB_STDERR), with 1 thread and 8 streams:

CUDAService fully initialized
...
Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 0 at 04-Dec-2018 13:54:15.374 CET
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu, line 542: cudaErrorInvalidValue: invalid argument

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
...

fwyzard commented Dec 4, 2018

One hint: even with multiple threads and EDM streams, the problem disappears when running on a single GPU (setting CUDA_VISIBLE_DEVICES=0).

makortel commented Dec 4, 2018

I have a fix for the multi-GPU crashes in #212 (should work both for this PR and #209).

makortel commented Dec 4, 2018

The last commit allocates, fills, and transfers the default case (unpack all modules) once, within the EventSetup.

If it is confirmed to work, I'd prefer this PR over #209, because this PR keeps beginStreamCUDA empty (which would be nice in view of #100). #157 needs to be updated if this gets merged.
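
A rough sketch of that pattern, with a hypothetical class name (the real EventSetup product and its lifetime management differ): the default buffer is built and transferred once per cabling-map update, and per-event code only consumes the device pointer.

```c++
// Rough sketch with a hypothetical class name; built once per cabling-map
// update, so per-event code needs no allocation or transfer of its own.
#include <cuda_runtime.h>
#include <vector>

class ModulesToUnpackDefault {  // hypothetical EventSetup product
public:
  explicit ModulesToUnpackDefault(unsigned int nModules) {
    std::vector<unsigned char> host(nModules, 1);  // 1 = unpack this module
    cudaMalloc(&device_, nModules);
    // One synchronous transfer at construction time, not per event.
    cudaMemcpy(device_, host.data(), nModules, cudaMemcpyHostToDevice);
  }
  ~ModulesToUnpackDefault() { cudaFree(device_); }
  ModulesToUnpackDefault(const ModulesToUnpackDefault&) = delete;
  ModulesToUnpackDefault& operator=(const ModulesToUnpackDefault&) = delete;

  // Per-event code just reads this device pointer.
  const unsigned char* modulesToUnpack() const { return device_; }

private:
  unsigned char* device_ = nullptr;
};
```

The regional case (regions_ != nullptr) would still allocate, fill, and transfer per event.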

fwyzard commented Dec 5, 2018

Here's a comparison of the throughput and memory usage for:

reverting #172

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1795.2 ±   1.6 ev/s (4000 events)
  1803.6 ±   1.9 ev/s (4000 events)
  1799.5 ±   1.6 ev/s (4000 events)
  1796.5 ±   1.3 ev/s (4000 events)

I can run up to 11 concurrent streams on each V100:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3593.4             2.2        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3577.0             2.3        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3554.0             2.6        4000    99.6%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3524.1             2.0        4000    99.6%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job was killed by signal 6
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorGPU.cc, line 324: cudaErrorMemoryAllocation: out of memory

#208 + #212

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1764.0 ±   1.3 ev/s (4000 events)
  1756.2 ±   1.4 ev/s (4000 events)
  1756.3 ±   1.2 ev/s (4000 events)
  1752.7 ±   1.3 ev/s (4000 events)

Here as well I can run up to 11 concurrent streams on each V100:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3510.3             2.2        4000    99.8%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3504.2             1.9        4000    99.5%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3471.3             2.9        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3446.5             2.3        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job failed with return code 66
CUDA error 2 [/mnt/home/fwyzard/cmssw/slc7_amd64_gcc700/external/cub/1.8.0-ikaegh/include/cub/util_allocator.cuh, 446]: out of memory

#209 + #212

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1761.2 ±   1.6 ev/s (4000 events)
  1752.8 ±   1.5 ev/s (4000 events)
  1763.4 ±   1.5 ev/s (4000 events)
  1754.8 ±   1.6 ev/s (4000 events)

Same results regarding the number of streams:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3515.1             2.5        4000    99.2%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3496.2             1.8        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3477.3             2.3        4000    100.0%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3442.7             2.6        4000    100.0%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job failed with return code 66

So, the performance of the two fixes is basically the same.

As for running with multiple GPUs: with #209 + #212 it failed once out of ten or so, with a segmentation fault from the host allocator:

#13 0x00002aaac4728f93 in cub::CachingHostAllocator::HostAllocate (this=0x2aaad2b856b8, d_ptr=d_ptr@entry=0x2aae94df7fe0, bytes=bytes@entry=40, active_stream=0x2aab3b97b8f0) at /mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h:432

makortel commented Dec 5, 2018

@fwyzard Could you provide the full output of "with #209 + #212 it failed once out of ten or so, with a segmentation fault from the host allocator"?

fwyzard mentioned this pull request Dec 5, 2018
fwyzard commented Dec 5, 2018

I haven't been able to reproduce it; let's forget about it for the time being...

fwyzard commented Dec 7, 2018

Validation summary

Reference release CMSSW_10_4_0_pre2 at c6061f4
Development branch CMSSW_10_4_X_Patatrack at e5291a0
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre2-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre2-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre2-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre2-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/7bc063c729692affd3ca72d6fcdaedab4960ef11/log .

fwyzard commented Dec 7, 2018

The validation run at the Flatiron Institute uses two GPUs, so most of the "development" workflows failed outright.

Most of the "testing" workflows succeeded, and they recover the performance of the CPU workflows.
For example, TTbar 10824.8 gives (note that development-10824.8 failed):

                                          reference-10824.5  development-10824.5  testing-10824.8
Efficiency                                           0.4861               0.4852           0.4841
Number of TrackingParticles (after cuts)               5666                 5666             5666
Number of matched TrackingParticles                     2754                 2749             2743
Fake rate                                            0.0537               0.0537           0.0358
Duplicate rate                                       0.0156               0.0151           0.0155
Number of tracks                                      32379                32390            31931
Number of true tracks                                 30641                30652            30787
Number of fake tracks                                  1738                 1738             1144
Number of pileup tracks                               26878                26878            26989
Number of duplicate tracks                              504                  488              496


fwyzard commented Dec 7, 2018

@VinInn @makortel do these numbers look OK to you?
If you agree, I will go ahead and merge #208 and #212 .

VinInn commented Dec 7, 2018 via email

fwyzard merged commit b4843b4 into cms-patatrack:CMSSW_10_4_X_Patatrack Dec 7, 2018
fwyzard commented Dec 7, 2018

Will finally merge then, thank you.

fwyzard added this to the CMSSW_10_4_0_pre3_Patatrack milestone Dec 7, 2018
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
…Heterogeneous (#208)

As an optimisation, move the default non-regional case to the EventSetup, and allocate, fill and transfer event-by-event only for the regional case.