
Fix modulesToUnpack in raw2digi #208

Merged

Conversation

makortel commented Dec 3, 2018

This PR fixes the problem of uninitialized memory reported in #172 (comment) and #172 (comment).

Basically the problem was that the "modules to unpack" buffer was left uninitialized for all but the first event.

@fwyzard @VinInn

fwyzard commented Dec 3, 2018

@makortel vinavx2 seems to be down, so I cannot run the automated validation.
In the meantime, can you elaborate on how this fix, well, fixes things?

makortel commented Dec 3, 2018

gpuModulesToUnpack holds a bool (well, unsigned char) for each module to say whether that module should be unpacked or not.

Before #172, when the memory was allocated only once per job (per EDM stream), in the case where all modules were to be unpacked (regions_ == nullptr), gpuModulesToUnpack was filled only when the FED cabling map was updated.

After #172, the gpuModulesToUnpack was, by mistake, still filled only when the FED cabling map was updated (in practice the first event, at least for MC). Since the memory is now allocated (taken from the cache) for every event, from the second event onwards the memory was left uninitialized. In principle that should lead to random results. Doing a memset to 0 has the same effect as enabling all modules to be unpacked (if I read the code correctly).

This PR is a minimal fix to fill the memory on every event.

Taking another look at the code now, it could be optimized a bit by constructing the host-side array only once (per cabling-map update) and transferring it to the device on every event (which still feels a bit stupid, though).
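
To make this concrete, here is a minimal sketch of the buggy and fixed patterns. The names (kNModules, the produce* functions) are hypothetical stand-ins, not the actual code in SiPixelRawToClusterGPUKernel.cu:

```c++
// Minimal sketch only; names are hypothetical stand-ins for the real code.
#include <cuda_runtime.h>

constexpr unsigned int kNModules = 2000;  // hypothetical module count

// Pattern after #172 (buggy): the device buffer is taken from the caching
// allocator on every event, but filled only when the cabling map changes,
// so from the second event onwards it holds uninitialized memory.
void produceBuggy(bool cablingMapUpdated,
                  const unsigned char* hostModulesToUnpack,
                  unsigned char* gpuModulesToUnpack,
                  cudaStream_t stream) {
  if (cablingMapUpdated) {
    cudaMemcpyAsync(gpuModulesToUnpack, hostModulesToUnpack, kNModules,
                    cudaMemcpyHostToDevice, stream);
  }
  // ... launch the unpacking kernel, which reads gpuModulesToUnpack ...
}

// Pattern in this PR (minimal fix): fill the buffer unconditionally on
// every event, so the freshly cached allocation is always valid.
void produceFixed(const unsigned char* hostModulesToUnpack,
                  unsigned char* gpuModulesToUnpack,
                  cudaStream_t stream) {
  cudaMemcpyAsync(gpuModulesToUnpack, hostModulesToUnpack, kNModules,
                  cudaMemcpyHostToDevice, stream);
  // ... launch the unpacking kernel, which reads gpuModulesToUnpack ...
}
```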

fwyzard commented Dec 3, 2018

OK, this PR does significantly impact the performance.

Before:

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1837.6 ±   2.2 ev/s (4000 events)
  1850.8 ±   2.7 ev/s (4000 events)
  1852.8 ±   2.1 ev/s (4000 events)
  1815.9 ±   3.0 ev/s (4000 events)

After:

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1402.8 ±   0.9 ev/s (4000 events)
  1408.3 ±   0.7 ev/s (4000 events)
  1407.9 ±   0.6 ev/s (4000 events)
  1397.5 ±   1.1 ev/s (4000 events)

fwyzard commented Dec 3, 2018

In this case, I would say that we do want to allocate and fill gpuModulesToUnpack only once...

makortel commented Dec 3, 2018

Yeah, I guess that was the reason why things got faster with 15c15ab in #172.

In this case, I would say that we do want to allocate and fill gpuModulesToUnpack only once...

I can take a look once vinavx2 gets back.

fwyzard commented Dec 4, 2018

vinavx2 is in a sorry state :-(

In the meantime, I attempted a different fix (#209), but it fails miserably when running with multiple streams:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 0 at 04-Dec-2018 12:39:58.318 CET
acquireGPUCuda for CUDA stream 0x2aaaf4595940
produceGPUCuda for CUDA stream 0x2aaaf4595940
acquireGPUCuda for CUDA stream 0x2aaaf45957b8
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu, line 542: cudaErrorInvalidValue: invalid argument

CachingHostAllocator.h, line 542 is:

            // Insert the ready event in the associated stream (must have current device set properly)
            if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

Can you think of any reason why the event or stream would be invalid?
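
For reference, one way this error can arise (an assumption, not a confirmed diagnosis of this crash): a CUDA event is bound to the device that was current when it was created, and cudaEventRecord fails with "invalid resource handle" if the stream belongs to a different device. A caching allocator that reuses a block's ready_event across devices can hit exactly this. A standalone reproducer:

```c++
// Reproduces the "invalid resource handle" failure mode on a multi-GPU
// machine: recording an event created on device 0 into a stream created
// on device 1. This is an illustration, not the confirmed cause here.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int nDevices = 0;
  cudaGetDeviceCount(&nDevices);
  if (nDevices < 2) {
    std::puts("needs at least two GPUs");
    return 0;
  }

  cudaEvent_t event;
  cudaStream_t stream;

  cudaSetDevice(0);
  cudaEventCreate(&event);    // the event is bound to device 0

  cudaSetDevice(1);
  cudaStreamCreate(&stream);  // the stream is bound to device 1

  // Event and stream belong to different devices:
  cudaError_t err = cudaEventRecord(event, stream);
  std::printf("cudaEventRecord: %s\n", cudaGetErrorString(err));
  return 0;
}
```

If that were the mechanism, it would be consistent with the observation below that the problem disappears when running on a single GPU.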

fwyzard commented Dec 4, 2018

In fact, I get the very same error using this PR (compiled with -DCUB_STDERR), with 1 thread and 8 streams:

CUDAService fully initialized
...
Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 0 at 04-Dec-2018 13:54:15.374 CET
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
CUDA error 33 [/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 542]: invalid resource handle
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu, line 542: cudaErrorInvalidValue: invalid argument

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
...

fwyzard commented Dec 4, 2018

One hint: even with multiple threads and EDM streams, the problem disappears when running on a single GPU (setting CUDA_VISIBLE_DEVICES=0).

makortel commented Dec 4, 2018

I have a fix for the multi-GPU crashes in #212 (should work both for this PR and #209).

makortel commented Dec 4, 2018

The last commit allocates, fills, and transfers the default case (unpack all modules) once, within the EventSetup.

If it is confirmed to work, I'd prefer this PR over #209, because this PR keeps beginStreamCUDA empty (which would be nice in view of #100). #157 needs to be updated if this gets merged.
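
A rough sketch of that pattern, with a hypothetical class name (the real EventSetup product and its lifetime management differ): the default buffer is built and transferred once per cabling-map update, and per-event code only consumes the device pointer.

```c++
// Rough sketch with a hypothetical class name; built once per cabling-map
// update, so per-event code needs no allocation or transfer of its own.
#include <cuda_runtime.h>
#include <vector>

class ModulesToUnpackDefault {  // hypothetical EventSetup product
public:
  explicit ModulesToUnpackDefault(unsigned int nModules) {
    std::vector<unsigned char> host(nModules, 1);  // 1 = unpack this module
    cudaMalloc(&device_, nModules);
    // One synchronous transfer at construction time, not per event.
    cudaMemcpy(device_, host.data(), nModules, cudaMemcpyHostToDevice);
  }
  ~ModulesToUnpackDefault() { cudaFree(device_); }
  ModulesToUnpackDefault(const ModulesToUnpackDefault&) = delete;
  ModulesToUnpackDefault& operator=(const ModulesToUnpackDefault&) = delete;

  // Per-event code just reads this device pointer.
  const unsigned char* modulesToUnpack() const { return device_; }

private:
  unsigned char* device_ = nullptr;
};
```

The regional case (regions_ != nullptr) would still allocate, fill, and transfer per event.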

fwyzard commented Dec 5, 2018

Here's a comparison of the throughput and memory usage for:

reverting #172

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1795.2 ±   1.6 ev/s (4000 events)
  1803.6 ±   1.9 ev/s (4000 events)
  1799.5 ±   1.6 ev/s (4000 events)
  1796.5 ±   1.3 ev/s (4000 events)

I can run up to 11 concurrent streams on each V100:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3593.4             2.2        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3577.0             2.3        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3554.0             2.6        4000    99.6%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3524.1             2.0        4000    99.6%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job was killed by signal 6
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorGPU.cc, line 324: cudaErrorMemoryAllocation: out of memory

#208 + #212

Warming up
Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1764.0 ±   1.3 ev/s (4000 events)
  1756.2 ±   1.4 ev/s (4000 events)
  1756.3 ±   1.2 ev/s (4000 events)
  1752.7 ±   1.3 ev/s (4000 events)

Here as well I can run up to 11 concurrent streams on each V100:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3510.3             2.2        4000    99.8%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3504.2             1.9        4000    99.5%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3471.3             2.9        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3446.5             2.3        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job failed with return code 66
CUDA error 2 [/mnt/home/fwyzard/cmssw/slc7_amd64_gcc700/external/cub/1.8.0-ikaegh/include/cub/util_allocator.cuh, 446]: out of memory

#209 + #212

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1761.2 ±   1.6 ev/s (4000 events)
  1752.8 ±   1.5 ev/s (4000 events)
  1763.4 ±   1.5 ev/s (4000 events)
  1754.8 ±   1.6 ev/s (4000 events)

Same results regarding the number of streams:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3515.1             2.5        4000    99.2%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3496.2             1.8        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3477.3             2.3        4000    100.0%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3442.7             2.6        4000    100.0%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
The underlying cmsRun job failed with return code 66

So, the performance of the two fixes is basically the same.

As for running with multiple GPUs: with #209 + #212 it failed once out of ten or so, with a segmentation fault from the host allocator:

#13 0x00002aaac4728f93 in cub::CachingHostAllocator::HostAllocate (this=0x2aaad2b856b8, d_ptr=d_ptr@entry=0x2aae94df7fe0, bytes=bytes@entry=40, active_stream=0x2aab3b97b8f0) at /mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h:432

makortel commented Dec 5, 2018

@fwyzard Could you provide the full output of "with #209 + #212 it failed once out of ten or so, with a segmentation fault from the host allocator"?

fwyzard mentioned this pull request Dec 5, 2018
fwyzard commented Dec 5, 2018

I haven't been able to reproduce it; let's forget about it for the time being...

fwyzard commented Dec 7, 2018

Validation summary

Reference release CMSSW_10_4_0_pre2 at c6061f4
Development branch CMSSW_10_4_X_Patatrack at e5291a0
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre2-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre2-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre2-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre2-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/7bc063c729692affd3ca72d6fcdaedab4960ef11/log .

fwyzard commented Dec 7, 2018

The validation run at the Flatiron Institute uses two GPUs, so most of the "development" workflows failed outright.

Most of the "testing" workflows succeeded, and they recover the performance of the CPU workflows.
For example, TTbar 10824.8 gives (note that development-10824.8 failed):

                                          reference-10824.5  development-10824.5  testing-10824.8
Efficiency                                           0.4861               0.4852           0.4841
Number of TrackingParticles (after cuts)               5666                 5666             5666
Number of matched TrackingParticles                     2754                 2749             2743
Fake rate                                            0.0537               0.0537           0.0358
Duplicate rate                                       0.0156               0.0151           0.0155
Number of tracks                                      32379                32390            31931
Number of true tracks                                 30641                30652            30787
Number of fake tracks                                  1738                 1738             1144
Number of pileup tracks                               26878                26878            26989
Number of duplicate tracks                              504                  488              496


fwyzard commented Dec 7, 2018

@VinInn @makortel do these numbers look OK to you?
If you agree, I will go ahead and merge #208 and #212 .

VinInn commented Dec 7, 2018 via email

fwyzard merged commit b4843b4 into cms-patatrack:CMSSW_10_4_X_Patatrack Dec 7, 2018
fwyzard commented Dec 7, 2018

Will finally merge then, thank you.

fwyzard added this to the CMSSW_10_4_0_pre3_Patatrack milestone Dec 7, 2018
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
…Heterogeneous (#208)

As an optimisation, move the default non-regional case to the EventSetup, and allocate, fill and transfer event-by-event only for the regional case.