
Full workflow on GPU #197

Closed

Conversation


@VinInn commented Nov 18, 2018

With this PR the full workflow (raw -> vertices) can be performed on the GPU only....

More work is clearly needed: besides fixing conflicts...
This PR corresponds to the orange curve in
http://innocent.home.cern.ch/innocent/RelVal/ttbarPU50_Ideal_tkgpuFDR5/plots_pixel_pixel/effandfakePtEtaPhi.pdf
Efficiency vs. fake rate can be tuned at will.


fwyzard commented Dec 14, 2018

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 6f55d70
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/efc7ea0614f89ca603723c9f52ad6fb033b8049d/log .


VinInn commented Dec 14, 2018

This is not freed:

cudaCheck(cudaMalloc((void **) & d_phase1TopologyLayer_, phase1PixelTopology::layer.size() * sizeof(uint8_t)));
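For illustration, a minimal sketch of pairing that allocation with a matching release; whether this is how #216 actually fixes it is not shown here, and where the cleanup lives is an assumption:

// allocation, as above
cudaCheck(cudaMalloc((void **) &d_phase1TopologyLayer_, phase1PixelTopology::layer.size() * sizeof(uint8_t)));

// matching release, e.g. in the destructor of whatever object owns d_phase1TopologyLayer_
cudaCheck(cudaFree(d_phase1TopologyLayer_));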


VinInn commented Dec 14, 2018

"leak" fixed in #216

P.S. I cannot access the logs, plots, etc.


fwyzard commented Dec 14, 2018

P.S. I cannot access the logs, plots, etc.

They should appear shortly.

@fwyzard added the bug label Dec 18, 2018
backport fix from cms-sw#216

fwyzard commented Dec 18, 2018

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 6f55d70
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/4370cd08b17086ff421b3c9ea4e3ef1e4de6b18d/log .

fitter.launchKernels(hh, hh.nHits, CAConstants::maxNumberOfQuadruplets(), cudaStream);
kernels.classifyTuples(hh, gpu_, cudaStream);
}
if (transferToCPU) {
@VinInn commented Dec 19, 2018 on the snippet above


One of these three (or all of them) seems to produce a

Uninitialized access at 0x2aab2b009dc0 on access by cudaMemcopy source

What does it mean?
That we are transferring uninitialized memory?
Yes: the memory is not zeroed (why should it be?) and we fill far fewer tracks than CAConstants::maxNumberOfQuadruplets().
Initializing the memory for each event is out of the question.
If silencing a false positive is a requirement, we can memset these storage areas after allocation.

A reviewer replied:

that we are transferring uninitialized memory?

I think so, yes.

Initializing the memory for each event is out of the question.

Agreed.

If silencing a false positive is a requirement ...

Making the tool happy is not a requirement, but I would say that silencing the error report is one - if only to be able to find real problems in the future.

... we can memset these storage areas after allocation.

Sounds good.
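A minimal sketch of that option, assuming the tuple storage is allocated once at construction rather than per event; d_tuples_ and the Quadruplet element type are placeholder names, and whether this matches the actual change referenced below is not shown here:

// allocate once, at construction time
size_t nBytes = CAConstants::maxNumberOfQuadruplets() * sizeof(Quadruplet);
cudaCheck(cudaMalloc(&d_tuples_, nBytes));
// zero it once here, not per event, so that the later cudaMemcpyAsync of the
// mostly unfilled buffer no longer trips the initcheck tool
cudaCheck(cudaMemset(d_tuples_, 0, nBytes));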

@VinInn (author) replied:

initcheck silenced here: c46e716
(on top of #216 and #236)


VinInn commented Dec 19, 2018

A D2H async memcpy cannot crash a performance workflow, because there it is not executed.
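In other words (a sketch around the if (transferToCPU) guard shown in the review snippet above; the buffer names and sizes are placeholders): the copy is only issued when the transfer to the CPU is requested, and the performance workflow never requests it.

if (transferToCPU) {
  // D2H copy of the tuple storage; skipped entirely in the performance workflow
  cudaCheck(cudaMemcpyAsync(h_tuples_, d_tuples_, nBytes, cudaMemcpyDeviceToHost, cudaStream));
}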


VinInn commented Dec 19, 2018

My understanding was that our conclusion about

========= Barrier error detected. Divergent thread(s) in warp

, in particular on the V100, was that it is a false positive.


fwyzard commented Dec 19, 2018

My understanding was that our conclusion about

========= Barrier error detected. Divergent thread(s) in warp

, in particular on the V100, was that it is a false positive.

I don't think we can claim it is always a false positive, but I agree (and I opened a bug report with NVIDIA a few weeks ago, though it hasn't had much activity that I can see).

One possibility could be to run the tests on a P100 instead, and if that does not report any problems, assume the behaviour to be correct on the V100 as well. If @felicepantaleo or @makortel have any suggestions, they are welcome.
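For reference, this is the kind of pattern that synccheck reports as a divergent barrier: a __syncthreads() reached by only part of a warp. The kernel below is a hypothetical illustration, not code from this PR, and whether a given report corresponds to a real hazard is exactly the question discussed above.

__global__ void divergentBarrier(int *out) {
  // only even-numbered threads reach the barrier, so threads within the same
  // warp diverge around __syncthreads(); synccheck reports this as
  // "Barrier error detected. Divergent thread(s) in warp"
  if (threadIdx.x % 2 == 0) {
    out[threadIdx.x] = threadIdx.x;
    __syncthreads();
  }
}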


VinInn commented Dec 19, 2018

I ran the performance workflow on workergpu13 with 32 threads and 16 EDM streams (on 4 GPUs),
and ZMuMu crashes (50% of the time), not necessarily at the first event, always in

CUDA error 33 [/data/user/fwyzard/patatrack/build/slc7_amd64_gcc700.patatrack/tmp/BUILDROOT/344a52fa455e34923c6647f1154d765f/opt/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_4_0_pre3_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 551]: invalid resource handle
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  invalid resource handle

TTbar PU50 crashes as well (500 events only),
but without the CUDA error message.
Log files are in /mnt/home/innocent/mc:
fromtmp*.log, pu50*.log


VinInn commented Dec 19, 2018

I got

Begin processing the 1301st record. Run 1, Event 12711, LumiSection 128 on stream 4 at 19-Dec-2018 14:04:48.670 EST
CUDA error 33 [/data/user/fwyzard/patatrack/build/slc7_amd64_gcc700.patatrack/tmp/BUILDROOT/97ed725b1553863373e93f508f3e9e35/opt/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_4_0_pre4_Patatrack/src/HeterogeneousCore/CUDAServices/src/CachingHostAllocator.h, 551]: invalid resource handle
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  invalid resource handle

running the performance workflow (32 threads, 16 streams) on ZMuMu from a pristine CMSSW_10_4_0_pre4_Patatrack area.
Log in /mnt/home/innocent/mc/refzmm.log

Also one without the CUDA error:

Begin processing the 401st record. Run 1, Event 10837, LumiSection 109 on stream 3 at 19-Dec-2018 14:15:51.164 EST


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Wed Dec 19 14:15:53 EST 2018

in refzmm_3.log

@makortel

I finally managed to get a crash on cmg1080 with proper debug prints, and I think I have an idea what is going on (there is indeed a race for the CUDA event between HostFree() and HostAllocate()). I believe the CachingDeviceAllocator suffers from the same problem as well. Fixes follow shortly.
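A simplified sketch of the pattern involved; this is a hypothetical pinned-memory pool, not the actual CachingHostAllocator nor the fix in #237. The free side records an event on the stream and the allocate side checks that event before reusing the block; if the two sides do not coordinate, the event can be queried or destroyed while the other side is still using it, which would be consistent with the 'invalid resource handle' errors above.

#include <cuda_runtime.h>
#include <mutex>
#include <vector>

struct CachedBlock {
  void *ptr;
  size_t size;
  cudaEvent_t ready;  // signalled once all work queued before the free has completed
};

static std::mutex cacheMutex;
static std::vector<CachedBlock> cache;

void hostFree(void *ptr, size_t size, cudaStream_t stream) {
  CachedBlock block{ptr, size, nullptr};
  cudaEventCreateWithFlags(&block.ready, cudaEventDisableTiming);
  cudaEventRecord(block.ready, stream);  // block is reusable only after this event fires
  std::lock_guard<std::mutex> lock(cacheMutex);
  cache.push_back(block);
}

void *hostAllocate(size_t size) {
  std::lock_guard<std::mutex> lock(cacheMutex);
  for (auto it = cache.begin(); it != cache.end(); ++it) {
    // reuse only blocks whose event has completed; touching the event outside
    // the lock, or while hostFree is still setting it up, is the kind of race
    // that ends in errors like "invalid resource handle"
    if (it->size >= size && cudaEventQuery(it->ready) == cudaSuccess) {
      void *ptr = it->ptr;
      cudaEventDestroy(it->ready);
      cache.erase(it);
      return ptr;
    }
  }
  void *ptr = nullptr;
  cudaMallocHost(&ptr, size);  // nothing suitable cached: allocate fresh pinned memory
  return ptr;
}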

@makortel

The fix is in #237.


fwyzard commented Jan 8, 2019

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 68f320f
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/f2f0ba95b164f8f65ac3a363fd6e574f8f562fa1/log .


fwyzard commented Jan 8, 2019

Closed in favour of #216 to recover the higher throughput.
