Seg-fault in one HLT job of run-388769 (pixel clusters) #47021

Open
missirol opened this issue Dec 20, 2024 · 5 comments

missirol commented Dec 20, 2024

Among the many HLT crashes in run-388769 (PbPb collisions in 2024), one looked different from those reported in #46783. It was a segmentation violation; the original stack trace can be found here: old_hlt_run388769_pid4080142.log. It contains:

Thread 15 (Thread 0x7fea91bff700 (LWP 4082750) "cmsRun"):
#0  0x00007feb217890e1 in poll () from /lib64/libc.so.6
#1  0x00007feb0c32e6e7 in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007feb0c32e8e4 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fea2aec8070 in SiPixelDigisClustersFromSoAAlpaka<pixelTopology::HIonPhase1>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so
#5  0x00007feb2421cca2 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#6  0x00007feb2421613c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so

[...]

Current Modules:

Module: SiPixelDigisClustersFromSoAAlpakaHIonPhase1:hltSiPixelClustersPPOnAA (crashed)

Using the input file in question, the crash can be reproduced with the script in [1] (tested on lxplus800 with GPU offloading enabled, using CMSSW_15_0_0_pre1; I used this pre-release only for convenience, and the behavior is the same in 14_1_X as far as I can see). Note that the output of the reproducer is not always the same: sometimes it crashes with an output like [2], while other times it ends with the same exception as in #46783.

The problem seems to be related to the pixel local reconstruction. I'm opening a separate issue in case the problem behind this crash is not exactly the same as the problem behind #46783.

[1]

#!/bin/bash

# cmsrel CMSSW_15_0_0_pre1
# cd CMSSW_15_0_0_pre1/src
# cmsenv

hltLabel=hlt
hltMenu=/dev/CMSSW_14_2_0/HIon/V11
globalTag=141X_dataRun3_HLT_v2

hltGetConfiguration \
  "${hltMenu}" \
  --globaltag "${globalTag}" \
  --data \
  --no-prescale \
  --no-output \
  --max-events 1 \
  --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run388769/run388769_ls0254_index000051_fu-c2b05-16-01_pid4080142.root \
  --path 'HLT_HIUPC_ZDC1nOR_MBHF1AND_PixelTrackMultiplicity40400_v*' \
  > "${hltLabel}".py

cat <<@EOF >> "${hltLabel}".py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.options.accelerators = ['*']

del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')

process.source.skipEvents = cms.untracked.uint32( 64 )
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${hltLabel}".py &> "${hltLabel}".log

[2]

Begin processing the 1st record. Run 388769, Event 336591255, LumiSection 254 on stream 0 at 20-Dec-2024 19:13:29.268 CET
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- Begin Fatal Exception 20-Dec-2024 19:14:01 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 388769 lumi: 254 event: 336591255 stream: 0
   [1] Running path 'HLT_HIUPC_ZDC1nOR_MBHF1AND_PixelTrackMultiplicity40400_v2'
   [2] Calling method for module SiPixelRawToClusterHIonPhase1@alpaka/'hltSiPixelClustersPPOnAASoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------
20-Dec-2024 19:14:01 CET  Closed file root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run388769/run388769_ls0254_index000051_fu-c2b05-16-01_pid4080142.root
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_15_0_0_pre1-el8_amd64_gcc12/build/CMSSW_15_0_0_pre1-build/el8_amd64_gcc12/external/alpaka/1.1.0-8e7128ba865cc169d302ab17150849de/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
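
Since the reproducer does not always fail in the same way, it may help to run it several times and count how often each outcome occurs. The following is only a sketch under stated assumptions: `classify_log` and `tally_runs` are hypothetical helpers (not part of CMSSW), the `cudaErrorIllegalAddress` pattern is taken from the output in [2], and the `segmentation violation` pattern is an assumption about how the seg-fault shows up in the log. It also assumes `hlt.py` has already been produced by the script in [1].

```bash
#!/bin/bash

# Classify one cmsRun log by the failure signature it contains.
# The patterns are assumptions: "cudaErrorIllegalAddress" comes from output [2];
# "segmentation violation" is assumed to be how the seg-fault is reported.
classify_log() {
  local log="$1"
  if grep -q "cudaErrorIllegalAddress" "${log}"; then
    echo "cuda-illegal-address"
  elif grep -qi "segmentation violation" "${log}"; then
    echo "segfault"
  elif grep -q "Begin Fatal Exception" "${log}"; then
    echo "other-exception"
  else
    echo "no-failure-found"
  fi
}

# Run the reproducer nRuns times (default 5) and tally the outcomes.
# Hypothetical helper; each run writes its own log file hlt_<i>.log.
tally_runs() {
  local nRuns="${1:-5}"
  local i
  for i in $(seq 1 "${nRuns}"); do
    CUDA_LAUNCH_BLOCKING=1 cmsRun hlt.py &> "hlt_${i}.log"
    classify_log "hlt_${i}.log"
  done | sort | uniq -c
}
```

Calling `tally_runs 10` would print one line per outcome with its count. The CUDA check comes first because a log like [2] contains both the CUDA error and a fatal-exception banner.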

cmsbuild commented Dec 20, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @missirol.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign hlt, reconstruction, heterogeneous

@makortel

@cms-sw/trk-dpg-l2

@cmsbuild

New categories assigned: hlt,reconstruction,heterogeneous

@fwyzard,@jfernan2,@makortel,@mandrenguyen,@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
