Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Jobs with failures in CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut #46757

Open
LinaresToine opened this issue Nov 21, 2024 · 10 comments

Comments

@LinaresToine
Copy link

Hello all,

I am opening this issue for further investigation of the ongoing Long Jobs that have been seen at Tier 0. The original report is in cms-talk, where @makortel suggests an issue with module CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut taking too long with some events.

The report features a paused job with logs and tarballs of a couple of retries in
/eos/user/c/cmst0/public/PausedJobs/HIRun2024A/LongJob/job_2777101

A long job from HIRun2024B, with a failure in the same module, has also been retried twice. The tarball for such job is in
/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz

Could an expert please take a look?

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 21, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @LinaresToine.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

@cms-sw/tracking-pog-l2

@makortel
Copy link
Contributor

makortel commented Nov 21, 2024

The report features a paused job with logs and tarballs of a couple of retries in
/eos/user/c/cmst0/public/PausedJobs/HIRun2024A/LongJob/job_2777101

In this case it looked like the event 136410066:93:388171 took very long time (more than 2 days) to process (see https://cms-talk.web.cern.ch/t/paused-promptreco-job-for-run-388171-stuck-for-40h-after-disconnection/66372/2 for more details). In both cases the job was killed while only CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut was running, with the following stack trace

Thread 9 (Thread 0x147c4d1ff700 (LWP 1098) "cmsRun"):
#0  0x0000147ca6790098 in nanosleep () from /lib64/libc.so.6
#1  0x0000147ca678ff9e in sleep () from /lib64/libc.so.6
#2  0x0000147ca2e201d0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000147c28cbec84 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#5  0x0000147c28cbea65 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#6  0x0000147c28cc01ca in TrajectorySegmentBuilder::segments(TrajectoryStateOnSurface) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#7  0x0000147c28cc0a8d in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#8  0x0000147c28cc181a in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#9  0x0000147c28cb5e44 in GroupedCkfTrajectoryBuilder::rebuildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#10 0x0000147c2ce499a8 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#11 0x0000147c2ce4a89b in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#12 0x0000147ca9458513 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x0000147ca943fa6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x0000147ca93c42c9 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x0000147ca93c47c1 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x0000147ca9635238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#17 0x0000147ca878bb3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x147bf1411200, waiter=..., this=0x147ca51ce780) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x147ca51ce780) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#20 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#21 0x0000147ca878dcee in tbb::detail::r1::rml::private_worker::run (this=0x147ca31d5f00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#22 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x147ca31d5f00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#23 0x0000147ca6a661ca in start_thread () from /lib64/libpthread.so.0
#24 0x0000147ca66c18d3 in clone () from /lib64/libc.so.6

@makortel
Copy link
Contributor

A long job from HIRun2024B, with a failure in the same module, has also been retried twice. The tarball for such job is in
/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz

Also here only CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut was still running at the time of the kill, with a stack trace of

hread 5 (Thread 0x149ff6fff700 (LWP 1167) "cmsRun"):
#0  0x000014a04fd3d098 in nanosleep () from /lib64/libc.so.6
#1  0x000014a04fd3cf9e in sleep () from /lib64/libc.so.6
#2  0x000014a04c0351d0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000149fcf2bac8e in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#5  0x0000149fcf2baa65 in TrajectorySegmentBuilder::addGroup(TempTrajectory const&, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >, __gnu_cxx::__normal_iterator<TrajectoryMeasurementGroup const*, std::vector<TrajectoryMeasurementGroup, std::allocator<TrajectoryMeasurementGroup> > >) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#6  0x0000149fcf2bc1ca in TrajectorySegmentBuilder::segments(TrajectoryStateOnSurface) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#7  0x0000149fcf2bca8d in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#8  0x0000149fcf2bd81a in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const [clone .constprop.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#9  0x0000149fcf2b1e44 in GroupedCkfTrajectoryBuilder::rebuildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/pluginRecoTrackerCkfPatternPlugins.so
#10 0x0000149fd33fb9a8 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#11 0x0000149fd33fc89b in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so
#12 0x000014a0527f7513 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x000014a0527dea6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x000014a0527632c9 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x000014a0527637c1 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x000014a0524e2238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_5/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#17 0x000014a050ed3b3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x14a04e432a00, waiter=..., this=0x14a04e5ca700) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x14a04e5ca700) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#20 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#21 0x000014a050ed5cee in tbb::detail::r1::rml::private_worker::run (this=0x14a04c3f2f80) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#22 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x14a04c3f2f80) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#23 0x000014a0500131ca in start_thread () from /lib64/libpthread.so.0
#24 0x000014a04fc6e8d3 in clone () from /lib64/libc.so.6

In this case the long-running event seems to be 171352672:140:388306.

@makortel
Copy link
Contributor

@cms-sw/reconstruction-l2 @cms-sw/tracking-pog-l2 Could you someone take a look of what is going on with these long-running events?

@slava77
Copy link
Contributor

slava77 commented Dec 12, 2024

/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz

the file does not exist

ls /eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/

PromptReco_Run388306_HIPhysicsRawPrime44  PromptReco_Run388369_HIPhysicsRawPrime30  PromptReco_Run388369_HIPhysicsRawPrime49

how many events or lumisections are in this job?
The stream rate HIPhysicsRawPrime looks like 500Hz, so it would be 10K ev/lumi. A few very central events with some muons can easily accumulate and take a while to process.
I don't see any pp_on_AA modifications in seeding or building in RecoTracker/IterativeTracking/python/MuonSeededStep_cff.py ; so, this could be slow

@LinaresToine
Copy link
Author

LinaresToine commented Dec 13, 2024

We have notice many paused jobs with the same pattern, and all tarballs have been moved to

/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/ForceDisconnect-Socket/

The missing tarball mentioned by @slava77 is there as well:

/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/ForceDisconnect-Socket/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz

Regarding this job, it has 12204 events, and I only see one lumisection.

I'd also like to bring to attention one of the many more occurrences. This one apparently used two RAW files, one after the other, and exhibited long periods of time of inactivity, which is consistent with the processing of heavy events according to @makortel. The tarball of this job has been isolated in

/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/ForceDisconnect-Socket/sample-two-files/2d122085-a66f-4978-8fd8-c5d9b595887a-6-3-logArchive.tar.gz

So far we have 71 paused jobs that match the pattern studied in this issue, and all input files add up to approximately 400 thousand events.
Finally, for further study, some of the raw files have been placed in

/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/ForceDisconnect-Socket/input-RAW

Please let us know if we can fail the jobs, or if there is something to do about them.

Antonio for T0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants