Long Jobs with failures in CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut
#46757
A new Issue was created by @LinaresToine. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction |
New categories assigned: reconstruction @jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@cms-sw/tracking-pog-l2 |
In this case it looked like the event
|
Also here only
In this case the long-running event seems to be |
@cms-sw/reconstruction-l2 @cms-sw/tracking-pog-l2 Could someone take a look at what is going on with these long-running events? |
the file does not exist
how many events or lumisections are in this job? |
We have noticed many paused jobs with the same pattern, and all tarballs have been moved to
The missing tarball mentioned by @slava77 is there as well:
Regarding this job, it has 12204 events, and I only see one lumisection. I'd also like to bring to attention one of the many other occurrences. This one apparently used two RAW files, one after the other, and exhibited long periods of inactivity, which is consistent with the processing of heavy events according to @makortel. The tarball of this job has been isolated in
So far we have 71 paused jobs that match the pattern studied in this issue, and all input files add up to approximately 400 thousand events.
Please let us know if we can fail the jobs, or if there is something to do about them. Antonio for T0 |
Hello all,
I am opening this issue for further investigation of the ongoing Long Jobs that have been seen at Tier 0. The original report is in cms-talk, where @makortel suggests an issue with module
CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut
taking too long with some events. The report features a paused job, with logs and tarballs from a couple of retries, in
/eos/user/c/cmst0/public/PausedJobs/HIRun2024A/LongJob/job_2777101
A long job from HIRun2024B, with a failure in the same module, has also been retried twice. The tarball for that job is in
/eos/home-c/cmst0/public/PausedJobs/HIRun2024B/LongJob/36388e0c-9895-41d8-b59f-a0d7393c5508-139-1-logArchive.tar.gz
Could an expert please take a look?