Memory problem in ReReco-Run2024C-JetMET1 pilot #46901
Comments
assign core |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
cms-bot internal usage |
A new Issue was created by @makortel. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Some overall properties of the job
|
perhaps it's wiser to use 8 cores like the tier0 does? That would gain quite a lot of memory headroom per core |
The failed job was configured to use 8 threads. |
whoops - i was fooled by some of the plots - sorry for the noise. |
First observation with @Dr15Jones was a rediscovery of #46526 (comment) that was fixed in #46543 (14_2_X) / #46567 (14_1_X). The fix is being backported to 14_0_X as part of #46903. |
I just made a quick plot from the printout of SimpleMemoryCheck. This PR (#46903) should help reduce the memory use, but it does not fix the leak. |
Hi @Dr15Jones Thx. Which backport are you testing? The PR I made, or also the other PRs that @makortel proposed to backport? |
Thanks @makortel for the script to make a plot. I ran with the backport PR, and it shows that the PR helps to reduce the memory. |
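For readers who want to reproduce this kind of check, here is a minimal sketch (not the script mentioned above) that pulls RSS values out of a job log and estimates the average growth per sample with a least-squares fit. The line shape `VSIZE <value> <delta> RSS <value> <delta>` is an assumption about the SimpleMemoryCheck printout and should be checked against the actual log; the names `parse_rss` and `slope_per_sample` are hypothetical.

```python
import re

# ASSUMPTION: each memory report line contains
#   "... VSIZE <value> <delta> RSS <value> <delta>"
# Verify this against the real SimpleMemoryCheck output before use.
_PATTERN = re.compile(r"VSIZE\s+([\d.]+)\s+\S+\s+RSS\s+([\d.]+)")


def parse_rss(lines):
    """Return the RSS values found in the log, in order of appearance."""
    values = []
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            values.append(float(match.group(2)))
    return values


def slope_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up sample lines for illustration only, not taken from the failed job.
sample_log = [
    "MemoryCheck: module A:a VSIZE 1000.0 0 RSS 100.0 0",
    "unrelated log line",
    "MemoryCheck: module B:b VSIZE 1001.0 1 RSS 101.0 1",
    "MemoryCheck: module C:c VSIZE 1002.0 1 RSS 102.0 1",
    "MemoryCheck: module D:d VSIZE 1003.0 1 RSS 103.0 1",
]

rss = parse_rss(sample_log)
growth = slope_per_sample(rss)
```

A slope that stays clearly positive over many samples hints at a leak rather than a one-off allocation, though RSS alone cannot tell which module is responsible. The resulting list can be passed to any plotting tool.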
I was able to uncover a ~ 1k/event memory leak here cmssw/EventFilter/L1TRawToDigi/plugins/implementations_stage2/RegionalMuonGMTUnpacker.cc Lines 47 to 48 in 9a4b9e4
This was found using the prototype ModuleEventAllocMonitor |
#46918 fixes the problem in master. |
So after applying the backport #46903 and the memory leak fix #46918 (the latter having a much smaller effect), I see that the allocations (using the AllocMonitor system to record new/delete calls) show much more stable behavior and comparing the final results for RSS and allocations gives Here I'm just processing the first file in the job and I'm reading that file locally. |
Hi @Dr15Jones |
Here is another backport I made, #46942. It includes
|
Note to @antoniovilela @mandrenguyen We need to cut the release 14_0 with memory fix PR early next week. Currently we have |
The relevant backport is #46950 |
A pilot job of the Run2024C ReReco was killed because of using too much memory
https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2024C_JetMET1_pilot_241122_102431_7689/50660/DataProcessing/
The job was run in CMSSW_14_0_19.
This issue is about studying the memory behavior of the job above.
This problem is reported also in https://its.cern.ch/jira/browse/CMSCOMPPR-56784 and https://gitlab.cern.ch/groups/cms-ppd/-/epics/12.