Memory Issue with RunIII2024Summer24DRPremix #46975

Open
vlimant opened this issue Dec 17, 2024 · 18 comments

@vlimant
Contributor

vlimant commented Dec 17, 2024

Given the current interest in memory leaks (#46901), I am posting here a report of large memory usage in MC production in the Summer24 campaign.
Using this workflow as an example (there are several others with the same symptom): https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192
The error report https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192#GEN-RunIII2024Summer24wmLHEGS-00051_0 shows a good fraction of jobs going beyond 8 GB with 4 cores.
Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

and the cmsRun2 step indeed gets interrupted early: https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/cmsRun2-stdout.log

%MSG-w MemoryCheck:   alpaka_serial_sync::CAHitNtupletAlpakaPhase1:hltPixelTracksSoASerialSync  05-Dec-2024 08:00:37 UTC Run: 1 Event: 1556056481
MemoryCheck: module alpaka_serial_sync::CAHitNtupletAlpakaPhase1:hltPixelTracksSoASerialSync VSIZE 11337 0.09375 RSS 8032.2 -150.844
%MSG
Begin processing the 162nd record. Run 1, Event 1556056498, LumiSection 339751 on stream 3 at 05-Dec-2024 08:00:38.372 UTC
Begin processing the 163rd record. Run 1, Event 1556056512, LumiSection 339751 on stream 0 at 05-Dec-2024 08:00:38.818 UTC
Begin processing the 164th record. Run 1, Event 1556056550, LumiSection 339751 on stream 1 at 05-Dec-2024 08:00:39.121 UTC
Begin processing the 165th record. Run 1, Event 1556056562, LumiSection 339751 on stream 2 at 05-Dec-2024 08:00:39.599 UTC
%MSG-s ShutdownSignal:  PostProcessPath 05-Dec-2024 08:00:41 UTC  PostProcessEvent
an external signal was sent to shutdown the job early.
%MSG
05-Dec-2024 08:00:42 UTC  Closed file file:../cmsRun1/RAWSIMoutput.root

(The memory issue is not necessarily related to the module named in the last MemoryCheck line, though.)

Generic cmsDriver command:

cmsDriver.py step1 --fileout file:GEN-RunIII2024Summer24DRPremix-00048_step1.root  --pileup_input "dbs:/Neutrino_E-10_gun/RunIIISummer24PrePremix-Premixlib2024_140X_mcRun3_2024_realistic_v26-v1/PREMIX" --mc --eventcontent PREMIXRAW --datatier GEN-SIM-RAW --conditions 140X_mcRun3_2024_realistic_v26 --step DIGI,DATAMIX,L1,DIGI2RAW,HLT:2024v14 --procModifiers premix_stage2 --nThreads 4 --geometry DB:Extended --datamix PreMix --era Run3_2024 

and the configuration file: https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/09cd8d21fa53b131732b49d6e27a16d1/configFile

For that particular failed cmsRun2, the PSet is available as https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.pkl and https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.py.

Could someone look into this?

Side note, to be propagated to WMCore: I note that even though cmsRun2 was killed, the subsequent steps (cmsRun3, 4, 5) are run regardless, and given that the full job is marked as failed, their output will just be sent to the bin. I wonder how (in)efficient it is to keep running steps whose output will be tossed away.

@cmsbuild
Contributor

cmsbuild commented Dec 17, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @vlimant.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@vlimant changed the title from "Memory Issue with RunIII2024Summer24wmLHEGS" to "Memory Issue with RunIII2024Summer24DRPremix" on Dec 17, 2024
@srimanob
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

Plotting the RSS and VSIZE from SimpleMemoryCheck periodic printouts for the cmsRun2 step for all the grid jobs reported behind the link gives

[plot: RSS and VSIZE vs. time for the cmsRun2 step of all the grid jobs]

It seems to me that the overall memory footprint is too much for the 8 GB limit with 4 cores.
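For reference, a minimal sketch of how such RSS/VSIZE curves can be extracted from the cmsRun stdout logs, assuming the MemoryCheck line format shown in the excerpt above (this is not the exact script used for the plot):

```python
# Sketch: extract RSS and VSIZE (in MB) from the SimpleMemoryCheck printouts in a
# cmsRun stdout log and plot them, assuming lines of the form
# "MemoryCheck: module <label> VSIZE <val> <delta> RSS <val> <delta>"
# as in the log excerpt above.
import re
import sys

import matplotlib.pyplot as plt

pattern = re.compile(r"MemoryCheck: module .* VSIZE\s+([\d.]+)\s+\S+\s+RSS\s+([\d.]+)")

vsize, rss = [], []
with open(sys.argv[1]) as log:  # e.g. cmsRun2-stdout.log
    for line in log:
        m = pattern.search(line)
        if m:
            vsize.append(float(m.group(1)))
            rss.append(float(m.group(2)))

plt.plot(vsize, label="VSIZE [MB]")
plt.plot(rss, label="RSS [MB]")
plt.axhline(8000, linestyle="--", color="gray", label="8 GB job limit")
plt.xlabel("MemoryCheck printout")
plt.ylabel("memory [MB]")
plt.legend()
plt.show()
```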

@vlimant
Contributor Author

vlimant commented Dec 17, 2024

We would probably need the same for a successful job then, to figure out why the production is not failing all over. There must be something specific to the ~3% of input files (the overall failure rate of the workflow) that makes those jobs start so high in memory usage.

@makortel
Contributor

makortel commented Dec 18, 2024

Is there any correlation between failures and sites?

@makortel
Contributor

I ran the job from a809f51a-6d7c-4c21-8412-518d2b9a3a56-433-0-logArchive (brown curve in #46975 (comment), which had the highest memory usage) locally. I had run step 1 (GEN-SIM) locally, and copied the 4 PREMIX files locally. The mixing of the pileup PREMIX events into the signal GEN-SIM is nevertheless likely to be different from the job in production. The job used the same 4 threads and streams as in production. I ran the test on a node running el8 natively (this can be relevant because we have seen elsewhere that el9 hosts lead to higher RSS/VSIZE).

I got this
[plot: RSS and VSIZE of the local rerun with locally copied pileup files]

I then reran the job reading the same files over xrootd, and again without modifying the pileup file list, and got this
[plot: RSS and VSIZE when reading the same pileup files over xrootd, and with the unmodified pileup file list]

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

@makortel
Contributor

makortel commented Dec 18, 2024

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

This behavior is reproducible with 1 thread (extra cost being around 900 MB), and is visible at the level of memory allocations (e.g. with MaxMemoryPreload AllocMonitor).

@makortel
Contributor

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

This behavior is reproducible with 1 thread (extra cost being around 900 MB), and is visible at the level of memory allocations (e.g. with MaxMemoryPreload AllocMonitor).

I think I found the culprit. Comparing IgProf live profiles after the 10th event (running with 1 thread) between 1 local pileup file and all xrootd pileup files shows a 488 MB increase (per stream!) in the edm::InputFileCatalog constructor via EmbeddedRootSource.

The job was configured with 499465 pileup files, translating to ~1025 bytes per file.
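A quick back-of-the-envelope check of that per-file figure (treating the 488 MB as MiB):

```python
# Back-of-the-envelope: 488 MB per-stream increase spread over 499465 pileup files
extra_bytes = 488 * 1024 * 1024
n_files = 499465
print(extra_bytes / n_files)  # ~1024.5 bytes per file
```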

On a quick look I see some potential (and confusing) duplication of file name data in the InputFileCatalog, but improving anything there needs to be done with care.

Another quick thought would be to avoid replicating the InputFileCatalog across streams (every stream sees the same files in any case). This approach would require some moderately complex refactoring in the BMixingModule base class, the PileUp helper class, and EmbeddedRootSource.

I also can't avoid asking if the scale of 499465 pileup files is really something that every job has to see?

@dan131riley

Another quick thought would to be to avoid replicating the InputFileCatalog across streams (every stream sees the same files in any case). This approach would require some moderately complex refactoring in BMixingModule base class, PileUp helper class, and EmbeddedRootSource.

There are some other changes I'd like to make to EmbeddedRootSource to share information across streams (e.g., the mapping from the file identifier to the filename ought to be cached and shared). Refactoring the InputFileCatalog would be a natural fit.

@makortel
Contributor

Is there any correlation between failures and sites?

Now that a cause is known, there can be a site dependence in the memory usage. The InputFileCatalog constructor resolves all PFNs for each LFN, so more memory will be used at sites that define many <catalog> elements inside <data-access> in their site-local-config.xml compared to sites that define only one or two.
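As a rough way to check a given site, one can count the <catalog> entries in its site-local-config.xml; a minimal sketch (element names follow the description above; the example CVMFS path in the comment is an assumption and may differ per deployment):

```python
# Sketch: count the <catalog> elements under <data-access> in a site-local-config.xml,
# as a rough proxy for how many PFNs the InputFileCatalog resolves per LFN at that site.
import sys
import xml.etree.ElementTree as ET

# e.g. /cvmfs/cms.cern.ch/SITECONF/T1_ES_PIC/JobConfig/site-local-config.xml
path = sys.argv[1]
catalogs = ET.parse(path).getroot().findall(".//data-access/catalog")
print(f"{path}: {len(catalogs)} catalog element(s)")
```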

@vlimant
Contributor Author

vlimant commented Dec 19, 2024

nice!

@makortel
Contributor

makortel commented Dec 19, 2024

On a quick look I see some potential (and confusing) duplication of file name data in the InputFileCatalog, but improving anything there needs to be done with care.

#47013 removes the duplication (really triplication) of file name data; that seemed simple enough to be done quickly and to be backported to 14_0_X and 14_1_X. MaxMemoryPreload showed a 197 MB reduction per stream, so on a 4-thread/4-stream job that would translate to ~787 MB. On a local test at CERN the RSS and VSIZE decreased like this
[plot: RSS and VSIZE before and after #47013 in the local test at CERN]

There is more potential (like ~900 MB on 4 streams) with #46975 (comment), but that will take more time. I hope #47013 will at least allow more jobs to stay under the memory limit, if not avoid all the failures.

I can think of further memory reduction options in InputFileCatalog as well, but those would involve less clear tradeoffs than #47013, and should be followed up separately.

@makortel
Contributor

Is there any correlation between failures and sites?

Now that a cause is known, there can be a site dependence in the memory usage. The InputFileCatalog constructor resolves all PFNs for each LFN, so more memory will be used at sites that define many <catalog> elements inside <data-access> in their site-local-config.xml compared to sites that define only one or two.

Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

Plotting the RSS and VSIZE from SimpleMemoryCheck periodic printouts for the cmsRun2 step for all the grid jobs reported behind the link gives

Going through the logs again, I see

  • 3d4adb73 was run at T2_UA_KIPT that specifies two catalogs
  • 21e35001 was run at T1_DE_KIT that specifies two catalogs
  • 46f875e1 was run at T1_FR_CCIN2P3 that specifies two catalogs
  • 111cd08f was run at T2_FR_GRIF that specifies two catalogs
  • 487de694 was run at T2_US_Wisconsin that specifies two catalogs
  • a809f51a was run at T1_ES_PIC that specifies six catalogs
  • daef1faf was run at T2_US_Vanderbilt that specifies three catalogs

T1_ES_PIC specifying six catalogs instead of two or three at least partly explains why the brown curve in #46975 (comment) is significantly higher than the others (interestingly, T2_US_Vanderbilt with three catalogs does not stand out from the rest with two catalogs).

@makortel
Contributor

I tested running the job with 3 streams and 4 threads, and got
[plot: RSS and VSIZE with 3 streams and 4 threads]

(note: compared to my earlier plots, the time axis starts around the beginning of the job rather than at the first MemoryCheck printout; the "end" is also confusing, because it is based on the last MemoryCheck printout).

The total job time increased by ~4 %, so the CPU efficiency hit shouldn't be too bad. The 3 streams / 4 threads configuration would seem like a reasonable workaround if these jobs need to be processed before the memory improvements get merged, backported to 14_0_X, and a new release is built.
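For reference, the workaround amounts to decoupling the number of streams from the number of threads in the cmsRun configuration; a minimal sketch (in production this would be set through the workflow parameters or cmsDriver's --nThreads/--nStreams options rather than by hand):

```python
# Sketch: keep 4 threads but limit the number of concurrent streams to 3,
# trading a few percent of CPU efficiency for a lower memory footprint.
# The process name here is a placeholder; in practice the existing cms.Process
# from the job's PSet.py would be modified instead of creating a new one.
import FWCore.ParameterSet.Config as cms

process = cms.Process("DIGI")
process.options.numberOfThreads = 4
process.options.numberOfStreams = 3
```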

@makortel
Contributor

And just to report the findings from IgProf MEM_LIVE after the 10th event, with one local pileup file (so it won't show the edm::InputFileCatalog contribution). The profile is here

Total allocated memory at the time of the dump was 2100 MB, divided mainly into

  • 690 MB in EventProcessor construction (link)
    • 440 MB in EDModule constructors (link)
    • 100 MB in Services (link)
      • 87 MB in TROOT::InitInterpreter() via TFileAdaptor (link)
    • 58 MB in ESModule constructors (link)
      • mostly from dlopen() (link)
    • 35 MB in ParameterSet registry (link)
    • 17 MB in PoolSource construction (https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue46975/test_07.10_live/605)
  • 790 MB in EventSetup (link, after subtracting non-ES functions)
    • 370 MB via ESSources (presumably mostly CondDBESSource, link)
      • 95 MB in SiPixel2DTemplateDBObject
      • 47 MB in SiPixelFEDChannelContainer
      • 42 MB in EcalCondObjectContainer<EcalPulseCovariance>
      • 21 MB in SiPixelQualityProbabilities
      • 11 MB in SiStripPedestals
      • 10 MB in SiStripNoises
    • 79 MB from DD4hep_VolumeBasedMagneticFieldESProducerFromDB (link)
    • 48 MB from alpaka_serial_sync::EcalMultifitConditionsHostESProducer (link)
    • 42 MB from CaloGeometryDBEP<EcalPreshowerGeometry, CaloGeometryDBReader> (link)
    • 31 MB from PixelFEDChannelCollectionProducer (link)
    • 22 MB from CaloGeometryDBEP<EcalBarrelGeometry, CaloGeometryDBReader> (link)
  • 37 MB in beginRun (link)
  • 1.4 MB in beginLumi (link)
  • 600 MB in event processing (for one particular event) (link)

@makortel
Contributor

I also looked at the total number of memory allocations (as an indication of memory churn). The profile showed a total of 143 million allocations, of which

  • 76 million allocations in event processing (7.6 million allocations per event) (link)
    • 16 M in PreMixingModule (via PreMixingModule::put, BMixingModule::produce())
      • 3.6 M in EcalDigiProducer::finalizeEvent() (link)
      • 3.3 M in HcalDigiProducer::finalizeEvent() (link)
      • 2.8 M in PreMixingSiPixelWorker::put() (link)
      • 1.5 M in PreMixingEcalWorker::addPileups() (link)
      • 1.2 M in PreMixingSiPixelWorker::addPileups() (link)
      • 1.0 M in PreMixingSiStripWorker::put() (link)
    • 8.1 M in MixingModule (link)
      • 4.6 M in SiPixelDigitizer::accumulate() (link)
      • 1.7 M in SiStripDigitizer::accumulate() (link)
      • 0.4 M in TrackingTruthAccumulator::accumulate() (link)
    • 6.8 M in CSCTriggerPrimitivesProducer (link)
    • 4.4 M in CkfTrackCandidateMakerBase (link)
    • 4.2 M in CSCDigiProducer (link)
    • 3.4 M in EcalSelectiveReadoutProducer (link)
    • 3.2 M in EcalTrigPrimProducer (link)
    • 1.9 M in PFClusterProducer (link)
    • 1.8 M in DeepTauId (link)
    • 1.6 M in HcalTrigPrimDigiProducer (link)
    • 1.6 M in L1TMuonEndCapTrackProducer (link)
    • 1.5 M in RPCDigiProducer (link)
    • 1.4 M in RecoTauProducer (link)
    • 1.1 M in SeedCreatorFromRegionHitsEDProducerT<SeedFromConsecutiveHitsCreator> (link)
    • 1.1 M in PFProducer (link)
    • 1.0 M in MuonIdProducer (link)
