Memory Issue with RunIII2024Summer24DRPremix #46975

Open
vlimant opened this issue Dec 17, 2024 · 18 comments

@vlimant
Contributor

vlimant commented Dec 17, 2024

Given the current interest in memory leaks (#46901), I am posting here a report of large memory usage in MC production in the Summer24 campaign.
Using this workflow as an example (there are several others with the same symptom): https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192
The error report https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192#GEN-RunIII2024Summer24wmLHEGS-00051_0 shows a good fraction of jobs going beyond 8 GB with 4 cores.
Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

and the cmsRun2 step indeed gets interrupted early: https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/cmsRun2-stdout.log

%MSG-w MemoryCheck:   alpaka_serial_sync::CAHitNtupletAlpakaPhase1:hltPixelTracksSoASerialSync  05-Dec-2024 08:00:37 UTC Run: 1 Event: 1556056481
MemoryCheck: module alpaka_serial_sync::CAHitNtupletAlpakaPhase1:hltPixelTracksSoASerialSync VSIZE 11337 0.09375 RSS 8032.2 -150.844
%MSG
Begin processing the 162nd record. Run 1, Event 1556056498, LumiSection 339751 on stream 3 at 05-Dec-2024 08:00:38.372 UTC
Begin processing the 163rd record. Run 1, Event 1556056512, LumiSection 339751 on stream 0 at 05-Dec-2024 08:00:38.818 UTC
Begin processing the 164th record. Run 1, Event 1556056550, LumiSection 339751 on stream 1 at 05-Dec-2024 08:00:39.121 UTC
Begin processing the 165th record. Run 1, Event 1556056562, LumiSection 339751 on stream 2 at 05-Dec-2024 08:00:39.599 UTC
%MSG-s ShutdownSignal:  PostProcessPath 05-Dec-2024 08:00:41 UTC  PostProcessEvent
an external signal was sent to shutdown the job early.
%MSG
05-Dec-2024 08:00:42 UTC  Closed file file:../cmsRun1/RAWSIMoutput.root

(The memory issue is not necessarily related to the module named in the last MemoryCheck line, though.)

Generic cmsDriver command:

cmsDriver.py step1 --fileout file:GEN-RunIII2024Summer24DRPremix-00048_step1.root  --pileup_input "dbs:/Neutrino_E-10_gun/RunIIISummer24PrePremix-Premixlib2024_140X_mcRun3_2024_realistic_v26-v1/PREMIX" --mc --eventcontent PREMIXRAW --datatier GEN-SIM-RAW --conditions 140X_mcRun3_2024_realistic_v26 --step DIGI,DATAMIX,L1,DIGI2RAW,HLT:2024v14 --procModifiers premix_stage2 --nThreads 4 --geometry DB:Extended --datamix PreMix --era Run3_2024 

and the configuration file: https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/09cd8d21fa53b131732b49d6e27a16d1/configFile

For that particular failed cmsRun2, the PSet is available as https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.pkl and https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.py.

Could someone look into this?

Side note, to be propagated to WMCore: I note that even though cmsRun2 was killed, the subsequent steps (cmsRun3, 4, 5) are run regardless, and given that the full job is marked as failed, their output will just be sent to the bin. I wonder how (in)efficient it is to keep running steps whose output will be tossed away.

@cmsbuild
Contributor

cmsbuild commented Dec 17, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @vlimant.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@vlimant changed the title from "Memory Issue with RunIII2024Summer24wmLHEGS" to "Memory Issue with RunIII2024Summer24DRPremix" on Dec 17, 2024
@srimanob
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

Plotting the RSS and VSIZE from SimpleMemoryCheck periodic printouts for the cmsRun2 step for all the grid jobs reported behind the link gives

[plot: RSS and VSIZE vs. time for the cmsRun2 step of all the grid jobs]

It seems to me that the overall memory footprint is too much for the 8 GB limit with 4 cores.
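For reference, a minimal sketch of how such RSS/VSIZE curves can be extracted from the cmsRun stdout logs, assuming the MemoryCheck line format shown in the excerpt above (this is not the exact script used for the plot):

```python
# Sketch: extract RSS and VSIZE (in MB) from the SimpleMemoryCheck printouts in a
# cmsRun stdout log and plot them, assuming lines of the form
# "MemoryCheck: module <label> VSIZE <val> <delta> RSS <val> <delta>"
# as in the log excerpt above.
import re
import sys

import matplotlib.pyplot as plt

pattern = re.compile(r"MemoryCheck: module .* VSIZE\s+([\d.]+)\s+\S+\s+RSS\s+([\d.]+)")

vsize, rss = [], []
with open(sys.argv[1]) as log:  # e.g. cmsRun2-stdout.log
    for line in log:
        m = pattern.search(line)
        if m:
            vsize.append(float(m.group(1)))
            rss.append(float(m.group(2)))

plt.plot(vsize, label="VSIZE [MB]")
plt.plot(rss, label="RSS [MB]")
plt.axhline(8000, linestyle="--", color="gray", label="8 GB job limit")
plt.xlabel("MemoryCheck printout")
plt.ylabel("memory [MB]")
plt.legend()
plt.show()
```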

@vlimant
Contributor Author

vlimant commented Dec 17, 2024

We would probably need the same for a successful job then, to figure out why the production is not failing all over. There must be something specific to the ~3% of input files (the overall failure rate of the workflow) that makes those jobs start so high in memory usage.

@makortel
Contributor

makortel commented Dec 18, 2024

Is there any correlation between failures and sites?

@makortel
Contributor

I ran the job from a809f51a-6d7c-4c21-8412-518d2b9a3a56-433-0-logArchive (brown curve in #46975 (comment), which had the highest memory usage) locally. I had run step 1 (GEN-SIM) locally, and copied the 4 PREMIX files locally. The mixing of the pileup PREMIX events into the signal GEN-SIM is nevertheless likely to be different from the job in production. The job used the same 4 threads and streams as in production. I ran the test on a node running el8 natively (this can be relevant because we have seen elsewhere that el9 hosts lead to higher RSS/VSIZE).

I got this
[plot: RSS and VSIZE of the local rerun with locally copied pileup files]

I then reran the job reading the same files over xrootd, and again without modifying the pileup file list, and got this
[plot: RSS and VSIZE when reading the same pileup files over xrootd, and with the unmodified pileup file list]

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

@makortel
Contributor

makortel commented Dec 18, 2024

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

This behavior is reproducible with 1 thread (extra cost being around 900 MB), and is visible at the level of memory allocations (e.g. with MaxMemoryPreload AllocMonitor).

@makortel
Contributor

In the full pileup file list case, the PreMixingModule + EmbeddedRootSource close the files opened at module construction time and open new files, and it seems like something there can lead to ~2 GB higher memory usage when all(?) pileup files are being used.

This behavior is reproducible with 1 thread (extra cost being around 900 MB), and is visible at the level of memory allocations (e.g. with MaxMemoryPreload AllocMonitor).

I think I found the culprit. Comparing IgProf live profiles after the 10th event (running with 1 thread) between 1 local pileup file and all xrootd pileup files shows a 488 MB increase (per stream!) in the edm::InputFileCatalog constructor via EmbeddedRootSource.

The job was configured with 499465 pileup files, translating to ~1025 bytes per file.
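A quick back-of-the-envelope check of that per-file figure (treating the 488 MB as MiB):

```python
# Back-of-the-envelope: 488 MB per-stream increase spread over 499465 pileup files
extra_bytes = 488 * 1024 * 1024
n_files = 499465
print(extra_bytes / n_files)  # ~1024.5 bytes per file
```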

On a quick look I see some potential (and confusing) duplication of file name data in the InputFileCatalog, but improving anything there needs to be done with care.

Another quick thought would be to avoid replicating the InputFileCatalog across streams (every stream sees the same files in any case). This approach would require some moderately complex refactoring in the BMixingModule base class, the PileUp helper class, and EmbeddedRootSource.

I also can't avoid asking if the scale of 499465 pileup files is really something that every job has to see?

@dan131riley

Another quick thought would to be to avoid replicating the InputFileCatalog across streams (every stream sees the same files in any case). This approach would require some moderately complex refactoring in BMixingModule base class, PileUp helper class, and EmbeddedRootSource.

There are some other changes I'd like to make to EmbeddedRootSource to share information across streams (e.g., the mapping from the file identifier to the filename ought to be cached and shared). Refactoring the InputFileCatalog would be a natural fit.

@makortel
Contributor

Is there any correlation between failures and sites?

Now that a cause is known, there can be a site dependence in the memory usage. The InputFileCatalog constructor resolves all PFNs for each LFN, so more memory will be used at sites that define many <catalog> elements inside <data-access> in their site-local-config.xml compared to sites that define only one or two.
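As a rough way to check a given site, one can count the <catalog> entries in its site-local-config.xml; a minimal sketch (element names follow the description above; the example CVMFS path in the comment is an assumption and may differ per deployment):

```python
# Sketch: count the <catalog> elements under <data-access> in a site-local-config.xml,
# as a rough proxy for how many PFNs the InputFileCatalog resolves per LFN at that site.
import sys
import xml.etree.ElementTree as ET

# e.g. /cvmfs/cms.cern.ch/SITECONF/T1_ES_PIC/JobConfig/site-local-config.xml
path = sys.argv[1]
catalogs = ET.parse(path).getroot().findall(".//data-access/catalog")
print(f"{path}: {len(catalogs)} catalog element(s)")
```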

@vlimant
Contributor Author

vlimant commented Dec 19, 2024

nice!

@makortel
Contributor

makortel commented Dec 19, 2024

On a quick look I see some potential (and confusing) duplication of file name data in the InputFileCatalog, but improving anything there needs to be done with care.

#47013 removes the duplication (really triplication) of file name data; that seemed simple enough to be done quickly and to be backported to 14_0_X and 14_1_X. MaxMemoryPreload showed a 197 MB reduction per stream, so on a 4-thread/4-stream job that would translate to ~787 MB. On a local test at CERN the RSS and VSIZE decreased like this
[plot: RSS and VSIZE before and after #47013 in the local test at CERN]

There is more potential (like ~900 MB on 4 streams) with #46975 (comment), but that will take more time. I hope #47013 will at least allow more jobs to stay under the memory limit, if not avoid all the failures.

I can think of further memory reduction options in InputFileCatalog as well, but those would involve less clear tradeoffs than #47013, and should be followed up separately.

@makortel
Contributor

Is there any correlation between failures and sites?

Now that a cause is known, there can be a site dependence in the memory usage. The InputFileCatalog constructor resolves all PFNs for each LFN, so more memory will be used at sites that define many <catalog> elements inside <data-access> in their site-local-config.xml compared to sites that define only one or two.

Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/

Plotting the RSS and VSIZE from SimpleMemoryCheck periodic printouts for the cmsRun2 step for all the grid jobs reported behind the link gives

Going through the logs again, I see

  • 3d4adb73 was run at T2_UA_KIPT that specifies two catalogs
  • 21e35001 was run at T1_DE_KIT that specifies two catalogs
  • 46f875e1 was run at T1_FR_CCIN2P3 that specifies two catalogs
  • 111cd08f was run at T2_FR_GRIF that specifies two catalogs
  • 487de694 was run at T2_US_Wisconsin that specifies two catalogs
  • a809f51a was run at T1_ES_PIC that specifies six catalogs
  • daef1faf was run at T2_US_Vanderbilt that specifies three catalogs

T1_ES_PIC specifying six catalogs instead of two or three at least partly explains why the brown curve in #46975 (comment) is significantly higher than the others (interestingly, T2_US_Vanderbilt with three catalogs does not stand out from the rest with two catalogs).

@makortel
Contributor

I tested running the job with 3 streams and 4 threads, and got
[plot: RSS and VSIZE with 3 streams and 4 threads]

(note: compared to my earlier plots, the time axis starts around the beginning of the job rather than at the first MemoryCheck printout; the "end" is also confusing, because it is based on the last MemoryCheck printout).

The total job time increased by ~4 %, so the CPU efficiency hit shouldn't be too bad. The 3 streams / 4 threads configuration would seem like a reasonable workaround if these jobs need to be processed before the memory improvements get merged, backported to 14_0_X, and a new release is built.
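For reference, the workaround amounts to decoupling the number of streams from the number of threads in the cmsRun configuration; a minimal sketch (in production this would be set through the workflow parameters or cmsDriver's --nThreads/--nStreams options rather than by hand):

```python
# Sketch: keep 4 threads but limit the number of concurrent streams to 3,
# trading a few percent of CPU efficiency for a lower memory footprint.
# The process name here is a placeholder; in practice the existing cms.Process
# from the job's PSet.py would be modified instead of creating a new one.
import FWCore.ParameterSet.Config as cms

process = cms.Process("DIGI")
process.options.numberOfThreads = 4
process.options.numberOfStreams = 3
```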

@makortel
Contributor

And just to report the findings from IgProf MEM_LIVE after the 10th event, with one local pileup file (so it won't show the edm::InputFileCatalog contribution). The profile is here

Total allocated memory at the time of the dump was 2100 MB, divided mainly into

  • 690 MB in EventProcessor construction (link)
    • 440 MB in EDModule constructors (link)
    • 100 MB in Services (link)
      • 87 MB in TROOT::InitInterpreter() via TFileAdaptor (link)
    • 58 MB in ESModule constructors (link)
      • mostly from dlopen() (link)
    • 35 MB in ParameterSet registry (link)
    • 17 MB in PoolSource construction (https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue46975/test_07.10_live/605)
  • 790 MB in EventSetup (link, after subtracting non-ES functions)
    • 370 MB via ESSources (presumably mostly CondDBESSource, link)
      • 95 MB in SiPixel2DTemplateDBObject
      • 47 MB in SiPixelFEDChannelContainer
      • 42 MB in EcalCondObjectContainer<EcalPulseCovariance>
      • 21 MB in SiPixelQualityProbabilities
      • 11 MB in SiStripPedestals
      • 10 MB in SiStripNoises
    • 79 MB from DD4hep_VolumeBasedMagneticFieldESProducerFromDB (link)
    • 48 MB from alpaka_serial_sync::EcalMultifitConditionsHostESProducer (link)
    • 42 MB from CaloGeometryDBEP<EcalPreshowerGeometry, CaloGeometryDBReader> (link)
    • 31 MB from PixelFEDChannelCollectionProducer (link)
    • 22 MB from CaloGeometryDBEP<EcalBarrelGeometry, CaloGeometryDBReader> (link)
  • 37 MB in beginRun (link)
  • 1.4 MB in beginLumi (link)
  • 600 MB in event processing (for one particular event) (link)

@makortel
Contributor

I also looked at the total number of memory allocations (as an indication of memory churn). The profile showed a total of 143 million allocations, of which

  • 76 million allocations in event processing (7.6 million allocations per event) (link)
    • 16 M in PreMixingModule (via PreMixingModule::put, BMixingModule::produce())
      • 3.6 M in EcalDigiProducer::finalizeEvent() (link)
      • 3.3 M in HcalDigiProducer::finalizeEvent() (link)
      • 2.8 M in PreMixingSiPixelWorker::put() (link)
      • 1.5 M in PreMixingEcalWorker::addPileups() (link)
      • 1.2 M in PreMixingSiPixelWorker::addPileups() (link)
      • 1.0 M in PreMixingSiStripWorker::put() (link)
    • 8.1 M in MixingModule (link)
      • 4.6 M in SiPixelDigitizer::accumulate() (link)
      • 1.7 M in SiStripDigitizer::accumulate() (link)
      • 0.4 M in TrackingTruthAccumulator::accumulate() (link)
    • 6.8 M in CSCTriggerPrimitivesProducer (link)
    • 4.4 M in CkfTrackCandidateMakerBase (link)
    • 4.2 M in CSCDigiProducer (link)
    • 3.4 M in EcalSelectiveReadoutProducer (link)
    • 3.2 M in EcalTrigPrimProducer (link)
    • 1.9 M in PFClusterProducer (link)
    • 1.8 M in DeepTauId (link)
    • 1.6 M in HcalTrigPrimDigiProducer (link)
    • 1.6 M in L1TMuonEndCapTrackProducer (link)
    • 1.5 M in RPCDigiProducer (link)
    • 1.4 M in RecoTauProducer (link)
    • 1.1 M in SeedCreatorFromRegionHitsEDProducerT<SeedFromConsecutiveHitsCreator> (link)
    • 1.1 M in PFProducer (link)
    • 1.0 M in MuonIdProducer (link)
