
Memory consumption higher than 2.5 GB #1725

Open
rahmans1 opened this issue Jan 31, 2025 · 7 comments

@rahmans1 (Contributor)

Environment: (where does this bug occur, have you tried other environments)

  • Which branch (often main for latest released): 25.01.1
  • Which version (or HEAD for the most recent on git): HEAD
  • Any specific OS or system where the issue occurs? OSG
  • Any special versions of ROOT or Geant4? No

Steps to reproduce: (give a step by step account of how to trigger the bug)

A significant number of jobs in the pythia6 dataset require more than 2.5 GB of memory.

OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
rahmans1 25.01.1/epic_craterlake/pythia6-eic-1.0.0_5x41_q2_0to1_ep_noradcor- 1/30 13:56 311 2 500 187 16700 25202904.7-999

Out of the 311 + 187 = 498 jobs completed or on hold, 247 had a peak RSS greater than 2.5 GB.

[rahmans1@ap23 osg_25202904_errors]$ for log in osg_25202904_*.log; do awk '/ResidentSetSize/ { if ($1 > max) max=$1 } END { print "Peak RSS for", FILENAME ":", max " KB" }' "$log"; done

Peak RSS for osg_25202904_100.log: 2644292 KB
Peak RSS for osg_25202904_101.log: 2633652 KB
Peak RSS for osg_25202904_102.log: 2625872 KB
Peak RSS for osg_25202904_103.log: 2637752 KB
Peak RSS for osg_25202904_104.log: 2625772 KB
Peak RSS for osg_25202904_107.log: 2631856 KB
Peak RSS for osg_25202904_108.log: 2633024 KB
Peak RSS for osg_25202904_109.log: 2635700 KB
Peak RSS for osg_25202904_10.log: 2619404 KB
Peak RSS for osg_25202904_110.log: 2629508 KB
Peak RSS for osg_25202904_111.log: 2622676 KB
Peak RSS for osg_25202904_112.log: 2646644 KB
Peak RSS for osg_25202904_113.log: 2625180 KB
Peak RSS for osg_25202904_114.log: 2600980 KB
Peak RSS for osg_25202904_115.log: 2622536 KB
Peak RSS for osg_25202904_116.log: 2630968 KB
Peak RSS for osg_25202904_117.log: 2636600 KB
Peak RSS for osg_25202904_118.log: 2609236 KB
Peak RSS for osg_25202904_119.log: 2627756 KB
Peak RSS for osg_25202904_11.log: 2624416 KB
Peak RSS for osg_25202904_120.log: 2637768 KB
Peak RSS for osg_25202904_121.log: 2639696 KB
Peak RSS for osg_25202904_122.log: 2625084 KB
Peak RSS for osg_25202904_124.log: 2630588 KB
Peak RSS for osg_25202904_126.log: 2613000 KB
Peak RSS for osg_25202904_12.log: 2625276 KB
Peak RSS for osg_25202904_139.log: 2641532 KB
Peak RSS for osg_25202904_13.log: 2618640 KB
Peak RSS for osg_25202904_144.log: 2623116 KB
Peak RSS for osg_25202904_145.log: 2615744 KB
Peak RSS for osg_25202904_146.log: 2640736 KB
Peak RSS for osg_25202904_147.log: 2782280 KB
Peak RSS for osg_25202904_148.log: 2628732 KB
Peak RSS for osg_25202904_149.log: 2699336 KB
Peak RSS for osg_25202904_150.log: 2622788 KB
Peak RSS for osg_25202904_151.log: 2619580 KB
Peak RSS for osg_25202904_152.log: 2680948 KB
Peak RSS for osg_25202904_153.log: 2600524 KB
Peak RSS for osg_25202904_154.log: 2623232 KB
Peak RSS for osg_25202904_155.log: 2622708 KB
Peak RSS for osg_25202904_157.log: 2784176 KB
Peak RSS for osg_25202904_158.log: 2790132 KB
Peak RSS for osg_25202904_15.log: 2629224 KB
Peak RSS for osg_25202904_160.log: 2608848 KB
Peak RSS for osg_25202904_161.log: 2610860 KB
Peak RSS for osg_25202904_163.log: 2629564 KB
Peak RSS for osg_25202904_164.log: 2636156 KB
Peak RSS for osg_25202904_165.log: 2622172 KB
Peak RSS for osg_25202904_166.log: 2625320 KB
Peak RSS for osg_25202904_16.log: 2628268 KB
Peak RSS for osg_25202904_17.log: 2630448 KB
Peak RSS for osg_25202904_186.log: 2739176 KB
Peak RSS for osg_25202904_188.log: 2607300 KB
Peak RSS for osg_25202904_189.log: 2638620 KB
Peak RSS for osg_25202904_18.log: 2634628 KB
Peak RSS for osg_25202904_190.log: 2627728 KB
Peak RSS for osg_25202904_191.log: 2633696 KB
Peak RSS for osg_25202904_192.log: 2623460 KB
Peak RSS for osg_25202904_193.log: 2628712 KB
Peak RSS for osg_25202904_195.log: 2607912 KB
Peak RSS for osg_25202904_196.log: 2622680 KB
Peak RSS for osg_25202904_197.log: 2624984 KB
Peak RSS for osg_25202904_198.log: 2620424 KB
Peak RSS for osg_25202904_19.log: 2608828 KB
Peak RSS for osg_25202904_208.log: 2725344 KB
Peak RSS for osg_25202904_209.log: 2644248 KB
Peak RSS for osg_25202904_20.log: 2626708 KB
Peak RSS for osg_25202904_212.log: 2625444 KB
Peak RSS for osg_25202904_213.log: 2622260 KB
Peak RSS for osg_25202904_214.log: 2613604 KB
Peak RSS for osg_25202904_215.log: 2639060 KB
Peak RSS for osg_25202904_216.log: 2626600 KB
Peak RSS for osg_25202904_217.log: 2630284 KB
Peak RSS for osg_25202904_218.log: 2633208 KB
Peak RSS for osg_25202904_21.log: 2421876 KB
Peak RSS for osg_25202904_224.log: 2621932 KB
Peak RSS for osg_25202904_225.log: 2627532 KB
Peak RSS for osg_25202904_226.log: 2625416 KB
Peak RSS for osg_25202904_227.log: 2619584 KB
Peak RSS for osg_25202904_228.log: 2638320 KB
Peak RSS for osg_25202904_229.log: 2626812 KB
Peak RSS for osg_25202904_22.log: 2633048 KB
Peak RSS for osg_25202904_230.log: 2717336 KB
Peak RSS for osg_25202904_232.log: 2636296 KB
Peak RSS for osg_25202904_23.log: 2625616 KB
Peak RSS for osg_25202904_240.log: 2616592 KB
Peak RSS for osg_25202904_241.log: 2617680 KB
Peak RSS for osg_25202904_242.log: 2609648 KB
Peak RSS for osg_25202904_243.log: 2626540 KB
Peak RSS for osg_25202904_244.log: 2634596 KB
Peak RSS for osg_25202904_245.log: 2592764 KB
Peak RSS for osg_25202904_246.log: 2595708 KB
Peak RSS for osg_25202904_247.log: 2630776 KB
Peak RSS for osg_25202904_248.log: 2608412 KB
Peak RSS for osg_25202904_249.log: 2638788 KB
Peak RSS for osg_25202904_250.log: 2712960 KB
Peak RSS for osg_25202904_251.log: 2628172 KB
Peak RSS for osg_25202904_252.log: 2587732 KB
Peak RSS for osg_25202904_253.log: 2640792 KB
Peak RSS for osg_25202904_254.log: 2630324 KB
Peak RSS for osg_25202904_255.log: 2629340 KB
Peak RSS for osg_25202904_256.log: 2630292 KB
Peak RSS for osg_25202904_257.log: 2616168 KB
Peak RSS for osg_25202904_258.log: 2628708 KB
Peak RSS for osg_25202904_259.log: 2627540 KB
Peak RSS for osg_25202904_261.log: 2627336 KB
Peak RSS for osg_25202904_262.log: 2627164 KB
Peak RSS for osg_25202904_265.log: 2637364 KB
Peak RSS for osg_25202904_266.log: 2627484 KB
Peak RSS for osg_25202904_268.log: 2605808 KB
Peak RSS for osg_25202904_284.log: 2702160 KB
Peak RSS for osg_25202904_285.log: 2619128 KB
Peak RSS for osg_25202904_286.log: 2644776 KB
Peak RSS for osg_25202904_287.log: 2602508 KB
Peak RSS for osg_25202904_288.log: 2625956 KB
Peak RSS for osg_25202904_289.log: 2607972 KB
Peak RSS for osg_25202904_28.log: 2627988 KB
Peak RSS for osg_25202904_290.log: 2613232 KB
Peak RSS for osg_25202904_291.log: 2631024 KB
Peak RSS for osg_25202904_292.log: 2650304 KB
Peak RSS for osg_25202904_293.log: 2626076 KB
Peak RSS for osg_25202904_295.log: 2633088 KB
Peak RSS for osg_25202904_296.log: 2613260 KB
Peak RSS for osg_25202904_297.log: 2623460 KB
Peak RSS for osg_25202904_298.log: 2602964 KB
Peak RSS for osg_25202904_299.log: 2623524 KB
Peak RSS for osg_25202904_29.log: 2603584 KB
Peak RSS for osg_25202904_300.log: 2636844 KB
Peak RSS for osg_25202904_302.log: 2630032 KB
Peak RSS for osg_25202904_303.log: 2639240 KB
Peak RSS for osg_25202904_304.log: 2625792 KB
Peak RSS for osg_25202904_305.log: 2611820 KB
Peak RSS for osg_25202904_306.log: 2601408 KB
Peak RSS for osg_25202904_307.log: 2600168 KB
Peak RSS for osg_25202904_30.log: 2627348 KB
Peak RSS for osg_25202904_313.log: 2624020 KB
Peak RSS for osg_25202904_314.log: 2615496 KB
Peak RSS for osg_25202904_315.log: 2616196 KB
Peak RSS for osg_25202904_316.log: 2606972 KB
Peak RSS for osg_25202904_317.log: 2631924 KB
Peak RSS for osg_25202904_318.log: 2620304 KB
Peak RSS for osg_25202904_319.log: 2627840 KB
Peak RSS for osg_25202904_31.log: 2631068 KB
Peak RSS for osg_25202904_320.log: 2632364 KB
Peak RSS for osg_25202904_321.log: 2642780 KB
Peak RSS for osg_25202904_322.log: 2610708 KB
Peak RSS for osg_25202904_323.log: 2628104 KB
Peak RSS for osg_25202904_324.log: 2623936 KB
Peak RSS for osg_25202904_32.log: 2625276 KB
Peak RSS for osg_25202904_335.log: 2631228 KB
Peak RSS for osg_25202904_336.log: 2626920 KB
Peak RSS for osg_25202904_338.log: 2628280 KB
Peak RSS for osg_25202904_339.log: 2615168 KB
Peak RSS for osg_25202904_33.log: 2632248 KB
Peak RSS for osg_25202904_340.log: 2631144 KB
Peak RSS for osg_25202904_341.log: 2615920 KB
Peak RSS for osg_25202904_342.log: 2615192 KB
Peak RSS for osg_25202904_343.log: 2633472 KB
Peak RSS for osg_25202904_344.log: 2619700 KB
Peak RSS for osg_25202904_345.log: 2617624 KB
Peak RSS for osg_25202904_346.log: 2635236 KB
Peak RSS for osg_25202904_349.log: 2622100 KB
Peak RSS for osg_25202904_34.log: 2629084 KB
Peak RSS for osg_25202904_35.log: 2628852 KB
Peak RSS for osg_25202904_36.log: 2623440 KB
Peak RSS for osg_25202904_37.log: 2626420 KB
Peak RSS for osg_25202904_38.log: 2632200 KB
Peak RSS for osg_25202904_391.log: 2625720 KB
Peak RSS for osg_25202904_392.log: 2700840 KB
Peak RSS for osg_25202904_393.log: 2637124 KB
Peak RSS for osg_25202904_395.log: 2622940 KB
Peak RSS for osg_25202904_396.log: 2622904 KB
Peak RSS for osg_25202904_397.log: 2627188 KB
Peak RSS for osg_25202904_398.log: 2608212 KB
Peak RSS for osg_25202904_399.log: 2626264 KB
Peak RSS for osg_25202904_400.log: 2629332 KB
Peak RSS for osg_25202904_401.log: 2644496 KB
Peak RSS for osg_25202904_402.log: 2634736 KB
Peak RSS for osg_25202904_403.log: 2631556 KB
Peak RSS for osg_25202904_404.log: 2625024 KB
Peak RSS for osg_25202904_405.log: 2617392 KB
Peak RSS for osg_25202904_406.log: 2629652 KB
Peak RSS for osg_25202904_407.log: 2625912 KB
Peak RSS for osg_25202904_409.log: 2624172 KB
Peak RSS for osg_25202904_40.log: 2616604 KB
Peak RSS for osg_25202904_413.log: 2622788 KB
Peak RSS for osg_25202904_436.log: 2625876 KB
Peak RSS for osg_25202904_437.log: 2636116 KB
Peak RSS for osg_25202904_438.log: 2626852 KB
Peak RSS for osg_25202904_439.log: 2663172 KB
Peak RSS for osg_25202904_440.log: 2614152 KB
Peak RSS for osg_25202904_441.log: 2635120 KB
Peak RSS for osg_25202904_442.log: 2633364 KB
Peak RSS for osg_25202904_443.log: 2634956 KB
Peak RSS for osg_25202904_444.log: 2621944 KB
Peak RSS for osg_25202904_445.log: 2629872 KB
Peak RSS for osg_25202904_446.log: 2618224 KB
Peak RSS for osg_25202904_447.log: 2613692 KB
Peak RSS for osg_25202904_448.log: 2627920 KB
Peak RSS for osg_25202904_449.log: 2627552 KB
Peak RSS for osg_25202904_450.log: 2637964 KB
Peak RSS for osg_25202904_451.log: 2623312 KB
Peak RSS for osg_25202904_452.log: 2629712 KB
Peak RSS for osg_25202904_453.log: 2622400 KB
Peak RSS for osg_25202904_457.log: 2617028 KB
Peak RSS for osg_25202904_458.log: 2632260 KB
Peak RSS for osg_25202904_459.log: 2621640 KB
Peak RSS for osg_25202904_461.log: 2622868 KB
Peak RSS for osg_25202904_462.log: 2628236 KB
Peak RSS for osg_25202904_463.log: 2641500 KB
Peak RSS for osg_25202904_464.log: 2627864 KB
Peak RSS for osg_25202904_465.log: 2632244 KB
Peak RSS for osg_25202904_466.log: 2631156 KB
Peak RSS for osg_25202904_467.log: 2623392 KB
Peak RSS for osg_25202904_469.log: 2623260 KB
Peak RSS for osg_25202904_474.log: 2416164 KB
Peak RSS for osg_25202904_475.log: 2636284 KB
Peak RSS for osg_25202904_476.log: 2623360 KB
Peak RSS for osg_25202904_477.log: 2589000 KB
Peak RSS for osg_25202904_47.log: 2621804 KB
Peak RSS for osg_25202904_481.log: 2718756 KB
Peak RSS for osg_25202904_484.log: 2611760 KB
Peak RSS for osg_25202904_485.log: 2621916 KB
Peak RSS for osg_25202904_486.log: 2599576 KB
Peak RSS for osg_25202904_488.log: 2648196 KB
Peak RSS for osg_25202904_489.log: 2629284 KB
Peak RSS for osg_25202904_48.log: 2500956 KB
Peak RSS for osg_25202904_490.log: 2624380 KB
Peak RSS for osg_25202904_491.log: 2626716 KB
Peak RSS for osg_25202904_492.log: 2636212 KB
Peak RSS for osg_25202904_493.log: 2620636 KB
Peak RSS for osg_25202904_494.log: 2620460 KB
Peak RSS for osg_25202904_495.log: 2622148 KB
Peak RSS for osg_25202904_496.log: 2633912 KB
Peak RSS for osg_25202904_498.log: 2782436 KB
Peak RSS for osg_25202904_499.log: 2760256 KB
Peak RSS for osg_25202904_61.log: 2504656 KB
Peak RSS for osg_25202904_63.log: 2560184 KB
Peak RSS for osg_25202904_64.log: 2634648 KB
Peak RSS for osg_25202904_67.log: 2628028 KB
Peak RSS for osg_25202904_68.log: 2621624 KB
Peak RSS for osg_25202904_69.log: 2626144 KB
Peak RSS for osg_25202904_7.log: 2627608 KB
Peak RSS for osg_25202904_92.log: 2622344 KB
Peak RSS for osg_25202904_94.log: 2644336 KB
Peak RSS for osg_25202904_96.log: 2631528 KB
Peak RSS for osg_25202904_98.log: 2642816 KB
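
(As a cross-check, a minimal sketch in the same spirit as the loop above, assuming the same HTCondor event-log layout, that directly counts how many logs peak above the 2.5 GB = 2621440 KB threshold:)

for log in osg_25202904_*.log; do
  # same extraction as above, but only name the file if its peak exceeds 2.5 GB
  awk '/ResidentSetSize/ { if ($1 > max) max = $1 } END { if (max > 2621440) print FILENAME }' "$log"
done | wc -l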

Expected Result: (what do you expect when you execute the steps above)

Jobs stay within the requested 2.5 GB memory limit.

Actual Result: (what do you get when you execute the steps above)

247 of the 498 completed or held jobs show a peak RSS above 2.5 GB and end up on hold.

@wdconinc (Contributor)

@nathanwbrei Does JANA2 have any way of accounting for memory use by factory and service?

@rahmans1 (Contributor, Author)

Adding to the previous post: this happens for other datasets too, and the frequency depends on the dataset. DIS NC 18x275 (JOB ID 1837), for example, has 10220 total jobs, and the memory-exceeded alarm shows up in the logs 1571 times, or 1571/10220 = 15% of the time (reruns are ignored here, so this is an overestimate).

Electron beamgas (JOB ID 1850), on the other hand, has 2756 total jobs, and the error shows up 1312 times, or 1312/2756 = 47% of the time.

We need to bring this under 2-3%, at least for the larger datasets, because requesting more memory may mean longer wait times in the queue.

-- Schedd: osg-eic.jlab.org : <129.57.198.181:9615?... @ 01/31/25 11:40:07
OWNER   BATCH_NAME                                                                                                 SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
eicprod 25.01.1/epic_craterlake/DIS_NC_18x275_minQ2=1-2025-01-18T00:43-05:00.csv                                  1/18 00:43  10204      2      _     14  10220 1837.141-9881
eicprod 25.01.1/epic_craterlake/DIS_NC_18x275_minQ2=10-2025-01-18T00:44-05:00.csv                                 1/18 00:44  10825      1      _     94  10920 1838.250-9933
eicprod 25.01.1/epic_craterlake/DIS_NC_5x41_minQ2=1-2025-01-18T08:56-05:00.csv                                    1/18 08:56   2916      _      _      2   2918 1841.1067-1244
eicprod 25.01.1/epic_craterlake/DIS_NC_5x41_minQ2=10-2025-01-18T08:56-05:00.csv                                   1/18 08:56   3447      _      _      3   3450 1842.2207-3263
eicprod 25.01.1/epic_craterlake/DIS_NC_5x41_minQ2=100-2025-01-18T08:56-05:00.csv                                  1/18 08:56   4257      1      _      2   4260 1843.243-2606
eicprod 25.01.1/epic_craterlake/GETaLM1.0.0-1.0_ElectronBeamGas_10GeV_foam_emin10keV-2025-01-21T19:37-05:00.csv   1/21 19:37   2756      _      _     10   2766 1850.164-2487
eicprod 25.01.1/epic_craterlake/pythia8.306-1.0_ProtonBeamGas_100GeV-2025-01-21T19:38-05:00.csv                   1/21 19:38   4877      _      _     23   4900 1851.794-4843
eicprod 25.01.1/epic_craterlake/pythia8.306-1.0_ProtonBeamGas_275GeV-2025-01-21T19:39-05:00.csv                   1/21 19:39  12764      _      _    136  12900 1852.301-12758
eicprod 25.01.1/epic_craterlake/e--2025-01-29T08:17-05:00.csv                                                     1/29 08:17   1234      _      _     66   1300 1854.38-1126
eicprod 25.01.1/epic_craterlake/e+-2025-01-29T08:18-05:00.csv                                                     1/29 08:18   1426      _      _    109   1535 1855.42-1462
eicprod 25.01.1/epic_craterlake/gamma-2025-01-29T08:18-05:00.csv                                                  1/29 08:18   2824      _      _    116   2940 1856.0-2923
eicprod 25.01.1/epic_craterlake/kaon--2025-01-29T08:18-05:00.csv                                                  1/29 08:18    730      _      _     67    797 1857.9-766
eicprod 25.01.1/epic_craterlake/kaon+-2025-01-29T08:18-05:00.csv                                                  1/29 08:18    887      _      _     86    973 1858.61-903
eicprod 25.01.1/epic_craterlake/pi--2025-01-29T08:18-05:00.csv                                                    1/29 08:18    625      _      _     78    703 1859.33-696
eicprod 25.01.1/epic_craterlake/pi0-2025-01-29T08:18-05:00.csv                                                    1/29 08:18   2563      _      _    329   2892 1860.38-2886
eicprod 25.01.1/epic_craterlake/pi+-2025-01-29T08:19-05:00.csv                                                    1/29 08:19    660      _      _     91    751 1861.16-676
eicprod 25.01.1/epic_craterlake/proton-2025-01-29T08:19-05:00.csv                                                 1/29 08:19    599      _      _     32    631 1862.25-612
eicprod 25.01.1/epic_craterlake/neutron-2025-01-29T08:19-05:00.csv                                                1/29 08:19    417      _      _     65    482 1863.17-280

Total for query: 1327 jobs; 0 completed, 0 removed, 0 idle, 4 running, 1323 held, 0 suspended
Total for eicprod: 1327 jobs; 0 completed, 0 removed, 0 idle, 4 running, 1323 held, 0 suspended
Total for all users: 1327 jobs; 0 completed, 0 removed, 0 idle, 4 running, 1323 held, 0 suspended

[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1837/*.log | awk -F":" '{print $1}' | uniq | wc -l
1571
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1838/*.log | awk -F":" '{print $1}' | uniq | wc -l
1070
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1841/*.log | awk -F":" '{print $1}' | uniq | wc -l
76
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1842/*.log | awk -F":" '{print $1}' | uniq | wc -l
173
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1843/*.log | awk -F":" '{print $1}' | uniq | wc -l
255
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1850/*.log | awk -F":" '{print $1}' | uniq | wc -l
1312
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1851/*.log | awk -F":" '{print $1}' | uniq | wc -l
934
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1852/*.log | awk -F":" '{print $1}' | uniq | wc -l
1960
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1854/*.log | awk -F":" '{print $1}' | uniq | wc -l
58
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1855/*.log | awk -F":" '{print $1}' | uniq | wc -l
104
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1856/*.log | awk -F":" '{print $1}' | uniq | wc -l
176
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1857/*.log | awk -F":" '{print $1}' | uniq | wc -l
2
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1858/*.log | awk -F":" '{print $1}' | uniq | wc -l
58
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1859/*.log | awk -F":" '{print $1}' | uniq | wc -l
77
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1860/*.log | awk -F":" '{print $1}' | uniq | wc -l
78
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1861/*.log | awk -F":" '{print $1}' | uniq | wc -l
42
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1862/*.log | awk -F":" '{print $1}' | uniq | wc -l
25
[eicprod@osg-eic CONDOR]$ grep -r "memory usage exceeded" osg_1863/*.log | awk -F":" '{print $1}' | uniq | wc -l
60
[eicprod@osg-eic CONDOR]$
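
(Side note: the pipeline above counts each log file at most once, since matches from the same file are consecutive before uniq reaches them; grep -l should give the same per-dataset file counts more directly, e.g. for the first dataset:)

grep -rl "memory usage exceeded" osg_1837/*.log | wc -l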

@wdconinc (Contributor)

For jobs that succeed, does prmon show leaky behavior? e.g. rising memory use over the course of a job?
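
(A minimal way to eyeball this, assuming prmon's default prmon.txt time-series output: whitespace-separated samples with a header row that names an rss column, values in kB:)

awk '
  # locate the rss column by name in the header row
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "rss") col = i; next }
  # print wall-clock time vs resident set size for each sample
  { print $1, $col }
' prmon.txt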

@kkauder (Contributor)

kkauder commented Jan 31, 2025

Just as a side note, the number of status==1 particles in the input isn't excessive (<~60).
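
(For what it's worth, a minimal sketch of how to check this, assuming HepMC3 ASCII input where particle lines start with "P" and carry the status code as their tenth field; input.hepmc is a hypothetical stand-in for the real file:)

awk '
  # new event record: report the status==1 count of the previous event
  $1 == "E" { if (started) print n; n = 0; started = 1 }
  # count final-state (status==1) particles in the current event
  $1 == "P" && $10 == 1 { n++ }
  END { if (started) print n }
' input.hepmc | sort -n | tail -1

(This prints the largest per-event count of status==1 particles in the file.)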

@rahmans1 (Contributor, Author)

rahmans1 commented Feb 2, 2025

For jobs that succeed, does prmon show leaky behavior? e.g. rising memory use over the course of a job?

Haven't looked in detail yet. But it seems that setting the memory upper limit to 3 GB gives a much better failure rate for the SIDIS 10x100 (1.9%) and 5x41 (0.4%) sets.

-- Schedd: osgsub01.sdcc.bnl.gov : <130.199.185.8:9618?... @ 02/02/25 16:19:54
OWNER   BATCH_NAME                                                            SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
rahmans 25.01.1/epic_craterlake/pythia6-eic-1.0.0_10x10                      2/2  09:49  6211   3681    488     12  33150 461.20-1039

Total for query: 4181 jobs; 0 completed, 0 removed, 488 idle, 3681 running, 12 held, 0 suspended
Total for rahmans: 4181 jobs; 0 completed, 0 removed, 488 idle, 3681 running, 12 held, 0 suspended
Total for all users: 4181 jobs; 0 completed, 0 removed, 488 idle, 3681 running, 12 held, 0 suspended
[rahmans1@ap23 job_submission_condor]$ condor_q 25202904


-- Schedd: ap23.uc.osg-htc.org : <192.170.231.144:9618?... @ 02/02/25 15:25:18
OWNER    BATCH_NAME                                  SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
rahmans1 25.01.1/epic_craterlake/pythia6-eic-1.0.   1/30 13:56  13956   1108    448     58  16700 25202904.206-1556

Total for query: 1614 jobs; 0 completed, 0 removed, 448 idle, 1108 running, 58 held, 0 suspended
Total for all users: 66137 jobs; 0 completed, 0 removed, 57596 idle, 7465 running, 1076 held, 0 suspended
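
(For context, the 3 GB cap is just the HTCondor memory request; a minimal submit-description sketch, with the executable name as a hypothetical placeholder:)

universe       = vanilla
# hypothetical wrapper script
executable     = run_job.sh
# raised from the previous 2.5 GB request
request_memory = 3 GB
queue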

@wdconinc (Contributor)

wdconinc commented Feb 2, 2025

Haven't looked in detail yet. But it seems like setting the memory upper limit to 3 G gives much better failure rate

Well, yeah, that's not surprising. But it doesn't resolve the problem and isn't sustainable. We have to figure out why first, and only then decide whether increasing the memory limit is necessary or unavoidable.

@veprbl (Member)

veprbl commented Feb 5, 2025

Looks like most of the memory for baseline DIS CC is consumed in DD4hep, but nothing comes to mind in terms of changes to the geometry.

A bit of memory can be freed by applying #1729; I wonder if that is enough to get us through 25.02.
