
Files deleted by EOS after successful stageOut at T0 #12087

Open · germanfgv opened this issue Aug 29, 2024 · 11 comments

@germanfgv (Contributor)

Impact of the bug
T0 WMAgent, and probably production WMAgent as well

Describe the bug
This is a resurgence of #11998. We see jobs that successfully stage out their output to EOS, but the files are deleted shortly after. Capturing the exit code of xrdcp with #12058 did not help, as the command is successful, returning 0.

How to reproduce it
This seems to happen randomly; however, in the last 24 hours, T0 production has experienced 31 occurrences, including during replays.

Additional context and error message
Log of the successful stage out of a later deleted file:

2024-08-28 07:39:29,312:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root?eos.app=cmst0

2024-08-28 07:39:29,312:INFO:StageOutImpl:Creating output directory...
2024-08-28 07:39:29,313:INFO:StageOutImpl:Running the stage out...
2024-08-28 07:41:27,626:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 48638573
xrdcp exit code: 0
Remote File Size is: 48638573
Local File Checksum is: fd4eb3b5
Remote File Checksum is: fd4eb3b5

stderr: 
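The validation step shown in the log above (comparing local and remote file sizes and adler32 checksums) can be sketched roughly as follows; `local_adler32` is a hypothetical helper for illustration, not the actual StageOut code:

```python
import zlib

def local_adler32(path, chunk_size=1024 * 1024):
    # Compute the adler32 of a local file in chunks, formatted as 8 lowercase
    # hex digits like the "Local File Checksum" line in the log above.
    value = 1  # adler32 seed value
    with open(path, "rb") as f:
        while data := f.read(chunk_size):
            value = zlib.adler32(data, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

A stage out would then be considered valid only if this value and the local size both match what the storage element reports for the remote copy.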
@germanfgv (Contributor, Author)

@ccaffy @amadio could you please describe your hypothesis about the cause of this issue?

@ccaffy commented Aug 29, 2024

Here's what we observed on EOS side:

A first attempt to copy the file to EOS failed, and xrdcp retried internally via its write recovery mechanism. The retry did not close the first attempt's connection, and it managed to successfully copy the file.
xrdcp therefore considered the transfer a success and closed all the connections. When the first connection was closed, since it had failed to transfer the file properly, the file got deleted.

Disabling the write recovery on xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.
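A minimal sketch of how a stage-out wrapper could prepend this setting to the copy command; the helper name and argument handling are illustrative, not the actual WMAgent code:

```python
def build_xrdcp_command(source, destination, disable_write_recovery=True):
    # Build an xrdcp invocation; when requested, prefix it with
    # XRD_WRITERECOVERY=0 so the XRootD client does not silently retry
    # failed writes (the behaviour suspected of causing the deletions).
    cmd = []
    if disable_write_recovery:
        cmd += ["env", "XRD_WRITERECOVERY=0"]
    cmd += ["xrdcp", "--force", "--nopbar", source, destination]
    return cmd
```

Joining the list with spaces reproduces the `env XRD_WRITERECOVERY=0 xrdcp ...` form of the command.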

@germanfgv (Contributor, Author)

@ccaffy could you provide logs where you see this happening? It would be good to have the EOS logs along with the WMAgent logs to get the full picture.

@ccaffy commented Aug 29, 2024

Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:

log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:8236@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871893&otms=106&cts=1724871994&ctms=473&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1315175702&wb_min=261398&wb_max=1048576&wb_sigma=22211.51&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=98.52&iot=101366.961&idt=1501.294&lrt=0.000&lrvt=0.000&lwt=87318.619&ot=7.571&ct=12539.477&rt=5213.35&rvt=0.00&wt=4691.41&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=1.213123:1617@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871893&otms=112&cts=1724871994&ctms=472&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=15.63&iot=101359.914&idt=85518.695&lrt=0.000&lrvt=0.000&lwt=8995.287&ot=0.558&ct=6845.374&rt=6362.97&rvt=0.00&wt=8501.17&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=3.213123:2336@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871771&otms=690&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.12&iot=265308.750&idt=264987.750&lrt=0.000&lrvt=0.000&lwt=319.672&ot=0.617&ct=0.724&rt=0.00&rvt=0.00&wt=298.37&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:2031@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871771&otms=689&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=51380224&wb_min=1048576&wb_max=1048576&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=99.94&iot=265310.062&idt=161.072&lrt=0.000&lrvt=0.000&lwt=114506.528&ot=2.208&ct=150640.255&rt=0.00&rvt=0.00&wt=189.50&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0

The interesting fields for you are ots (open timestamp), cts (close timestamp), td (connection identifier), idt (idle time, i.e. the time in ms during which nothing happened), and delete_on_close: if it equals 1, the file will be deleted on close.
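The report lines above are ampersand-delimited key=value records, so these fields can be pulled out with a small parser (a sketch, assuming only the format visible in the excerpts):

```python
from urllib.parse import parse_qsl

def parse_eos_report(line):
    # Split an EOS FST report line ("key=value&key=value&...") into a dict.
    # keep_blank_values preserves empty fields such as sec.vorg=.
    return dict(parse_qsl(line, keep_blank_values=True))

record = parse_eos_report(
    "ots=1724871771&cts=1724872036&td=3.213123:2336@st-096-ii2019bd"
    "&idt=264987.750&delete_on_close=1"
)
open_to_close_s = int(record["cts"]) - int(record["ots"])  # seconds file was open
was_deleted = record["delete_on_close"] == "1"             # replica dropped on close
```

For the third log line above this gives an open-to-close span of 265 seconds with delete_on_close=1, matching the deleted first connection.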

@germanfgv (Contributor, Author)

For completeness' sake, this is the wmagentJob.log extract for the stage out of this file:

2024-08-28 19:02:51,338:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
2024-08-28 19:02:51,339:INFO:StageOutMgr:===> Attempting 1 Stage Outs
2024-08-28 19:02:51,340:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0

2024-08-28 19:02:51,341:INFO:StageOutImpl:Creating output directory...
2024-08-28 19:02:51,353:INFO:StageOutImpl:Running the stage out...
2024-08-28 19:06:34,882:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 1357118742
xrdcp exit code: 0
Remote File Size is: 1357118742
Local File Checksum is: d9b2f991
Remote File Checksum is: d9b2f991

stderr: 
2024-08-28 19:06:34,883:INFO:StageOutMgr:attempting stageOut
2024-08-28 19:06:34,883:INFO:StageOutMgr:===> Stage Out Successful: {'LFN': '/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root', 'PFN': 'root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0', 'PNN': 'T0_CH_CERN_Disk', 'StageOutCommand': 'xrdcp', 'Checksums': {'adler32': 'd9b2f991', 'cksum': '2524514726'}, 'StageOutReport': []}


@germanfgv (Contributor, Author)

We are currently running a replay with XRD_WRITERECOVERY=0

This is the PR disabling write recovery on the replay:
germanfgv#16

@ccaffy commented Aug 29, 2024

That's great news, thanks!

@amaltaro (Contributor)

@germanfgv do you know against which storage endpoints this issue is happening? By default, write-recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see:
https://gitlab.cern.ch/SITECONF/T0_CH_CERN/-/blob/master/JobConfig/site-local-config.xml#L14
https://gitlab.cern.ch/SITECONF/T2_CH_CERN/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L16

however, it is missing for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth reaching out to the SST to update those site-local-config files.

@germanfgv (Contributor, Author)

We are writing to T0 Disk, so it seems this is not working properly. I've checked the final xrdcp command, and it does not include the expected environment variable. This is an example of the copy command we usually generate:

xrdcp --force --nopbar --cksum adler32:2d8dc345  "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0"

This is an example of the command in the running replay:

env XRD_WRITERECOVERY=0 xrdcp --force --nopbar  "/srv/job/WMTaskSpace/logArch1/logArchive.tar.gz"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/8/29/PromptReco_Run382686_EGamma0_Tier0_REPLAY_2024_ID240829104828_v29104828/Reco/0000/0/a96807d0-7d81-4221-8229-9288d4e1a1d3-0-0-logArchive.tar.gz?eos.app=cmst0" 

@amadio commented Aug 31, 2024
