
Files deleted by EOS after successful stageOut at T0 #12087

Open · germanfgv opened this issue Aug 29, 2024 · 11 comments

@germanfgv (Contributor)

Impact of the bug
T0 WMAgent, and probably production WMAgent as well

Describe the bug
This is a resurgence of #11998. We see jobs that successfully stage out their output to EOS, but the files are deleted shortly after. Capturing the exit code of xrdcp with #12058 did not help, as the command is successful, returning 0.

How to reproduce it
This seems to happen randomly; however, in the last 24 hours, T0 production has experienced 31 occurrences, including during replays.

Additional context and error message
Log of the successful stage out of a later deleted file:

2024-08-28 07:39:29,312:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root?eos.app=cmst0

2024-08-28 07:39:29,312:INFO:StageOutImpl:Creating output directory...
2024-08-28 07:39:29,313:INFO:StageOutImpl:Running the stage out...
2024-08-28 07:41:27,626:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 48638573
xrdcp exit code: 0
Remote File Size is: 48638573
Local File Checksum is: fd4eb3b5
Remote File Checksum is: fd4eb3b5

stderr: 
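The validation step shown in the log above (comparing local and remote file sizes and adler32 checksums) can be sketched roughly as follows; `local_adler32` is a hypothetical helper for illustration, not the actual StageOut code:

```python
import zlib

def local_adler32(path, chunk_size=1024 * 1024):
    # Compute the adler32 of a local file in chunks, formatted as 8 lowercase
    # hex digits like the "Local File Checksum" line in the log above.
    value = 1  # adler32 seed value
    with open(path, "rb") as f:
        while data := f.read(chunk_size):
            value = zlib.adler32(data, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

A stage out would then be considered valid only if this value and the local size both match what the storage element reports for the remote copy.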
@germanfgv (Contributor, Author)

@ccaffy @amadio could you please describe your hypothesis about the cause of this issue?

@ccaffy commented Aug 29, 2024

Here's what we observed on EOS side:

A first attempt to copy the file to EOS failed, and xrdcp retried internally via its write recovery mechanism. The retry did not close the first attempt's connection, and it managed to successfully copy the file.
xrdcp therefore considered the transfer a success and closed all the connections. When the first connection was closed, since it had failed to transfer the file properly, the file got deleted.

Disabling the write recovery on xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.
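A minimal sketch of how a stage-out wrapper could prepend this setting to the copy command; the helper name and argument handling are illustrative, not the actual WMAgent code:

```python
def build_xrdcp_command(source, destination, disable_write_recovery=True):
    # Build an xrdcp invocation; when requested, prefix it with
    # XRD_WRITERECOVERY=0 so the XRootD client does not silently retry
    # failed writes (the behaviour suspected of causing the deletions).
    cmd = []
    if disable_write_recovery:
        cmd += ["env", "XRD_WRITERECOVERY=0"]
    cmd += ["xrdcp", "--force", "--nopbar", source, destination]
    return cmd
```

Joining the list with spaces reproduces the `env XRD_WRITERECOVERY=0 xrdcp ...` form of the command.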

@germanfgv (Contributor, Author)

@ccaffy could you provide logs where you see this happening? It would be good to have the EOS logs along with the WMAgent logs to get the full picture.

@ccaffy commented Aug 29, 2024

Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:

log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:8236@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871893&otms=106&cts=1724871994&ctms=473&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1315175702&wb_min=261398&wb_max=1048576&wb_sigma=22211.51&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=98.52&iot=101366.961&idt=1501.294&lrt=0.000&lrvt=0.000&lwt=87318.619&ot=7.571&ct=12539.477&rt=5213.35&rvt=0.00&wt=4691.41&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=1.213123:1617@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871893&otms=112&cts=1724871994&ctms=472&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=15.63&iot=101359.914&idt=85518.695&lrt=0.000&lrvt=0.000&lwt=8995.287&ot=0.558&ct=6845.374&rt=6362.97&rvt=0.00&wt=8501.17&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=3.213123:2336@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871771&otms=690&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.12&iot=265308.750&idt=264987.750&lrt=0.000&lrvt=0.000&lwt=319.672&ot=0.617&ct=0.724&rt=0.00&rvt=0.00&wt=298.37&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:2031@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871771&otms=689&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=51380224&wb_min=1048576&wb_max=1048576&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=99.94&iot=265310.062&idt=161.072&lrt=0.000&lrvt=0.000&lwt=114506.528&ot=2.208&ct=150640.255&rt=0.00&rvt=0.00&wt=189.50&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0

The interesting fields for you are ots (open timestamp), cts (close timestamp), td (connection identifier), idt (idle time, i.e. the time in ms during which nothing happened), and delete_on_close: if it equals 1, the file will be deleted on close.
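The report lines above are ampersand-delimited key=value records, so these fields can be pulled out with a small parser (a sketch, assuming only the format visible in the excerpts):

```python
from urllib.parse import parse_qsl

def parse_eos_report(line):
    # Split an EOS FST report line ("key=value&key=value&...") into a dict.
    # keep_blank_values preserves empty fields such as sec.vorg=.
    return dict(parse_qsl(line, keep_blank_values=True))

record = parse_eos_report(
    "ots=1724871771&cts=1724872036&td=3.213123:2336@st-096-ii2019bd"
    "&idt=264987.750&delete_on_close=1"
)
open_to_close_s = int(record["cts"]) - int(record["ots"])  # seconds file was open
was_deleted = record["delete_on_close"] == "1"             # replica dropped on close
```

For the third log line above this gives an open-to-close span of 265 seconds with delete_on_close=1, matching the deleted first connection.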

@germanfgv (Contributor, Author)

For completeness' sake, this is the wmagentJob.log extract for the stage out of this file:

2024-08-28 19:02:51,338:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
2024-08-28 19:02:51,339:INFO:StageOutMgr:===> Attempting 1 Stage Outs
2024-08-28 19:02:51,340:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0

2024-08-28 19:02:51,341:INFO:StageOutImpl:Creating output directory...
2024-08-28 19:02:51,353:INFO:StageOutImpl:Running the stage out...
2024-08-28 19:06:34,882:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 1357118742
xrdcp exit code: 0
Remote File Size is: 1357118742
Local File Checksum is: d9b2f991
Remote File Checksum is: d9b2f991

stderr: 
2024-08-28 19:06:34,883:INFO:StageOutMgr:attempting stageOut
2024-08-28 19:06:34,883:INFO:StageOutMgr:===> Stage Out Successful: {'LFN': '/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root', 'PFN': 'root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0', 'PNN': 'T0_CH_CERN_Disk', 'StageOutCommand': 'xrdcp', 'Checksums': {'adler32': 'd9b2f991', 'cksum': '2524514726'}, 'StageOutReport': []}


@germanfgv (Contributor, Author)

We are currently running a replay with XRD_WRITERECOVERY=0

This is the PR disabling write recovery on the replay:
germanfgv#16

@ccaffy commented Aug 29, 2024

That's great news, thanks!

@amaltaro (Contributor)

@germanfgv do you know against which storage endpoints this issue is happening? By default, write-recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see:
https://gitlab.cern.ch/SITECONF/T0_CH_CERN/-/blob/master/JobConfig/site-local-config.xml#L14
https://gitlab.cern.ch/SITECONF/T2_CH_CERN/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L16

however, it is missing for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth reaching out to the SST to update those site-local-config files.

@germanfgv (Contributor, Author)

We are writing to T0 Disk, so it seems this is not working properly. I've checked the final xrdcp command, and it does not include the expected environment variable. This is an example of the copy command we usually generate:

xrdcp --force --nopbar --cksum adler32:2d8dc345  "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0"

This is an example of the command in the running replay:

env XRD_WRITERECOVERY=0 xrdcp --force --nopbar  "/srv/job/WMTaskSpace/logArch1/logArchive.tar.gz"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/8/29/PromptReco_Run382686_EGamma0_Tier0_REPLAY_2024_ID240829104828_v29104828/Reco/0000/0/a96807d0-7d81-4221-8229-9288d4e1a1d3-0-0-logArchive.tar.gz?eos.app=cmst0" 

@amadio commented Aug 31, 2024
