Files deleted by EOS after successful stageOut at T0 #12087
Comments
Here's what we observed on the EOS side: a first attempt to copy the file to EOS failed, and xrdcp retried internally via its write recovery mechanism. The retry did not close the first attempt, yet it managed to successfully copy the file. Disabling write recovery in xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.
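For illustration, here is a minimal Python sketch of running xrdcp with write recovery disabled through the environment. Only the variable XRD_WRITERECOVERY=0 and the xrdcp command come from this thread; the helper names (build_xrdcp_env, build_xrdcp_command, stage_out) are hypothetical and are not taken from WMCore, and the real stage-out command may pass additional options.

```python
import os
import subprocess


def build_xrdcp_env(base_env=None):
    """Return a copy of the environment with xrdcp write recovery disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # XRD_WRITERECOVERY=0 is the setting suggested in this thread.
    env["XRD_WRITERECOVERY"] = "0"
    return env


def build_xrdcp_command(source, destination):
    """Build a bare xrdcp invocation (hypothetical helper, no extra flags)."""
    return ["xrdcp", source, destination]


def stage_out(source, destination, runner=subprocess.run):
    """Run xrdcp with write recovery disabled and return its exit code.

    The runner argument is injectable so the function can be exercised
    without a real xrdcp binary.
    """
    result = runner(
        build_xrdcp_command(source, destination),
        env=build_xrdcp_env(),
        check=False,
    )
    return result.returncode
```

Note that, as reported in the issue description, an exit code of 0 alone did not guarantee that the file stayed on EOS, so the return value here should not be treated as proof of a durable copy.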
@ccaffy could you provide logs where you see this happening? It would be good to have EOS logs along with WMAgent logs to get the full picture.
Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:
What is interesting for you is the
For completeness' sake, this is the
We are currently running a replay with So far, no issues
This is the PR disabling write recovery on the replay:
That's great news, thanks!
@germanfgv do you know against which storage endpoints this issue is happening? By default, write recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see: However, we are missing it for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth reaching out to the SST to update those site-local-config files.
We are writing to T0 Disk, so it seems this is not working properly. I've checked the final
This is an example of the command in the running replay:
Impact of the bug
T0 WMAgent, probably WMAgent too
Describe the bug
This is a resurgence of #11998. We see jobs that successfully stage out their output to EOS, but the files are deleted shortly after. Capturing the exit code of xrdcp with #12058 did not help, as the command is successful and returns 0.
How to reproduce it
This seems to happen randomly. However, in the last 24h, T0 production has experienced 31 occurrences, including during replays.
Additional context and error message
Log of the successful stage out of a later deleted file: