MSUnmerged: Manual cleanup of /store/unmerged area at T2_CH_CERN #11904

Closed
todor-ivanov opened this issue Feb 19, 2024 · 11 comments · May be fixed by #11916

@todor-ivanov
Contributor

Impact of the bug
T2_CH_CERN site

Describe the bug
As a consequence of bug #11893, we have accumulated almost 1 PB of data in the /store/unmerged area at T2_CH_CERN. Combined with the long-delayed fix for the Gfal recursive errors, this has also left a lot of empty directories accumulating. These two factors make an initial manual intervention necessary to clean them up, either through

  • a standalone run of the MSUnmerged at T2_CH_CERN
  • or by manually implementing the logic of filtering out the protected LFNs and doing the deletions directly on an /eos mount point.

For the latter, we have to either ask the VOC for delete permissions on the /eos/cms/store/unmerged area, or provide a list of deletable objects and let the VOC or the DM Team do the deletions.
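
To make the second option more concrete, here is a minimal sketch of what "filtering out the protected LFNs" and spotting empty directories on a mounted tree could look like. This is not the MSUnmerged implementation: the function names (`is_protected`, `filter_deletable`, `find_empty_dirs`) are hypothetical, and it assumes the protected LFNs are available as a plain list of path prefixes. It only builds lists and deletes nothing, so the output could be handed over for review first.

```python
import os
from typing import Iterable, List


def is_protected(lfn: str, protected_lfns: Iterable[str]) -> bool:
    """Return True if lfn equals a protected LFN or lies underneath one."""
    lfn = lfn.rstrip("/")
    for prot in protected_lfns:
        prot = prot.rstrip("/")
        # Compare on whole path components, so that a protected
        # ".../keep" does not also shield ".../keeper".
        if lfn == prot or lfn.startswith(prot + "/"):
            return True
    return False


def filter_deletable(candidates: Iterable[str],
                     protected_lfns: Iterable[str]) -> List[str]:
    """Keep only the candidates that are not covered by a protected LFN."""
    protected = list(protected_lfns)
    return [lfn for lfn in candidates if not is_protected(lfn, protected)]


def find_empty_dirs(root: str) -> List[str]:
    """Return directories below root that hold no files at any depth.

    The walk is bottom-up, so every child is classified before its parent.
    The root itself is never reported.
    """
    empty = set()
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        children_empty = all(os.path.join(dirpath, d) in empty for d in dirnames)
        if not filenames and children_empty:
            empty.add(dirpath)
    return sorted(empty)
```

The prefix match is deliberately conservative: anything at or below a protected LFN is kept. Symlinked directories are not followed by `os.walk` by default, so a parent containing one is treated as non-empty.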

How to reproduce it
N/A

Expected behavior
Follow up on the manual intervention and delete all unneeded objects

Additional context and error message
None

@todor-ivanov
Contributor Author

todor-ivanov commented Feb 20, 2024

And here is the Jira ticket requesting access to the /store/unmerged area at CERN for this cleanup process: https://its.cern.ch/jira/projects/CMSVOC/issues/CMSVOC-530

Once we are done with the manual cleanup, we should come back and cross-check whether the permission errors from this Jira ticket are still present: https://its.cern.ch/jira/browse/CMSVOC-491. As also mentioned in the ticket, we currently cannot even load the needed objects in memory and run the MSUnmerged service for T2_CH_CERN, because whenever we reach the site the watchdog kills the thread [1]

[1]

reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log:2024-02-17 10:40:31,818:INFO:MSUnmerged: RSE: T2_CH_CERN Reading rse data from MongoDB.
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log:2024-02-17 10:40:31,820:INFO:MSUnmerged: RSE: T2_CH_CERN With old consistency record in Rucio Consistency Monitor. But the RSE has NOT been fully cleaned during the last Rucio Consistency Monitor polling cycle.Retrying cleanup in the current run.
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log:2024-02-17 10:40:31,820:DEBUG:RucioConMon: Fetching data from files?rse=T2_CH_CERN&format=json, with args None
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log-2024-02-17 10:40:31,820:DEBUG:Service: getData: 
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log:   url: files?rse=T2_CH_CERN&format=json
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log-   verb: GET
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log-   incoming_headers: {}
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log-   data: {}
reqmgr2ms-20240217-ms-unmer-t2t3-cc9454865-mrk4n.log-[17/Feb/2024:10:40:43]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting
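
For anyone wanting to check how often this happens and for which site, a small sketch along these lines could scan the service logs. It assumes the log lines look like the excerpt above (an `MSUnmerged: RSE: <name>` line followed later by the `WATCHDOG: server exited with exit code signal <N>` line). The function name `find_watchdog_restarts` is hypothetical and not part of the service.

```python
import re
import signal
from typing import Iterable, List, Optional, Tuple

# Patterns derived from the log excerpt above; other log formats are not handled.
RSE_RE = re.compile(r"MSUnmerged: RSE: (\S+)")
WATCHDOG_RE = re.compile(r"WATCHDOG: server exited with exit code signal (\d+)")


def find_watchdog_restarts(
    lines: Iterable[str],
) -> List[Tuple[Optional[str], int, str]]:
    """Return (last RSE seen, signal number, signal name) per watchdog restart."""
    last_rse = None
    hits = []
    for line in lines:
        rse_match = RSE_RE.search(line)
        if rse_match:
            # Remember the most recent RSE the service was working on.
            last_rse = rse_match.group(1)
            continue
        wd_match = WATCHDOG_RE.search(line)
        if wd_match:
            signum = int(wd_match.group(1))
            try:
                name = signal.Signals(signum).name
            except ValueError:
                name = "UNKNOWN"
            hits.append((last_rse, signum, name))
    return hits
```

The RSE attribution is only a heuristic: it reports the last RSE mentioned before the restart, which matches the excerpt above but does not prove that RSE caused the crash.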

@amaltaro
Contributor

@todor-ivanov before getting write access to the unmerged area, why don't you temporarily increase the resource requirements of the service, so that it can get through the first run and return to a more normal load?

Please also help me understand why we have accumulated so much unneeded data at CERN. Is it:
a) because we were never able to remove directories, hence leaving directories behind since day 1.
b) because at some point we were no longer able to remove directories due to some misconfiguration (change in the MSUnmerged certificate, perhaps?)
c) because it's been long since we managed to remove anything at T2_CH_CERN
d) because of the migration to webdav
e) or all of the above?

@todor-ivanov
Contributor Author

Only the combination of the last 3:

b) because at some point we were no longer able to remove directories due to some misconfiguration (change in the MSUnmerged certificate, perhaps?)
c) because it's been long since we managed to remove anything at T2_CH_CERN
d) because of the migration to webdav

We did have a certificate issue for EOS in the past, but definitely not from day one. The Jira ticket for that is https://its.cern.ch/jira/browse/CMSVOC-491 and it is not clear to me whether the VOC has done anything to re-map this certificate to an account with proper write access to the unmerged area. We cannot test it until we clean up some space, because of the WATCHDOG kills happening in the service.

@amaltaro
Contributor

Given the lack of communication in the ticket #491 above, should we re-open it?

For the watchdog kill, are you seeing it in kubernetes or in your own environment?

@todor-ivanov
Contributor Author

todor-ivanov commented Feb 20, 2024

> For the watchdog kill, are you seeing it in kubernetes or in your own environment?

It is our own watchdog that is killing the process with SIGSEGV (signal 11), which is basically a sign of memory depletion.

I do not have the permissions to reopen this ticket. I hope @arooshap does: https://its.cern.ch/jira/browse/CMSVOC-491?focusedId=6244590&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-6244590

@klannon

klannon commented Feb 26, 2024

I don't understand why this is a high priority issue. This seems like it should be an "Operations" issue. We didn't plan for this. It's extra and only being considered because of its operational impact. @todor-ivanov, if this is correct, can you please recategorize this as an Operations issue?

@todor-ivanov
Contributor Author

todor-ivanov commented Feb 27, 2024

Hi @klannon, I did categorize it and label it as Operations from the very beginning. The only thing I can do is completely remove the QPrio field and hope that this solves the miscategorization problem.

@klannon

klannon commented Feb 29, 2024

@todor-ivanov That is not correct. You need to set QPrio to "Operations." I have just done that for this issue.

@todor-ivanov
Contributor Author

Finally, CERN is clean: [1]
For more technical details on how the actual cleanup process went, including the code improvements I made, please read the information provided with the associated (but not merged) PR: #11916 and this comment in particular: #11916 (comment)

I am closing this issue now.

[1]
https://monit-grafana.cern.ch/goto/BrePj6oSg?orgId=11

[Image: EosCmsStoreUnmerged_2024-03-01_16-09-55]

@amaltaro
Contributor

amaltaro commented Mar 1, 2024

Awesome! Should we reinsert T2_CH_CERN into the MSUnmerged configuration once the unmerged consistency monitor dump is down to a more manageable size?

@todor-ivanov
Contributor Author

I think we can do it immediately.
