
Add MSUnmerged initStandalone && Read AllUnmerged from file #11916

Open · wants to merge 2 commits into master

Conversation

@todor-ivanov (Contributor) commented Feb 29, 2024

Fixes #11904

Status

Description

With this PR I provide an initialization script for running the MSUnmerged service standalone. It requires the service config files in order to run. An additional feature is the ability to read all unmerged files from disk, so that we avoid loading huge lists, on the order of GBs, into memory.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

service_configs
WMCore virtual environment:

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 59 warnings and errors that must be fixed
    • 1 warnings
    • 42 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 18 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14929/artifact/artifacts/PullRequestReport.html

@todor-ivanov (Contributor, Author)

@amaltaro Take a look at these two functions from the current PR:

They are basically the same as the respective methods of the MSUnmerged class:

def getUnmergedFiles(self, rse):

and

def filterUnmergedFiles(self, rse):

The only difference is that they do all the work through I/O operations on disk instead of in memory. When it comes to huge lists on the order of GBs, this improves the service performance and memory consumption tremendously. I was actually surprised to see that the running speed did not degrade as much as I expected. I am considering adding this functionality as optional parameters to the two already existing methods. What do you think?
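
For illustration only, a minimal sketch (not the PR code itself; the function names, the dump layout and the path depth are assumptions) of the general idea of streaming the ConMon dump from disk instead of keeping the full LFN list in memory:

    # A sketch, assuming the ConMon dump is a plain text file with one LFN per line.
    def shortenLfn(lfn, depth=6):
        """Truncate an LFN to its first `depth` path components (the depth is an assumption)."""
        return '/'.join(lfn.split('/')[:depth + 1])

    def getUnmergedDirsFromFile(conMonDumpPath, depth=6):
        """Stream the dump line by line and build the set of shortened directory
        paths, without ever loading the whole file into memory."""
        unmergedDirs = set()
        with open(conMonDumpPath, 'r') as fd:
            for line in fd:
                lfn = line.strip()
                if lfn:
                    unmergedDirs.add(shortenLfn(lfn, depth))
        return unmergedDirs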

@todor-ivanov (Contributor, Author) commented Feb 29, 2024

With my latest commit, I finally got the direct deletions with the os-based library right. Together with the LFN deletions, I now also update the RSE counters, so that upon the RSE cleanup we have everything tracked just as it was with gfal and we can upload to MongoDB the document for the finally cleaned RSE. This way the manual cleanup will not get in the way of properly tracking the cleanup history of the RSE itself, and will not interfere with the cleanup sequence later, when the RSE is to be cleaned automatically again.
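
As a rough sketch only (the helper name and signature are assumptions, shutil is used here as one standard-library option, and the counter keys simply mirror the status document shown further below), the deletion-plus-bookkeeping idea could look like:

    import shutil

    def deleteDirAndCount(dirPath, counters):
        """Remove a directory tree with the standard library and update the RSE
        counters, so the cleanup history stays properly tracked."""
        try:
            shutil.rmtree(dirPath)
            counters['dirsDeletedSuccess'] += 1
        except OSError:
            counters['dirsDeletedFail'] += 1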

I have tested everything with 50 directories. So far so good. Tomorrow morning I'll shoot for cleaning the whole unmerged area at T2_CH_CERN.

FYI @amaltaro

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 63 warnings and errors that must be fixed
    • 8 warnings
    • 82 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14933/artifact/artifacts/PullRequestReport.html

@todor-ivanov (Contributor, Author)

@amaltaro No need for a detailed review. Just try to grasp the idea of the two big improvements this code provides.

We may consider adding these functionalities to the main code as well. Refactoring them into proper MSUnmerged methods is not too much work.

@todor-ivanov (Contributor, Author) commented Mar 1, 2024

The initStandalone.py script just finished running. It took almost 3.5 h [1]. The RSE is clean and the status document was properly uploaded to MongoDB:

{"result": [
 {
  "wmcore_version": "2.3.1",
  "microservice_version": "2.3.1",
  "microservice": "MSManager",
  "query": "rse=T2_CH_CERN&detail=False",
  "rseData": [
    {
      "name": "T2_CH_CERN",
      "pfnPrefix": "/eos/cms",
      "rucioConMonStatus": null,
      "isClean": false,
      "timestamps": {
        "rseConsStatTime": "2023-11-02T16:06:26.189066",
        "prevStartTime": "2024-03-01T09:54:28.719355",
        "startTime": "2024-03-01T10:05:24.840614",
        "prevEndTime": "2024-03-01T09:56:10.504660",
        "endTime": "2024-03-01T13:23:18.124024"
      },
      "counters": {
        "totalNumFiles": 0,
        "totalNumDirs": 24501,
        "dirsToDelete": 23678,
        "filesToDelete": 0,
        "filesDeletedSuccess": 0,
        "filesDeletedFail": 0,
        "dirsDeletedSuccess": 23678,
        "dirsDeletedFail": 0,
        "gfalErrors": {}
      }
    }
  ]
}]}

NOTE: Please notice the mistaken field isClean: false. This is because I just forgot to update it before exiting the cleanup cycle and uploading the document to MongoDB. I am fixing it with my latest commit. The RSE really is clean: the number of successfully deleted directories (rseData->counters->dirsDeletedSuccess) matches rseData->counters->dirsToDelete.

I also upload the full log from this cleanup process here: initStandalone-T2_CH_CERN-2024-03-01T11:05+01:00-RealRun.log

Here: #11904 (comment) is the report about the amount of space freed.

[1]

(WMCore.MSUnmergedStandalone) [user@unit01 srv]$ time ipython WMCore/src/python/WMCore/MicroService/MSUnmerged/initStandalone.py > initStandalone-`date -Im`-RealRun.log & 
[1] 1397433

(WMCore.MSUnmergedStandalone) [user@unit01 srv]$ 
real	201m11.981s
user	2m3.947s
sys	5m6.828s

[1]+  Done                    time ipython WMCore/src/python/WMCore/MicroService/MSUnmerged/initStandalone.py > initStandalone-`date -Im`-RealRun.log

@amaltaro (Contributor) left a comment

Todor, regarding the smaller memory footprint improvements that you provided here: I understand that the bulk of that implementation is in the getUnmergedfromFile and filterUnmergedFromFile methods, right?

In plain English, my understanding of the logic for parsing the files from the consistency monitoring is:

  1. for each file from the ConMon dump, shorten it to a given path length and add it to a set of LFN paths (the dirs.allUnmerged field)
  2. remove LFN paths from dirs.allUnmerged if they are also present in the protected list
  3. then iterate over each LFN path in dirs.allUnmerged and
    a) open the ConMon dump, and scan each file/line for the LFN path, yielding those lines

Which means, if we have 10k unique directories, we would open that file 10k times and scan it as a whole 10k times. Even though that is doable, I think we can come up with other options.

One of those could be:

  1. either make a feature request to the Rucio ConMon to provide a sorted list of files, or open the ConMon dump and sort files by name (well, by LFN path)
  2. slice the ConMon dump by X files/lines (e.g. 50k entries)
  3. execute the MSUnmerged logic on each slice separately

What do you think?
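
For reference, a minimal sketch of what slicing the dump by a fixed number of lines could look like (purely illustrative; the chunk size and function name are assumptions):

    from itertools import islice

    def sliceConMonDump(conMonDumpPath, chunkSize=50000):
        """Yield the ConMon dump in chunks of at most `chunkSize` lines,
        so each slice can be processed independently."""
        with open(conMonDumpPath, 'r') as fd:
            while True:
                chunk = list(islice(fd, chunkSize))
                if not chunk:
                    break
                yield chunk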

@todor-ivanov (Contributor, Author)

Hi @amaltaro,

I understand that the bulk of that implementation is in the getUnmergedfromFile and filterUnmergedFromFile methods, right?

Correct

for each file from the ConMon dump, shorten it to a given path length and add it to a set of LFN paths (the dirs.allUnmerged field)

Yes, but this is not new; this is how things work right now as well.

remove LFN paths from dirs.allUnmerged if they are also present in the protected list

No, this is not correct. The lists in dirs.allUnmerged are never touched, and they are not too long in the first place. Once the huge file lists are reduced to a few levels up in the tree and merged into a set (in the previous step), we usually end up with a few hundred (at most a thousand) records. We do not have sites which hold more than a few hundred workflows, and in this process we end up with fewer than 10 records per workflow on average. So this is really not a big record, and it is not changed by this code.

then iterate over each LFN path in dirs.allUnmerged and
a) open the ConMon dump, and scan each file/line for the LFN path, yielding those lines

Again no. If you refer to this: https://github.com/dmwm/WMCore/pull/11916/files#diff-d829adc30e84e637ed37c8770f8a0f84eb357772a5f210b79c8ba0f385091dfdR353-R360

    # Now create the filters for rse['files']['toDelete'] - those should be pure generators
    # A simple generator:
    def genFunc(pattern, filePath):
        with open(filePath, 'r') as fd:
            for line in fd:
                if line.startswith(pattern):
                    yield line.rstrip()

These are simple closures which are used as file-list generators (indeed, by opening and parsing the main file holding the full list of LFNs at the site), but none of them is actually executed at this point. They are only function definitions recorded in the dictionary of top-level directory paths: the key is the top-level directory itself, and the value is a reference to the place in memory where the function definition lives. No file is actually opened during this process of filling the records in the dictionary, and no file descriptors are created at this stage.

We only get to execute the actual function, and hence open the main file and iterate through it (in order to filter the lists of LFNs), for those deletions where we fail to delete the topmost path due to errors of the sort "Directory not empty". Then, and only then, do we need to enter the recursive operations for purging the whole tree starting from the leaves, and only then do we need to open and read the file with the LFNs. During the whole process of cleaning CERN this did not happen even once, but of course there are going to be sites where it will happen.

And again, this is not new. This was exactly the same mechanism in the past; the only difference is that we were holding the huge list of LFNs in memory and iterating through it via a pointer to that list in memory. Now we do the same, but we just keep a file descriptor instead, and we sacrifice fast access to memory for I/O operations to disk (only for extreme cases and definitely not for all).
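
To make the lazy-evaluation point concrete, a tiny standalone illustration (not the PR code; the file name and directory paths are made up):

    # Purely illustrative: the dictionary below only stores generator objects
    # produced by genFunc; creating a generator does not run the function body,
    # so no file is opened and no file descriptor exists until something
    # actually iterates over it.
    def genFunc(pattern, filePath):
        with open(filePath, 'r') as fd:
            for line in fd:
                if line.startswith(pattern):
                    yield line.rstrip()

    conMonDump = 'rse_unmerged_dump.txt'   # assumed file name
    topLevelDirs = ['/store/unmerged/DirA', '/store/unmerged/DirB']

    # Filling the map is cheap: no I/O happens here.
    filesToDelete = {dirLfn: genFunc(dirLfn, conMonDump) for dirLfn in topLevelDirs}

    # Only if the top-level removal fails with "Directory not empty" do we
    # consume the generator, which is the moment the dump gets opened and scanned:
    # for lfn in filesToDelete['/store/unmerged/DirA']:
    #     ... delete lfn ...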

Which means, if we have 10k unique directories, we would open that file 10k times and scan it as a whole 10k times.

No, we are not doing that; I explained the process above.

either make a feature request to the Rucio ConMon to provide a sorted list of files, or open the ConMon dump and sort files by name (well, by LFN path)

That may help, and we could use various search optimization algorithms, but indeed starting with a sorted list would be much faster.
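
If the dump were sorted by LFN, a single sequential pass could already group files per directory; a sketch, assuming a dump sorted by LFN and the same path-shortening convention as above:

    from itertools import groupby

    def filesPerDir(sortedDumpPath, depth=6):
        """One pass over a dump sorted by LFN, grouping files per shortened directory."""
        def shortenLfn(lfn):
            return '/'.join(lfn.split('/')[:depth + 1])
        with open(sortedDumpPath, 'r') as fd:
            lfns = (line.rstrip() for line in fd if line.strip())
            for dirLfn, group in groupby(lfns, key=shortenLfn):
                yield dirLfn, list(group)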

slice the ConMon dump by X files/lines (e.g. 50k entries)
execute the MSUnmerged logic on each slice separately

This was my exact approach at the beginning, and I ran into complications with properly doing the bookkeeping in the database. Those were not that difficult to sort out, but they would require additional code. In the end I went boldly for working with the whole (the largest seen so far) list of LFNs for CERN, and it did not take more than a few hundred megabytes in memory, even though the file itself was on the order of GBs.


@cmsdmwmbot

Can one of the admins verify this patch?


Successfully merging this pull request may close these issues.

MSUnmerged: Manual cleanup of /store/unmerged area at T2_CH_CERN