Consume raw/generator unmerged dump data in MSUnmerged #12059
base: master
Conversation
I have this patch applied to the production pod, named
At least CNAF and JINR are right now failing with permission issues.
The auth/authz issues that we had before were actually coming from a missing environment variable in the manage script (X509_USER_PROXY). A fix has been provided here: dmwm/CMSKubernetes#1532. Regarding FNAL, I have been testing both the
Lastly, in terms of network traffic when pulling the unmerged dump for FNAL, we can see an outstanding improvement both in time and in data volume. Here are the results:
filterUnmergedFiles method no longer exists
Fix check for isDeletable
Fix key name for dirsDeletedFail
check if ctx object exist before freeing it
temporarily remove integration tag for unit tests
fix RucioConMon unit test
fix MSUnmerged unit tests
resolve MSUnmerged unit tests
I might have to do some polishing, but I think the bulk logic is in place now and I welcome any feedback.
@@ -0,0 +1,42 @@
#!/usr/bin/env python
import logging
SELF-REMINDER: This script will be removed before merging this PR.
The code seems very complex to me and very hard to follow; to properly define its logic, one needs to dig very deep into the code flow. I suggest further refactoring the code into smaller functions with a clear scope. For example, if a function is given an fobj, which can be either a file or a directory, the code can be re-organized to use a concurrent pattern on the nested directory structure. The function behavior can then be well defined with respect to its action, e.g.:
- for a directory, get the list of its content and call itself
- for a file, unlink the file and return the status
Without knowing the exact logic of how the clean-up procedure should be done and which exceptions should be made, reviewing the logic is very cumbersome.
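To make the suggested shape concrete, here is a minimal sketch of such a recursive helper, assuming a gfal2-like context object (ctx) exposing stat/listdir/unlink/rmdir; the names and error handling are illustrative, not the actual MSUnmerged API:

```python
import logging
import stat

def removeFobj(ctx, fobj):
    """Delete fobj (a file or a directory) and return True on success."""
    try:
        if stat.S_ISDIR(ctx.stat(fobj).st_mode):
            # for a directory: get the list of its content and call itself
            for entry in ctx.listdir(fobj):
                removeFobj(ctx, fobj.rstrip("/") + "/" + entry)
            ctx.rmdir(fobj)
        else:
            # for a file: unlink the file and return the status
            ctx.unlink(fobj)
        return True
    except Exception as exc:
        logging.error("Failed to remove %s: %s", fobj, str(exc))
        return False
```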
if dirLfn in rse['dirs']['deletedSuccess']:
    self.logger.info("RSE: %s, dir: %s already successfully deleted.", rse['name'], dirLfn)
    continue
for idx, dirLfn in enumerate(rse['dirs']['toDelete']):
As far as I can tell, the toDelete part of the rse['dirs'] object is never cleaned up, even though the logic below deletes dirs/files.
Instead of a loop, I suggest clearly defining function(s) to delete an object. The function can take an fobj (either a dir or a file), perform the operation and return the status. The code can then be parallelized (to speed up nested operations) and the deletions can be done concurrently.
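A hedged sketch of that idea, with a per-object function and a thread-pool wrapper (deleteOne/deleteMany are hypothetical names, and whether one storage context can safely be shared across threads is an unverified assumption here):

```python
from concurrent.futures import ThreadPoolExecutor

def deleteOne(ctx, fobj):
    """Delete a single object through a gfal2-like context; return the status."""
    try:
        ctx.unlink(fobj)
        return True
    except Exception:
        return False

def deleteMany(ctx, fobjs, maxWorkers=4):
    """Delete independent entries concurrently; return a per-entry status map."""
    with ThreadPoolExecutor(max_workers=maxWorkers) as pool:
        # each entry is handled by a worker thread; results come back in input order
        statuses = list(pool.map(lambda fobj: deleteOne(ctx, fobj), fobjs))
    return dict(zip(fobjs, statuses))
```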
@@ -306,9 +304,24 @@ def _execute(self, rseList):
def cleanRSE(self, rse):
    """
    The method to implement the actual deletion of files for an RSE.
    Order of deletion attempts is:
It would be beneficial to clearly define the logic of deletion; here you outline the steps rather than the logic. The underlying code seems very complex, with different rules for what it should delete and how. I suggest providing a description of that logic in the docstring.
Minor comment: the code relies on class methods/functions with leading underscores. I expect those are meant as protected methods that are not directly accessible, but Python provides no such isolation and the notation does not enforce it either. I don't know the rationale behind this notation.
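As an illustration of what such a docstring could state, using only the deletion order already given in this PR's description (the exact rules would have to come from the author):

```python
def cleanRSE(self, rse):
    """
    Delete unmerged files and directories for a single RSE.

    Deletion logic (as summarized in the PR description; exact rules may differ):
      1. try to delete each directory in rse['dirs']['toDelete'] as a whole;
      2. if that fails, list the directory content and delete it in slices;
      3. finally, try to remove the (now) empty directory.
    Directories already recorded in rse['dirs']['deletedSuccess'] are skipped.
    """
```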
I appreciate your code review, thanks Valentin! Making this code parallel is out of scope; the main issues to be addressed are memory consumption and removal optimizations where possible. As you stated, it is already very complex and I would rather not make further changes that increase that even more. I will see how to break some implementations into smaller functions/methods, and I will also provide a diagram of the data removal.
hi @amaltaro, in regards to:
May I ask two questions?
Let me put it this way: I was surprised to see you refactoring the logic of the service starting from an issue about creating a streamer to feed the component. This muddles things a lot, and we could never see the benefits, or even the need, of this. Such efforts should be separated. As we know, in the team we have always requested high issue granularity and splitting effort by subject; I have been constantly flagged to split code changes across PRs with this very same argument. So, in order to continue, my suggestion would be to split the changes thematically into two issues:
Then we measure CPU and memory consumption (measuring only one of the two gives no clear picture) before and after applying the patches to a standalone instance of the service/component (definitely not in Kubernetes); this is the only way to see the actual effect. And I stand behind my words on this: we cannot measure component resource consumption only by monitoring Kubernetes cluster behavior. We should profile the code in much greater detail, which is exactly what I was doing constantly, on every change before merging, for at least a month back when I was including these optimizations in the code.
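For what it is worth, a bare-bones way to sample both numbers around a standalone cycle could look like this (psutil is an assumption here; any profiler reporting RSS and CPU time would do):

```python
import os
import time
import psutil  # assumed to be available; any RSS/CPU profiler would work

proc = psutil.Process(os.getpid())

def snapshot(label):
    """Print resident memory and accumulated CPU time at a given point."""
    cpu = proc.cpu_times()
    rss = proc.memory_info().rss / 1024.0 / 1024.0
    print("%s: RSS=%.1f MB, CPU user=%.1fs sys=%.1fs" % (label, rss, cpu.user, cpu.system))

snapshot("before cycle")
time.sleep(0.1)  # stand-in for one standalone MSUnmerged polling cycle
snapshot("after cycle")
```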
One more thing, @amaltaro:
This is already in place. If you are simply changing the way it is done with this PR, that is another story, but this is definitely nothing new for MSUnmerged.
@todor-ivanov thank you for looking into this.
I need to update that comment, because what it is actually doing right now is:
I will update the PR description with this. I also need to revisit what was already in place and see the differences.
I don't think it makes sense to separate those developments. Having a generator object is of no help if we still load the data into memory, so implementing a generator and consuming it properly have to go together. Otherwise we face the same fate as the compressed RucioConMon feature, which was implemented but never adopted in MSUnmerged. That said, my goal was indeed just to consume a generator, but the memory footprint was still extremely high because of the parsing and assignment of directories and files to data structures. For the CPU and memory footprint, I actually performed an isolated study with RucioConMon only, which was discussed in this transient PR: I think it should be simple enough to run the current MSUnmerged with only those changes, versus those changes plus the modified MSUnmerged. I will try to get back to this next week.
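A toy illustration of that coupling (not MSUnmerged code): a generator only helps if every consumer stays lazy; materializing it brings the full dump back into memory.

```python
import io

def unmergedDump(istream):
    """Yield one LFN at a time instead of returning the whole dump."""
    for line in istream:
        yield line.strip()

dump = io.StringIO("/store/unmerged/a.root\n/store/unmerged/b.root\n")

# memory-friendly: only one record is alive at a time
for lfn in unmergedDump(dump):
    print(lfn)

# defeats the purpose: the entire dump gets materialized in memory anyway
dump.seek(0)
allLfns = list(unmergedDump(dump))
```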
All of this is already in place... I do not see a point in rewriting this logic... Well, if you like it your way, of course there is a point.
What needs to be done is:
Let me rephrase with some more details:
But, if we strictly insist on keeping the dependency on RucioConMon, fine: then we should download the file or stream it (our choice), parse it in a separate thread down to the level of granularity our service speaks, and record it in a shorter file, which would reduce the size from GBs to MBs. This logic is already in the current implementation of MSUnmerged. What I suggest here is simply to exchange one resource for another: swap memory for I/O expenses. The action item to achieve this with minimal effort would be to transform the MSUnmerged service into a producer/consumer mode, which we already know very well. The current code should not be difficult to split in two pieces without heavy logic rewriting: one thread to parse and preserve the file with reduced contents, and one thread to consume it, benefiting from all the possible resource optimizations that come out of this.
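A minimal, hypothetical sketch of that producer/consumer split, with an in-memory queue standing in for the reduced intermediate file (names and the directory-level reduction are purely illustrative):

```python
import io
import queue
import threading

workQueue = queue.Queue(maxsize=1000)

def producer(istream):
    """Parse the raw dump and push only reduced, directory-level entries downstream."""
    seen = set()
    for line in istream:
        dirPath = line.strip().rsplit("/", 1)[0]  # reduce a file LFN to its directory
        if dirPath not in seen:
            seen.add(dirPath)
            workQueue.put(dirPath)
    workQueue.put(None)  # sentinel: nothing left to produce

def consumer(deleteFunc):
    """Consume the reduced entries and perform the actual deletions."""
    while True:
        item = workQueue.get()
        if item is None:
            break
        deleteFunc(item)

dump = io.StringIO("/store/unmerged/x/1.root\n/store/unmerged/x/2.root\n")
threading.Thread(target=producer, args=(dump,)).start()
consumer(deleteFunc=print)  # print stands in for the real deletion call
```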
If we don't rely on RucioConMon, how do we know which directories have been created in which storage? Are you saying that the scanning logic implemented in Rucio should become part of WMCore? I hope not! Second, we have no choice but to go down to the file level once a directory fails to be deleted (which, many times, happens because it has too much content in it, from what I have seen in the logs). About your last bullet, I totally agree that we have no way to communicate issues with the sites. However, we are responsible for the MSUnmerged functionality, and everyone has to take up responsibility in this process.
We are already doing worse: once a deletion fails, we start this whole process recursively, with depth-first logic and remotely, which is a disaster. So what I suggest is to do this scan ourselves, but to stop at level -3, the level at which our services talk, and never dive deeper. This is a fairly fast and easy process, because at that level of the tree the amount of information is still in the lower (almost linear) part of the exponential; this is something we can confidently manage ourselves. And, as I said, we are already doing it, much much deeper, so this needs no proof of concept.
We only scan deeper, not the top-level directories. In addition, we only scan in case of failures, because otherwise there is no way we can delete files. Just to make sure we are speaking the same language, an example of a file to be deleted would be:
Are you saying that we need to list content up to
I feel like we are deviating from the original goal of this issue, which is:
We can have the MSUnmerged redesign discussion at another moment. For now, let us please try to resolve this one issue.
Yes, we have. It was announced that all sites are now supposed to be using WebDAV, which by the way has hit us heavily in the past, and exactly in that context I did some research. The outcome was that we should be able to confidently and rightfully ask for recursive deletions to be supported by the sites (similarly to how we ask for write permissions in the unmerged area by certificate role), and we should not own all the protocol-level complexities, all of which were supposed to be encapsulated and isolated from us by using
Your code change does not resolve an issue. With your code change you are already redesigning the service, but without having this discussion, which I intentionally triggered here.
I guess you didn't see the memory plots I provided in here then:
I will try to get the performance measurements you requested at some point next week. Thanks
about this:
I may be citing the wrong tree level (it could be -4 or close), because I do not remember it by heart. The point is, we should simply check the depth of the
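Whatever the agreed level turns out to be, the check itself could be as simple as counting path components below the unmerged base; the cut-off value below is a placeholder, not something settled in this thread:

```python
def lfnDepth(path, base="/store/unmerged"):
    """Return how many levels below the unmerged base directory a path sits."""
    relative = path[len(base):].strip("/")
    return len(relative.split("/")) if relative else 0

MAX_DEPTH = 3  # placeholder cut-off; the actual level (-3, -4, ...) is under discussion

lfn = "/store/unmerged/Era/PrimaryDataset/Tier/file.root"
print(lfnDepth(lfn), lfnDepth(lfn) <= MAX_DEPTH)  # -> 4 False
```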
Alan, I saw them. I have been running similar tests hundreds of times, as I said a few comments above. If we insist on keeping the dependency on RucioConMon, so be it, but let's separate it into a different thread. And running any test on the current code status would be speculative, because you did not isolate the changes in separate steps, so it would be difficult to attribute the results either to code optimizations or to changing the way the service is fed. So they would not be a justification for rewriting the rest of the logic of the service.
results = json.loads(results)
return results
with self.refreshCache(cachedApi, apiUrl, decoder=True, binary=False) as istream:
    results = istream.read()
NOTE: I don't know how much data istream reads, but here you have two potential memory spikes: one from reading the data from istream and another from parsing the JSON. Instead, I suggest using
return json.load(istream)
which performs both steps as one, i.e. JSON can read from the file object directly.
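As a sketch of the full method after that change (the class and method names are stand-ins; refreshCache and its arguments are copied from the diff above):

```python
import json

class RucioConMonLike(object):
    """Hypothetical stand-in for the class holding refreshCache (see the diff above)."""

    def getCachedApiData(self, cachedApi, apiUrl):
        """Fetch a cached API response and decode it in a single step."""
        with self.refreshCache(cachedApi, apiUrl, decoder=True, binary=False) as istream:
            # json.load() parses straight from the file-like object, combining
            # the read and parse steps as suggested above
            return json.load(istream)
```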
Fixes #12061
Status
In development
Description
The scope of this PR has inflated quite a lot, as I started just by investigating the adoption of compressed RucioConMon data into MSUnmerged. Summary of changes:
- MSUnmerged.filterUnmergedFiles method removed - its logic is now embedded in getUnmergedFiles and _isDeletable
- cleanRSE method reworked to first try deleting the whole directory; if that fails, list all the content in that directory and delete it by slices; finally, try to remove the (now) empty directory
- _listDir added to list root files (only) inside a directory
- test_gfal.py added to test directory removal with gfal (simulating similar behavior to MSUnmerged)
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
Gzipped support was added with this PR:
#11142
but never adopted by MSUnmerged. With this PR, we actually adopt it.
External dependencies / deployment changes
Gzipped data is not currently functional: