Illegal file names in DBS? #656

Open
belforte opened this issue Sep 16, 2021 · 9 comments


@belforte
Member

Hi @yuyiguo
I have found an odd thing chasing some CRAB Publisher issue.
There are jobs which processed this dataset
/EmbeddingRun2017E/ElMuFinalState-inputDoubleMu_94X_miniAOD-v2/USER from phys03

When inserting the outputs in DBS, I get an exception from DBS when looking up the jobs' parent file names because those LFNs end with the underscore character _, i.e. calls to the DBS API return this error:

 globalApi.listBlocks(logical_file_name='/store/user/belfo/a/b/c/d.root_')
*** HTTPError: HTTP Error 400: Invalid Input Data /store/use...: Not Match Required Format

while this works (of course this file is not present, so it returns an empty list)

globalApi.listBlocks(logical_file_name='/store/user/belfo/a/b/c/d.root')
[]

So, if the underscore at the end is illegal, how could those file names get into DBS to begin with?
see:
https://cmsweb.cern.ch/das/request?instance=prod/phys03&input=file+dataset%3D%2FEmbeddingRun2017E%2FElMuFinalState-inputDoubleMu_94X_miniAOD-v2%2FUSER

Is this a shortcoming in WMCore's Lexicon, or some stricter checking in the DBS list API?
In CRAB we always validate user LFNs with Lexicon before attempting to insert them in DBS, but I cannot find clear confirmation that this dataset was put in phys03 by CRAB: e.g. there is a single block with 15K files, while CRAB always has a limit of 100 files per block.

@belforte
Member Author

belforte commented Sep 16, 2021

I tried to check whether Lexicon would have spotted this, and much to my surprise it accepts the file name with _ at the end. I tried to dig and found the following, which I can't make sense of: the string candidate ends with an underscore, but when checked against regexp4, which ends with .root, check returns True. I am quite puzzled, as I would have expected this to fail, but I surely do not know regexps...

>>> from WMCore.Lexicon import check
>>> check(regexp4, candidate)
True
>>> regexp4
'/store/(temp/)*(user|group)/(([a-zA-Z0-9\\.]+)|([a-zA-Z0-9\\-_]+))/([a-zA-Z][a-zA-Z0-9\\-_]*)/(([a-zA-Z0-9\\-_]+)/)+([a-zA-Z0-9\\-_]+).root'
>>> candidate
'/store/user/belfo/a/b/c/d.root_'
>>> check(regexp4, candidate)
True
>>> 

ref.
https://github.com/dmwm/WMCore/blob/bb573b442a53717057c169b05ae4fae98f31063b/src/python/WMCore/Lexicon.py#L359
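
The behaviour in the session above can be reproduced with Python's standard `re` module alone. This is a minimal sketch, assuming that `check` relies on `re.match`-style matching (an assumption, not verified against the Lexicon source here): `re.match` only requires the pattern to match at the start of the string, so anything after `.root` is ignored, while `re.fullmatch` requires the whole string to match.

```python
import re

# Pattern copied from the session above, written as a raw string.
LFN_PATTERN = (
    r'/store/(temp/)*(user|group)/(([a-zA-Z0-9\.]+)|([a-zA-Z0-9\-_]+))/'
    r'([a-zA-Z][a-zA-Z0-9\-_]*)/(([a-zA-Z0-9\-_]+)/)+([a-zA-Z0-9\-_]+).root'
)
candidate = '/store/user/belfo/a/b/c/d.root_'

# re.match anchors only at the beginning: a matching prefix is enough.
prefix_match = re.match(LFN_PATTERN, candidate)
# re.fullmatch requires the entire string to match the pattern.
full_match = re.fullmatch(LFN_PATTERN, candidate)

print(prefix_match.group(0))  # /store/user/belfo/a/b/c/d.root
print(full_match)             # None
```

The trailing underscore is simply left unmatched by `re.match`, which is why the candidate is accepted.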

@belforte
Member Author

In any case, it does not seem right that DBS refuses a lookup when the search string is a file name that is stored in it.

@dan131riley

Probably should be

^/store/(temp/)*(user|group)/(([a-zA-Z0-9\\.]+)|([a-zA-Z0-9\\-_]+))/([a-zA-Z][a-zA-Z0-9\\-_]*)/(([a-zA-Z0-9\\-_]+)/)+([a-zA-Z0-9\\-_]+).root$

i.e. anchor the beginning and end of the string with ^ and $.
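
A small sketch of what the anchored pattern accepts and rejects, using only the standard `re` module. The helper names `is_valid` and `is_valid_strict` are hypothetical and not part of WMCore. Two side observations that go beyond the suggestion above: in Python `$` also matches just before a trailing newline, and the `.` in front of `root` is unescaped, so it matches any character. The stricter variant (`re.fullmatch` plus an escaped dot) is shown only for comparison, not as a proposed change.

```python
import re

# Everything before the ".root" suffix, as in the suggested pattern.
BODY = (
    r'/store/(temp/)*(user|group)/(([a-zA-Z0-9\.]+)|([a-zA-Z0-9\-_]+))/'
    r'([a-zA-Z][a-zA-Z0-9\-_]*)/(([a-zA-Z0-9\-_]+)/)+([a-zA-Z0-9\-_]+)'
)
ANCHORED = '^' + BODY + '.root$'   # the suggestion above
STRICT = BODY + r'\.root'          # hypothetical stricter variant


def is_valid(lfn):
    """Check an LFN against the anchored pattern (hypothetical helper)."""
    return re.match(ANCHORED, lfn) is not None


def is_valid_strict(lfn):
    """Whole-string match with a literal dot before 'root' (hypothetical helper)."""
    return re.fullmatch(STRICT, lfn) is not None


base = '/store/user/belfo/a/b/c/'
print(is_valid(base + 'd.root'))           # True
print(is_valid(base + 'd.root_'))          # False: the anchor rejects the trailing _
print(is_valid(base + 'd_root'))           # True: the unescaped . matches the _
print(is_valid(base + 'd.root\n'))         # True: $ matches before a final newline
print(is_valid_strict(base + 'd_root'))    # False
print(is_valid_strict(base + 'd.root\n'))  # False
```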

@belforte
Member Author

Sounds right, @dan131riley. I doubt the lack of anchors was intentional. What do you think, @amaltaro: any danger in making Lexicon really do what it was meant to do?
@yuyiguo, it may not be optimal that DBS has an LFN format validation independent of Lexicon, and that the check on LFNs used for reading is different from the check on LFNs written to the DB. But I do not think this should be changed right now. On the other hand, given that the cat is out of the bag, can you consider relaxing the checks in the server to accept the LFNs which are already in the system?
Anyhow, I am working around this for the time being by skipping insertion of parent LFNs when the DBS lookup for them throws an exception.

@amaltaro
Contributor

I totally agree that we should add $ to the end of that regular expression. @belforte please let me know if you want me to take care of it tomorrow or if you will.

Regarding read operations, I did not know that regex checks were enforced there as well. Perhaps it's required to accept some "wildcards" in the user read calls? Otherwise, maybe this is something that we could discuss for the future generation and see whether it can be relaxed indeed.

@yuyiguo
Member

yuyiguo commented Sep 16, 2021

I was quite busy today and have a lot of unread emails. Off the top of my head, I don't remember the DBS reader and writer using different checks. The only one we use is the Lexicon shared by DMWM. I will check the code.

@yuyiguo
Member

yuyiguo commented Sep 17, 2021

I checked the DBS code. The reader has a much more relaxed check than the writer, because we introduced the common Lexicon much later than the time CMS data was first recorded, and the DBS reader has to be able to read the old data. Here is what we have in the reader LFN check:

def reading_lfn_check(candidate):

The CMS LFN should have the format defined here: https://github.com/dmwm/WMCore/blob/bb573b442a53717057c169b05ae4fae98f31063b/src/python/WMCore/Lexicon.py#L347
However, the Lexicon has a bug that prevents it from enforcing the format it defines. Dan already pointed out that it should anchor the beginning and end of the string with ^ and $.
So it is not a problem with DBS as @belforte thought. I think that we should fix the Lexicon. I will check the DB to see how many files have names ending in root_. I am not sure what you want to do with these files.

@belforte
Member Author

Thanks @yuyiguo. I remember that the rules for reading had to be more relaxed; that's why I was surprised that some LFNs could be present in DBS yet not usable for reading.
The origin of the problem is understood, but it will be a while until Lexicon is changed and a new version of DBS Server is deployed with it.
An issue should be opened in WMCore for fixing the Lexicon, but I suspect it will only happen as part of the larger Lexicon reformatting that @vkuznet is doing. @amaltaro, can you follow up with @vkuznet on this?

I suspect there is no solution for the files with bad names. I suggest to leave them as they are. I have now changed CRAB Publisher code so that it skips parent files which can't be found in DBS. And if a user complains that parents have not been recorded, we'll know what to say.
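
The workaround described above could look roughly like the following. This is a sketch only, not the actual CRAB Publisher code: `split_parents` and `fake_lookup` are hypothetical names, and `fake_lookup` is a stand-in for the real DBS call so that the example runs without a DBS server.

```python
def split_parents(parent_lfns, lookup_blocks):
    """Separate parent LFNs that the lookup finds from those to be skipped.

    lookup_blocks is any callable taking an LFN and returning a list of
    blocks (hypothetical interface). It may raise if the LFN is rejected.
    """
    found, skipped = [], []
    for lfn in parent_lfns:
        try:
            blocks = lookup_blocks(lfn)
        except Exception as exc:
            # Lookup rejected the name (e.g. format check): skip this parent.
            skipped.append((lfn, str(exc)))
            continue
        if blocks:
            found.append(lfn)
        else:
            # Lookup worked but the file is not known: skip this parent too.
            skipped.append((lfn, 'not found'))
    return found, skipped


def fake_lookup(lfn):
    """Stand-in for the DBS lookup: rejects names ending in an underscore."""
    if lfn.endswith('_'):
        raise ValueError('Invalid Input Data: Not Match Required Format')
    known = {'/store/user/belfo/a/b/c/good.root': ['block#1']}
    return known.get(lfn, [])


found, skipped = split_parents(
    ['/store/user/belfo/a/b/c/good.root',
     '/store/user/belfo/a/b/c/d.root_',
     '/store/user/belfo/a/b/c/missing.root'],
    fake_lookup,
)
print(found)
print([lfn for lfn, reason in skipped])
```

Keeping the reason alongside each skipped LFN makes it easier to answer a user who asks why some parents were not recorded.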

@yuyiguo
Member

yuyiguo commented Sep 17, 2021

Ok, @belforte .
