In 1.5.0rc0 (trunk) DE fails to advertise to collector with value error #302

Closed
StevenCTimm opened this issue Jan 11, 2021 · 9 comments
Labels
prj_testing Issue identified by HEPCloud Phase IV Integration Testing

Comments

@StevenCTimm
Contributor

No description provided.

@StevenCTimm
Contributor Author

2021-01-11 08:22:42,193 - root - publisher - 12376 - MainThread - INFO - Advertising glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-11 08:22:42,210 - root - publisher - 12376 - MainThread - ERROR - Error running UPDATE_AD_GENERIC for glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-11 08:22:42,210 - root - retry_function - 12376 - MainThread - ERROR - Error Function _condor_advertise giving up with Failed to advertise to collector after 1 retries
2021-01-11 08:22:42,210 - root - publisher - 12376 - MainThread - ERROR - Failed to publish
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 146, in publish_to_htcondor
    self.condor_advertise(ads, collector_host=collector)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 114, in condor_advertise
    self.nretries, self.retry_interval)
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/util/retry_function.py", line 36, in retry_wrapper
    raise e
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/util/retry_function.py", line 27, in retry_wrapper
    return f()
  File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 93, in _condor_advertise
    collector.advertise(ads, update_ad_command, True)
ValueError: Failed to advertise to collector

We had a similar problem early in testing the 1.4 series. I will try to dig up more details.
It is possible that one of the 1.4 patches didn't make it into trunk.
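
For anyone reading the traceback: the error surfaces through a retry wrapper, so the final ValueError is simply whatever the wrapped call raised on its last attempt. Below is a minimal sketch of that pattern, not the actual code in retry_function.py. The function body, the `log` parameter and `fake_advertise` are assumptions for illustration; only the names `retry_wrapper`, `nretries` and `retry_interval` appear in the traceback.

```python
import time


def retry_wrapper(f, nretries=1, retry_interval=2, log=print):
    """Call f() up to nretries times and re-raise the last error if all attempts fail.

    Hypothetical sketch: the signature is modelled on the names visible in the
    traceback, not copied from decisionengine_modules.
    """
    attempts = max(1, nretries)
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return f()
        except Exception as e:
            last_error = e
            if attempt < attempts:
                log(f"attempt {attempt} failed with {e}; retrying in {retry_interval}s")
                time.sleep(retry_interval)
    log(f"giving up with {last_error} after {attempts} retries")
    raise last_error


def fake_advertise():
    # Hypothetical stand-in for collector.advertise(...), which raised ValueError in the log.
    raise ValueError("Failed to advertise to collector")
```

Calling `retry_wrapper(fake_advertise, nretries=1, retry_interval=0)` re-raises the same ValueError. The wrapper adds no information of its own, so the underlying cause has to be found elsewhere, for example in the collector's logs.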

@StevenCTimm StevenCTimm added the prj_testing Issue identified by HEPCloud Phase IV Integration Testing label Jan 11, 2021
@StevenCTimm
Contributor Author

The issue with the 1.4 code was tracked in ticket #263. It is not immediately obvious to me whether this is the same problem we are having now.

@StevenCTimm
Contributor Author

I could do an easy test: revert to the 1.4 modules and see if I still have the same problem.
The publisher.py that was patched to fix the problem in 1.4 appears to be identical to the one in 1.5 trunk, so whatever changed, it was not in that file. Also, the hex dump of the glideclientglobal_manifests data block appears to be fine, although with so much hex garbage it's hard to know for sure.

@StevenCTimm
Contributor Author

OK, reverting to decisionengine-standard-library-1.4.2-1 (the same as what production runs) still gives me trouble.
This potentially points to an issue on fermicloud117 that is local in nature. I will revert the framework as well to be sure.

@StevenCTimm
Contributor Author

It turns out we have apparently been dealing with a GSI authentication failure all this time. I believe this can be fixed.
Once fetch-crl was enabled on the VM again, the VM was able to advertise (in the 1.4 revert).
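
Since the root cause was fetch-crl not running, a quick freshness check on the CRL files could catch this earlier. The following is only a sketch under stated assumptions: that CRLs are stored as `*.r0` files in a certificates directory (commonly /etc/grid-security/certificates, but check the local fetch-crl configuration), and that file modification time is an acceptable proxy for freshness. The function name `stale_crls` and the 24-hour threshold are made up for illustration; the authoritative check is the nextUpdate field inside each CRL.

```python
import time
from pathlib import Path


def stale_crls(cert_dir, max_age_hours=24, now=None):
    """Return names of CRL files (*.r0) in cert_dir not modified within max_age_hours.

    Hypothetical helper: the *.r0 suffix and the age threshold are assumptions,
    and mtime only approximates whether a CRL is still valid.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p.name for p in Path(cert_dir).glob("*.r0") if p.stat().st_mtime < cutoff
    )
```

For example, `stale_crls("/etc/grid-security/certificates")` would list the files that have not been refreshed in the last day. A long list suggests that fetch-crl is disabled or failing on that machine.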

Now going forward again to vanilla 1.5.0rc0 on both.

@StevenCTimm
Contributor Author

Vanilla 1.5.0rc0 also successfully advertises glideclientglobals now.
Submitting real jobs to make sure the glideclient classads are OK too.
Then I will kill the DE and see how long it takes the classads to go away in the vanilla, unmodified version.

@StevenCTimm
Contributor Author

Yes, the glideclient classads were also advertised correctly, and glideins were submitted by the factory. Now killing the DE.
The classads are supposed to go away in 15 minutes.
The timestamp is 1610387935 on the globals and 1610387939 on the glideclients,
so I expect them to go away 900s later.
It ended up being about 1600s later, but they are gone.
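
The arithmetic behind those numbers, as a small sketch. The two timestamps and the 900s lifetime come from the comment above. The 1600s figure is approximate, and measuring it from the classad timestamp (rather than from the moment the DE was killed) is an assumption. The helper names are for illustration only.

```python
from datetime import datetime, timezone

GLOBALS_TS = 1610387935     # timestamp on the glideclientglobal classads
CLIENTS_TS = 1610387939     # timestamp on the glideclient classads
EXPECTED_LIFETIME_S = 900   # 15 minutes
OBSERVED_LIFETIME_S = 1600  # approximate ("about 1600s later")


def expiry(ts, lifetime_s):
    """Epoch second at which a classad stamped at ts should disappear."""
    return ts + lifetime_s


def as_utc(ts):
    """Render an epoch timestamp as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


expected_globals = expiry(GLOBALS_TS, EXPECTED_LIFETIME_S)
expected_clients = expiry(CLIENTS_TS, EXPECTED_LIFETIME_S)
observed_globals = expiry(GLOBALS_TS, OBSERVED_LIFETIME_S)
overshoot_s = OBSERVED_LIFETIME_S - EXPECTED_LIFETIME_S

print("globals stamped at      ", as_utc(GLOBALS_TS))
print("globals expected gone at", as_utc(expected_globals))
print("globals observed gone ~ ", as_utc(observed_globals))
print("overshoot in seconds    ", overshoot_s)
```

Under those assumptions the classads outlived their expected lifetime by roughly 700s.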

@StevenCTimm
Contributor Author

Next, I will test Shreyas' patch from issue #200 again. It appears that this whole issue was just due to a misconfiguration on the test machine, fermicloud117.

@StevenCTimm
Contributor Author

Closing this--1.5.0rc0 is fine.
