Resource request plugin should remove the classads when shutdown #200
Comments
Imported from GitLab: https://hepcloud-git.fnal.gov:8443/hepcloud/decisionengine_modules/issues/149 |
For history: Parag worked on this for a month or two while he was still here and never got it working. As the Decision Engine exits it should de-advertise its request classads, but there were problems trying to put all of that into the destructor method. Keep in mind that the module that advertises the classads is due to be totally reworked in the current phase of the project. |
Notes from a meeting we had:
Note: (1) is done. Working on (2) now. |
All three items are done, and pass all CI tests. Will send Steve build artifacts from CI when he's available to test. |
Passed build artifacts to Steve. Waiting on his testing at his availability. |
An attempt to test Shreyas' build artifacts failed due to problems that, as far as I can tell, are not related to any changes he made in the code. The decisionengine fails to advertise to the collector at all (with a value error). This is very similar to a bug we saw in the 1.4 version, and there is a chance that the origin may be the same. Testing the unmodified 1.5rc0 rpm (which had not yet been shown to work even without Shreyas' modification) to see if it |
After resolving the issue with the test machine fermicloud117, which had nothing to do with the code, I installed Shreyas' branch again. I started up the DE, it advertised classads to the factory just fine, and then I initiated a stop of the DE. The following appears in resource_request.log at the time of shutdown:
2021-01-11 12:36:50,361 - root - TaskManager - 19742 - MainThread - ERROR - error in decision cycle(publishers)
Nothing in this stack trace (or the surrounding debug logs) indicates that we were in the new shutdown method when this happened. Rather, it appears that the routine was called asynchronously at a time when some, if not all, of the data blocks needed to call the publisher may not have been available.
It may be helpful at this point to consult Marco Mambelli or one of the other frontend developers. This functionality does work in the glideinwms frontend, although the code path is entirely different: the DE uses the python bindings to condor_advertise, while the frontend does a system shell out to the condor_advertise binary.
Again I observed that it took 1600s for the classads to disappear from the collector, the same as if there had been no patch.
The next debugging effort (after talking to Marco) would be to add some debugging to see whether the shutdown method is being called by the framework at all. I have the test system configured at maximum debug level, and no message indicates that we ever attempted to call the shutdown method of the publisher in question. But there is also no proof that we didn't. Understanding the exception would also be key. |
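A minimal sketch of the kind of debugging mentioned above, so the log shows unambiguously whether the framework reached the publisher's shutdown method. Everything named here is hypothetical and for illustration only: SketchPublisher stands in for the real publisher, and invalidate_ads is an assumed callable that would wrap whatever call actually de-advertises the classads. It does not reflect the module's real code.

```python
import logging

logger = logging.getLogger("resource_request")


class SketchPublisher:
    """Hypothetical stand-in for the publisher that advertises request classads."""

    def __init__(self, invalidate_ads):
        # invalidate_ads is an assumed callable: it would send the invalidate
        # command to the collector and return the number of ads removed.
        self._invalidate_ads = invalidate_ads
        self.shutdown_called = False

    def shutdown(self):
        # Log on entry so the log proves the framework reached this method.
        logger.debug("Publisher.shutdown entered")
        self.shutdown_called = True
        try:
            removed = self._invalidate_ads()
        except Exception:
            # Record the full stack trace, then re-raise for the caller.
            logger.exception("de-advertise failed during shutdown")
            raise
        logger.debug("Publisher.shutdown finished, %d ads invalidated", removed)
        return removed
```

With an entry message and an exit message, a log that contains neither would show that the method was never called, and a log with only the entry message would point at the de-advertise call itself.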
Note the modified code remains on fermicloud117 and is available for further tests. |
Discussion in this morning's meeting led me to see that overnight (Jan 13) the code as written had the resource_request channel go offline due to a network error, and in that case it did call the shutdown method of the publisher; this is visible in the logs. So there may be a different program flow when systemctl stop decision-engine is called as opposed to when the framework decides to declare a channel offline. Since the problem happened in the middle of the night, I was not able to verify whether the classads were successfully de-advertised when the shutdown was called. |
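One way the two program flows could be made to converge, sketched under assumptions: by default systemctl stop delivers SIGTERM to the service, so a handler for that signal could call the same shutdown method that the channel-offline path already reaches. The function install_sigterm_shutdown and the publishers argument are hypothetical names for illustration; how the framework actually handles signals is not stated in this thread.

```python
import signal


def install_sigterm_shutdown(publishers):
    """Register a SIGTERM handler that calls shutdown() on each publisher.

    Hypothetical helper: publishers is any iterable of objects that
    expose a shutdown() method. Must be called from the main thread.
    """

    def _handler(signum, frame):
        for pub in publishers:
            try:
                pub.shutdown()
            except Exception:
                # Keep going so one failing publisher does not block the rest.
                continue

    # signal.signal returns the handler that was installed before ours.
    previous = signal.signal(signal.SIGTERM, _handler)
    return _handler, previous
```

The handler is returned so it can be invoked directly in a test without sending a real signal, and the previous handler is returned so it can be restored afterwards.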
2021-01-13 22:01:37,347 - root - TaskManager - 21943 - MainThread - ERROR - error in decision cycle(publishers) |
Note that when that channel failed, a printed line appeared in startup.log, "Called Publisher.shutdown", one for each of the 5 channels that went offline. I have restarted that DE now but saved the file as it was. |
Note also that the above resource_request.log shows that the deadvertise command didn't work, but only because the network was down and it couldn't actually reach the factory to do the advertise. |
Current status: This has been split off into two portions: one for the framework and one for the modules.