Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource request plugin should remove the classads when shutdown #200

Open
DmitryLitvintsev opened this issue Apr 1, 2020 · 13 comments
Open
Assignees
Labels
modules_req Issue that will be handled in the course of the project modules work

Comments

@DmitryLitvintsev
Copy link
Contributor

No description provided.

@DmitryLitvintsev DmitryLitvintsev self-assigned this Apr 1, 2020
@DmitryLitvintsev
Copy link
Contributor Author

@StevenCTimm
Copy link
Contributor

The history of this is that this is something Parag worked on for a month or two while he was still here and never got it working. As the Decision Engine exits it should de-advertise its request classads but there were some problems trying to put all of that into the destructor method. This should be kept in mind--the module that advertises the classads is due to be totally reworked in the current phase of the project.

@StevenCTimm StevenCTimm added bug Something isn't working modules_req Issue that will be handled in the course of the project modules work and removed bug Something isn't working labels Jun 17, 2020
@shreyb shreyb self-assigned this Oct 8, 2020
@shreyb
Copy link
Contributor

shreyb commented Nov 3, 2020

Notes from a meeting we had:

  • Two glideinwms modules are using the HTCondorManifests class that currently has a destructor method (del) defined here: https://github.com/HEPCloud/decisionengine_modules/blob/master/htcondor/publishers/publisher.py#L41

  • Steve Timm has seen that this del method hasn't been called consistently

  • Marc Paterno explained that the del method on an object is only called during garbage collection, which the python runtime will choose to run automatically. There is no way to kick this off manually (aside from calling object.del() which is not pythonic). This is probably why the del method is never getting called at channel shutdown currently.

  • Instead, it's better to use a shutdown function that can be called by the code directly (Dmitry favors this), or wrap invocations of module classes in context managers.

  • Channel shutdown is controlled by the TaskManager. Something tells the TaskManager that a channel must be shut down, and it performs the necessary operations and communications to do so.

  • In DecisionEngine.py, the DecisionEngine.stop_channel() method (https://github.com/HEPCloud/decisionengine/blob/master/framework/engine/DecisionEngine.py#L358) calls the method take_offline to tell a TaskManager to shut down.

  • The TaskManager itself runs two loops - it pulls data from the Sources, and executes the decision_cycle. If it is told to stop (ProcessingState.should_stop() is set appropriately (https://github.com/HEPCloud/decisionengine/blob/master/framework/taskmanager/ProcessingState.py)), the TaskManager will terminate.

    The idea (proposed by Kyle) for implementing what we need is thus three-fold:

  1. At the framework level, the Publisher class (https://github.com/HEPCloud/decisionengine/blob/master/framework/modules/Publisher.py) should have a new method defined, called "shutdown", that can be overridden by child classes
  2. In the TaskManager code, once a TaskManager sees that it is supposed to shut down, it should finish its loops, and then loop through all of its publishers and run this new shutdown() method on all of them (probably at line 181 of TaskManager.py).
  3. We can then define shutdown methods on any inheriting Publishers that need them (in the immediate case, replacing the del method of the HTCondorManifests class mentioned above)

Note: (1) is done. Working on (2) now.

@shreyb
Copy link
Contributor

shreyb commented Nov 17, 2020

All three items are done, and pass all CI tests. Will send Steve build artifacts from CI when he's available to test.

@shreyb
Copy link
Contributor

shreyb commented Nov 19, 2020

Passed build artifacts to Steve. Waiting on his testing at his availability.

@StevenCTimm
Copy link
Contributor

Attempted to test Shreyas' build artifacts failed due to problems that as far as I can tell are not related to any changes he made in the code. The decisionengine fails to advertise to the collector at all (with a value error) . This is very similar to a bug we saw in the 1.4 version and there is a chance that the origin may be the same.

Testing the unmodified 1.5rc0 rpm (which had not yet been shown to work even without Shreyas' modification) to see if it
has the same problem. If it does I will open a new issue.

@StevenCTimm
Copy link
Contributor

After resolving the issue with the test machine fermicloud117, which had nothing to do with the code, I installed Shreyas' branch again. started up the DE, it advertised classads to the factory just fine, then initiated a stop of the DE.
Classads on their own take 1600s to go away after the DE is gone. With the new patch we see:
last heard from timestamp is 1610390010 for the globals and 1610390014 for the glide clients.
DE was off at 1610390239

The following shows from the resource_request.log at the time of shutdown:

2021-01-11 12:36:50,361 - root - TaskManager - 19742 - MainThread - ERROR - error in decision cycle(publishers)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 283, in decision_cycle
a_f['actions'], a_f['newfacts'], data_block_t1)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 480, in run_publishers
publisher.worker.publish(data_block)
File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 126, in publish
dataframe = datablock.get(key)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 271, in get
return self.getitem(key, default=default)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in getitem
raise KeyError
KeyError

There is nothing from this stack trace (or the surrounding debug logs) that indicates that we were in the new shutdown method when this happened. Rather it appears that the routine was called asynchronously at a time when some if not all of the data blocks needed to call the publisher may not be available.

It may be helpful at this point to consult Marco Mambelli or one of the other frontend developers. This functionality does work in the glideinwms frontend, although the code path is entirely different since the DE uses the python bindings to condor_advertise and the frontend does a system shell out to the condor_advertise binary.

Again I observed that it took 1600s for the classads to go away out of the collector, same as if there had been no patch.

Next debugging effort, (after talking to Marco) would be to add some debugging to see if the shutdown method is even being called by the framework at all. I have the test system configured at maximum debug level and there is no indication from any of the messages that we ever attempted to call the shutdown method of the publisher in question. But there is also no proof that we didn't. Also understanding the exception would be key.

@StevenCTimm
Copy link
Contributor

Note the modified code remains on fermicloud117 and is available for further tests.

@StevenCTimm
Copy link
Contributor

Discussion in this morning's meeting led me to see that overnight (Jan 13) the code as written had the resource_request channel go offline due to a network error, and in that case it did call the shutdown method of the publisher, you could see it in the logs. So there may be a different program flow when systemctl stop decision-engine is called as opposed to when the framework decides to declare a channel offline. Since the problem happened in the middle of the night I was not able to verify if the classads were successfully de-advertised when the shutdown was called.

@StevenCTimm
Copy link
Contributor

2021-01-13 22:01:37,347 - root - TaskManager - 21943 - MainThread - ERROR - error in decision cycle(publishers)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 283, in decision_cycle
a_f['actions'], a_f['newfacts'], data_block_t1)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 480, in run_publishers
publisher.worker.publish(data_block)
File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 126, in publish
dataframe = datablock.get(key)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 271, in get
return self.getitem(key, default=default)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in getitem
raise KeyError
KeyError
2021-01-13 22:01:37,365 - root - TaskManager - 21943 - MainThread - INFO - Task Manager 30F3EB74-288A-474D-BB50-4533A07EF929 received stop signal and exits
2021-01-13 22:01:37,365 - root - TaskManager - 21943 - AWSBurnRateSourceProxy - INFO - received stop_running signal for AWSBurnRateSourceProxy
2021-01-13 22:01:37,365 - root - TaskManager - 21943 - AWSBurnRateSourceProxy - INFO - stopped AWSBurnRateSourceProxy
2021-01-13 22:01:42,375 - root - TaskManager - 21943 - BillingInfoSourceProxy - INFO - received stop_running signal for BillingInfoSourceProxy
2021-01-13 22:01:42,376 - root - TaskManager - 21943 - BillingInfoSourceProxy - INFO - stopped BillingInfoSourceProxy
2021-01-13 22:01:47,380 - root - TaskManager - 21943 - FigureOfMeritSourceProxy - INFO - received stop_running signal for FigureOfMeritSourceProxy
2021-01-13 22:01:47,382 - root - TaskManager - 21943 - FigureOfMeritSourceProxy - INFO - stopped FigureOfMeritSourceProxy
2021-01-13 22:01:52,381 - root - TaskManager - 21943 - FinancialParameters - INFO - received stop_running signal for FinancialParameters
2021-01-13 22:01:52,382 - root - TaskManager - 21943 - FinancialParameters - INFO - stopped FinancialParameters
2021-01-13 22:01:57,386 - root - TaskManager - 21943 - GCEBillingInfoSourceProxy - INFO - received stop_running signal for GCEBillingInfoSourceProxy
2021-01-13 22:01:57,387 - root - TaskManager - 21943 - GCEBillingInfoSourceProxy - INFO - stopped GCEBillingInfoSourceProxy
2021-01-13 22:02:02,091 - root - TaskManager - 21943 - JobClusteringSourceProxy - INFO - stopped JobClusteringSourceProxy
2021-01-13 22:02:02,391 - root - TaskManager - 21943 - GceBurnRateSourceProxy - INFO - received stop_running signal for GceBurnRateSourceProxy
2021-01-13 22:02:02,391 - root - TaskManager - 21943 - GceBurnRateSourceProxy - INFO - stopped GceBurnRateSourceProxy
2021-01-13 22:02:07,391 - root - TaskManager - 21943 - GceFigureOfMeritSourceProxy - INFO - received stop_running signal for GceFigureOfMeritSourceProxy
2021-01-13 22:02:07,392 - root - TaskManager - 21943 - GceFigureOfMeritSourceProxy - INFO - stopped GceFigureOfMeritSourceProxy
2021-01-13 22:02:12,397 - root - TaskManager - 21943 - NerscAllocationInfoSourceProxy - INFO - received stop_running signal for NerscAllocationInfoSourceProxy
2021-01-13 22:02:12,398 - root - TaskManager - 21943 - NerscAllocationInfoSourceProxy - INFO - stopped NerscAllocationInfoSourceProxy
2021-01-13 22:02:17,402 - root - TaskManager - 21943 - NerscFigureOfMeritSourceProxy - INFO - received stop_running signal for NerscFigureOfMeritSourceProxy
2021-01-13 22:02:17,404 - root - TaskManager - 21943 - NerscFigureOfMeritSourceProxy - INFO - stopped NerscFigureOfMeritSourceProxy
2021-01-13 22:02:22,407 - root - TaskManager - 21943 - FactoryEntriesSourceProxy - INFO - received stop_running signal for FactoryEntriesSourceProxy
2021-01-13 22:02:22,408 - root - TaskManager - 21943 - FactoryEntriesSourceProxy - INFO - stopped FactoryEntriesSourceProxy
2021-01-13 22:02:27,413 - root - TaskManager - 21943 - FactoryGlobalManifests - INFO - received stop_running signal for FactoryGlobalManifests
2021-01-13 22:02:27,413 - root - TaskManager - 21943 - FactoryGlobalManifests - INFO - stopped FactoryGlobalManifests
2021-01-13 22:02:37,421 - root - TaskManager - 21943 - JobQSourceProxy - INFO - received stop_running signal for JobQSourceProxy
2021-01-13 22:02:37,422 - root - TaskManager - 21943 - JobQSourceProxy - INFO - stopped JobQSourceProxy
2021-01-13 22:02:42,427 - root - TaskManager - 21943 - StartdManifestsSourceProxy - INFO - received stop_running signal for StartdManifestsSourceProxy
2021-01-13 22:02:42,428 - root - TaskManager - 21943 - StartdManifestsSourceProxy - INFO - stopped StartdManifestsSourceProxy
2021-01-13 22:02:57,443 - root - publisher - 21943 - MainThread - INFO - Called HTCondorManifests.shutdown
2021-01-13 22:02:57,443 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclient classads from collector_host cmssrv258.fnal.gov with constraint (glideinmytype == "glideclient") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough"))
2021-01-13 22:03:33,529 - root - publisher - 21943 - MainThread - INFO - Advertising glideclient classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:03:33,531 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclient classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:03:33,532 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds
2021-01-13 22:04:09,607 - root - publisher - 21943 - MainThread - INFO - Advertising glideclient classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:04:09,608 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclient classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:04:09,608 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries
2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclient classads from collector_host cmssrv258.fnal.gov
2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - INFO - Called HTCondorManifests.shutdown
2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclientglobal classads from collector_host cmssrv258.fnal.gov with constraint (glideinmytype == "glideclientglobal") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_nersc_passthrough_sl7,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_sdsc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_xsede_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_wilson_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl6,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl7"))
2021-01-13 22:04:45,691 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:04:45,693 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:04:45,694 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds
2021-01-13 22:05:21,775 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:05:21,777 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host cmssrv258.fnal.gov
2021-01-13 22:05:21,777 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries
2021-01-13 22:05:21,778 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclientglobal classads from collector_host cmssrv258.fnal.gov
2021-01-13 22:05:21,778 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclientglobal classads from collector_host gfactory-2.opensciencegrid.org with constraint (glideinmytype == "glideclientglobal") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_nersc_passthrough_sl7,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_sdsc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_xsede_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_wilson_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl6,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl7"))
2021-01-13 22:05:57,851 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org
2021-01-13 22:05:57,853 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org
2021-01-13 22:05:57,853 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds
2021-01-13 22:06:33,935 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org
2021-01-13 22:06:33,937 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org
2021-01-13 22:06:33,938 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries
2021-01-13 22:06:33,938 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclientglobal classads from collector_host gfactory-2.opensciencegrid.org

@StevenCTimm
Copy link
Contributor

Note that when that channel failed there was a print statement in startup.log
already saying

Called Publisher.shutdown
Called Publisher.shutdown
Called Publisher.shutdown

One for each of the 5 channels that went offline.

I have restarted that DE now but saved the file as it was.

@StevenCTimm
Copy link
Contributor

Note also that the above resource_request.log shows that the deadvertise command didn't work, but only because the network was down and it couldn't actually reach the factory to do the advertise.

@shreyb
Copy link
Contributor

shreyb commented May 18, 2021

Current status: This has been split off into two portions: one for the framework and one for the modules.

  1. Framework - DONE We implemented a new method in the Publisher base class, called shutdown, which is meant to run any actions necessary before a channel is actually stopped. This is invoked in the TaskManager running a channel via a set_to_shutdown method that runs a TaskManager's channel's shutdown methods. The set_to_shutdown method is called from the DecisionEngine when a channel is stopped. This code has been merged into 1.6 (decisionengine PR Fix coding error #334), and after a quick rebase and fix to logging, will have a PR.

  2. Modules - IN PROGRESS The aforementioned shutdown method is correctly called when the resource_request channel is stopped, but the actual deadvertising doesn't happen. We'll need to continue to investigate why this is the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
modules_req Issue that will be handled in the course of the project modules work
Projects
None yet
Development

No branches or pull requests

3 participants