Decision Engine requests Glideins in the factory despite jobs being completed #490

Open
namrathaurs opened this issue Jan 23, 2024 · 1 comment

Comments

@namrathaurs
Contributor

This was first observed during activity on a Decision Engine (DE) instance that talks to an ITB factory (798, running GlideinWMS 3.10.5-1). Glideins were being requested even though the job submitted by the DE had completed. I verified that the requests were not caused by jobs remaining in the job queues of either the DE client in question or any other client. A request from a client includes two numbers, ReqMaxGlideins and ReqIdleGlideins, which are of interest for understanding the underlying behavior. Further investigation of the glideclient and glidefactoryclient classads showed the following:

  1. When jobs were submitted and were present in the DE queue — condor_q shows jobs in running and idle state (5 processes submitted and each one has a sleep for 10 minutes):
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "4a098979b299b9c2cacf843ac6d3e4a4a34384b1f24b0189e4d2e7ec6d52d5b2dee40e62c3942db795ea32f53be8d9c4"
ReqEncKeyCode = "64e12146066e3efbe3480fe726ecc4cd0fd196cd712640c7c1cf96a87b56dbdb8ec6c35b6f28c68f551ea639b2642532240422397730765e95bebfb8e0e843bfe0b964aac909c31f5365e586f6d2aef3ea93bebe9d2f9abba786c0bef344484c6e128c06b881a1d31e31a3c01bf782780aaf52afa7c02238c379fb32b7f8a35dd11ae7b534a03f7b689bdf795d5be339457a77555fb75998d838524d0203268e0400d861b1a00bcffd3881fe76ddba3a9e864b37618957ef87f052bac6aeda07ff445bc7af791ed921a237c9859120125c69b7613e9c0fa462c7f4649757e05f7e8cdd2508bc06059aace4642ae3bb1aa40db54d34ae2496104383f26d5ac124"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 6
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 6
GlideinMonitorRequestedMaxGlideins = 6
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 6
GlideinMonitorTotalRequestedMaxGlideins = 6
  2. When 2 jobs had completed and 3 were still running in the DE:
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 2
GlideinMonitorRequestedMaxGlideins = 2
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 2
GlideinMonitorTotalRequestedMaxGlideins = 2
  3. When all submitted jobs in the DE had completed (condor_q showed an empty DE queue):
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 2
GlideinMonitorRequestedMaxGlideins = 2
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 2
GlideinMonitorTotalRequestedMaxGlideins = 2

After excessively requesting glideins, the glideclient classad eventually vanishes from the factory, after which no more glideins are requested. This makes sense: the classad disappears when it expires, and with no classad present there is nothing left to drive requests.
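
The three snapshots above can be compared mechanically. The following is a minimal Python sketch, not part of DE or GlideinWMS: it parses the integer-valued lines of the `condor_status ... -l | grep Req` output and flags the case where glideins are still requested while the job queue is empty. The helper names `parse_int_attrs` and `stale_request` are hypothetical, and the assumption that an empty queue should yield `ReqIdleGlideins = 0` reflects the conclusion of the weekly DE meeting, not a documented rule.

```python
import re

# Matches ClassAd lines of the form `Name = <integer>`; quoted string values
# (e.g. ReqGlidein = "...") and the GSI warning lines do not match.
_ATTR_RE = re.compile(r"^(\w+)\s*=\s*(-?\d+)\s*$")


def parse_int_attrs(text):
    """Return {attribute: int} for the integer-valued ClassAd lines in text."""
    attrs = {}
    for line in text.splitlines():
        match = _ATTR_RE.match(line.strip())
        if match:
            attrs[match.group(1)] = int(match.group(2))
    return attrs


def stale_request(attrs, jobs_in_queue):
    """True if idle glideins are still requested although the queue is empty.

    Assumption: with no jobs queued, ReqIdleGlideins is expected to be 0.
    """
    return jobs_in_queue == 0 and attrs.get("ReqIdleGlideins", 0) > 0


# Values copied from the three glideclient snapshots in this report.
snapshots = [
    ("5 jobs in queue", 5, "ReqIdleGlideins = 1\nReqMaxGlideins = 6"),
    ("2 completed, 3 running", 3, "ReqIdleGlideins = 1\nReqMaxGlideins = 2"),
    ("queue empty", 0, "ReqIdleGlideins = 1\nReqMaxGlideins = 2"),
]

for label, jobs, text in snapshots:
    attrs = parse_int_attrs(text)
    status = "STALE REQUEST" if stale_request(attrs, jobs) else "ok"
    print(f"{label}: {attrs} -> {status}")
```

Only the third snapshot is flagged: the request values are unchanged from the second snapshot even though no jobs remain.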

The same behavior, with glideins being requested even though the DE job queue is empty, was also observed in two other cases:

  • On a new setup of my DE development instance, multiple glideins were excessively requested and ran in the factory, and all sources except source1 went offline. Only test_channel remained steady, while resource_request went offline. This further suggested that the problem is on the DE side.
  • I reached out to Vito to ask whether he had seen something similar, since he was running tests on the DE daily. He confirmed that he has been seeing exactly this behavior: new glideins show up even after his jobs have completed. This was reported at the weekly DE meeting just before winter break to find out whether it was expected behavior. The conclusion was that it is not acceptable: if there are no further jobs to run, the glideins queue up and go to waste because they are never used.
@namrathaurs
Contributor Author

I discussed my findings and observations with Marco. His input was as follows:

  • Glidein removal in Decision Engine is based on a version of the frontend logic.
  • From my observations, this seems to be a problem on the DE side and not the factory. The factory requests glideins based on the client’s requests, so glideins being requested while the DE job queue is empty is not a factory/GlideinWMS issue. Since the DE is the one making these requests, the fault is likely in the DE.
  • A configuration option exists that could be used to address this. DE has an option to define how glidein removal should happen. There is also a parameter named reserve that overcommits the number of glideins [link here]. The tag named idle_glideins_per_entry has documentation for max but not for reserve.

Initially, we thought this might be a bug in the GlideinWMS factory. After the investigation, it seems more likely that the DE is requesting more glideins than needed. One suggestion was to thoroughly review the DE code to avoid this scenario.
