This was first observed during activity on a Decision Engine (DE) instance that talks to an ITB factory (798, running GlideinWMS 3.10.5-1). Glideins kept being requested even though the job submitted through the DE had completed. I verified that the requests were not caused by jobs sitting in the job queues of either the DE client in question or any other client. A request coming from the client includes two numbers, ReqMaxGlideins and ReqIdleGlideins, which are of interest for understanding the underlying behavior. Further investigation of the glideclient and glidefactoryclient classads showed the following:
When jobs were submitted and present in the DE queue (condor_q showed jobs in running and idle states; 5 processes were submitted, each sleeping for 10 minutes):
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "4a098979b299b9c2cacf843ac6d3e4a4a34384b1f24b0189e4d2e7ec6d52d5b2dee40e62c3942db795ea32f53be8d9c4"
ReqEncKeyCode = "64e12146066e3efbe3480fe726ecc4cd0fd196cd712640c7c1cf96a87b56dbdb8ec6c35b6f28c68f551ea639b2642532240422397730765e95bebfb8e0e843bfe0b964aac909c31f5365e586f6d2aef3ea93bebe9d2f9abba786c0bef344484c6e128c06b881a1d31e31a3c01bf782780aaf52afa7c02238c379fb32b7f8a35dd11ae7b534a03f7b689bdf795d5be339457a77555fb75998d838524d0203268e0400d861b1a00bcffd3881fe76ddba3a9e864b37618957ef87f052bac6aeda07ff445bc7af791ed921a237c9859120125c69b7613e9c0fa462c7f4649757e05f7e8cdd2508bc06059aace4642ae3bb1aa40db54d34ae2496104383f26d5ac124"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 6
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 6
GlideinMonitorRequestedMaxGlideins = 6
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 6
GlideinMonitorTotalRequestedMaxGlideins = 6
When there were 2 completed jobs and 3 were in running state in the DE:
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 2
GlideinMonitorRequestedMaxGlideins = 2
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 2
GlideinMonitorTotalRequestedMaxGlideins = 2
When the submitted jobs in the DE had completed (condor_q showed an empty DE queue):
# From the glideclient classad:
[root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
# From the glidefactoryclient classad:
[root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@[email protected]_test -l | grep Req
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
GlideinMonitorRequestedIdle = 1
GlideinMonitorRequestedIdleCores = 1
GlideinMonitorRequestedMaxCores = 2
GlideinMonitorRequestedMaxGlideins = 2
GlideinMonitorTotalRequestedIdle = 1
GlideinMonitorTotalRequestedIdleCores = 1
GlideinMonitorTotalRequestedMaxCores = 2
GlideinMonitorTotalRequestedMaxGlideins = 2
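The comparison across the three snapshots above can be automated. The following is a minimal sketch, not part of DE or GlideinWMS: the helper names `parse_classad` and `is_stale_request` are hypothetical, and it assumes the text comes from `condor_status -l` output in the `Attr = value` form shown above. It flags the case reported here, where ReqIdleGlideins stays above zero while the DE queue is empty.

```python
import re

# Matches simple `Attr = value` lines; warning and prompt lines do not match.
ATTR_LINE = re.compile(r'^(\w+)\s*=\s*(.+)$')


def parse_classad(text):
    """Parse `Attr = value` lines into a dict, converting integers where possible."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line.strip())
        if not match:
            continue  # skip WARNING lines, shell prompts, blank lines
        key, raw = match.group(1), match.group(2).strip()
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            attrs[key] = raw[1:-1]  # quoted string attribute
        else:
            try:
                attrs[key] = int(raw)
            except ValueError:
                attrs[key] = raw  # leave anything else as raw text
    return attrs


def is_stale_request(attrs, jobs_in_queue):
    """Return True if idle glideins are still requested although the queue is empty."""
    idle = attrs.get("ReqIdleGlideins", 0)
    return jobs_in_queue == 0 and isinstance(idle, int) and idle > 0


# Excerpt of the glideclient classad captured after all jobs had completed.
SAMPLE = """\
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
"""

if __name__ == "__main__":
    attrs = parse_classad(SAMPLE)
    print(attrs["ReqIdleGlideins"], attrs["ReqMaxGlideins"])
    print(is_stale_request(attrs, jobs_in_queue=0))
```

The queue length is passed in by the caller (for example, counted from condor_q on the DE side), so the sketch makes no assumption about how the DE itself computes its requests.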
After excessively requesting glideins, at some point the glideclient classad vanishes from the factory, after which no more glideins are requested. Because the classad is removed when it expires, it makes sense that the requests stop at that point.
The same behavior, glideins being requested even though the DE job queue is empty, was also observed in a couple of other instances:
On a new setup of my DE development instance, multiple glideins were excessively requested and run in the factory, while all sources except source1 went offline. Only test_channel was steady; resource_request went offline. This further confirmed that the problem is on the DE side.
I reached out to Vito to see whether he had experienced something similar, since he was running tests on the DE daily. He confirmed that he has been seeing the exact same behavior, where new glideins show up even after his jobs have completed. This was reported at the weekly DE meeting just before winter break to find out whether it was the expected behavior. The outcome of the discussion was that this behavior is not acceptable: if there are no further jobs to run, the glideins queue up and are wasted because they are never used.
I discussed my findings and observations with Marco; his inputs follow:
Glidein removal in Decision Engine is based on a version of the frontend logic
From my observations, this seems to be a problem on the DE side and not the factory. The factory requests glideins based on the client’s requests, so the core problem of requesting glideins while the DE job queue is empty is not a factory/GlideinWMS issue. Since the DE is making these requests, something in the DE is likely faulty.
There is a configuration option that could be used to address this. DE has an option that defines how glidein removal should happen. There is a parameter named reserve that overcommits the number of glideins [link here]. The idle_glideins_per_entry tag has documentation for max but not for reserve.
Initially, we thought this might be a bug in the GlideinWMS factory. After the investigation, it seems more likely that the DE is requesting more glideins than needed. One suggestion was to thoroughly review the DE code to avoid this scenario.