Investigate operator restart re: leader election #116

Open
trent-s opened this issue Aug 20, 2024 · 6 comments

trent-s commented Aug 20, 2024

8 restarts in about 24 hours. This does not look like a memory leak.

2024-08-20T02:50:51Z	INFO	susql	SusQL configuration values at runtime
2024-08-20T02:50:51Z	INFO	susql	metricsAddr=127.0.0.1:9999
2024-08-20T02:50:51Z	INFO	susql	enableLeaderElection=true
2024-08-20T02:50:51Z	INFO	susql	probeAddr=:8081
2024-08-20T02:50:51Z	INFO	susql	keplerPrometheusUrl=https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
2024-08-20T02:50:51Z	INFO	susql	keplerMetricName=kepler_container_joules_total
2024-08-20T02:50:51Z	INFO	susql	susqlPrometheusMetricsUrl=http://0.0.0.0:8082
2024-08-20T02:50:51Z	INFO	susql	susqlPrometheusDatabaseUrl=https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
2024-08-20T02:50:51Z	INFO	susql	samplingRate=2
2024-08-20T02:50:51Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:9999"}
2024-08-20T02:50:51Z	INFO	susql	Setting up labelGroupReconciler.
2024-08-20T02:50:51Z	LEVEL(-5)	susql	[SetupWithManager] Initializing Metrics Exporter.
2024-08-20T02:50:51Z	LEVEL(-5)	susql	Entering InitializeMetricsExporter().
2024-08-20T02:50:51Z	LEVEL(-2)	susql	[InitializeMetricsExporter] Serving metrics at '0.0.0.0:8082'...

2024-08-20T02:50:51Z	INFO	susql	Adding healthz check.
2024-08-20T02:50:51Z	INFO	susql	Starting manager.
2024-08-20T02:50:51Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:9999"}
2024-08-20T02:50:51Z	INFO	starting server	{"kind": "pprof", "addr": "127.0.0.1:6060"}
2024-08-20T02:50:51Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0820 02:50:51.299108       1 leaderelection.go:250] attempting to acquire leader lease default/cac735ee.ibm.com...
E0820 02:51:15.238669       1 leaderelection.go:332] error retrieving resource lock default/cac735ee.ibm.com: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
I0820 02:51:36.311677       1 leaderelection.go:260] successfully acquired lease default/cac735ee.ibm.com
2024-08-20T02:51:36Z	DEBUG	events	susql-operator-susql-controller-manager-648c4459f8-2zck9_41b70e33-6516-4299-9a10-8028401c735b became leader	{"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309040020"}, "reason": "LeaderElection"}
2024-08-20T02:51:36Z	INFO	Starting EventSource	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "source": "kind source: *v1.LabelGroup"}
2024-08-20T02:51:36Z	INFO	Starting EventSource	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "source": "kind source: *v1.Pod"}
2024-08-20T02:51:36Z	INFO	Starting Controller	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup"}
2024-08-20T02:51:36Z	INFO	Starting workers	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "worker count": 1}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.17 MB  TotalAlloc=49.16 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Unable to get podlist.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] LabelName: gpu
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Namespace: default
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] KubernetesLabels: map[string]string{"susql.label/1":"gpu"}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] podNamesinNamespace: []
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] ctx: &context.valueCtx{Context:(*context.valueCtx)(0xc001c84330), key:controller.reconcileIDKey{}, val:"05f0ca4f-5109-4878-9ca8-31c9d3c09896"}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.19 MB  TotalAlloc=49.17 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 746544.000000 for map[susql_label_1:model-1 susql_label_2:exp-1 susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.64 MB  TotalAlloc=49.62 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 388049.000000 for map[susql_label_1:model-1 susql_label_2:exp-2 susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.73 MB  TotalAlloc=49.72 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 1769013.000000 for map[susql_label_1:model-1 susql_label_2: susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.86 MB  TotalAlloc=49.84 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
E0820 02:51:50.609738       1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0820 02:51:50.609788       1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-20T02:51:50Z	ERROR	susql	problem running manager	{"error": "leader election lost"}
main.main
	/workspace/cmd/main.go:167
runtime.main
	/usr/local/go/src/runtime/proc.go:267
2024-08-20T02:51:50Z	DEBUG	events	susql-operator-susql-controller-manager-648c4459f8-2zck9_41b70e33-6516-4299-9a10-8028401c735b stopped leading	{"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309040038"}, "reason": "LeaderElection"}
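For reference, this is roughly the mechanism behind the restarts: the leader elector fails to renew the Lease, mgr.Start() returns "leader election lost", main exits non-zero, and the kubelet restarts the container. The sketch below assumes the usual controller-runtime wiring and is illustrative only, not the actual contents of /workspace/cmd/main.go.

// Illustrative sketch only (assumed typical controller-runtime wiring, not the real cmd/main.go):
// when the Lease cannot be renewed, mgr.Start() returns "leader election lost",
// main exits non-zero, and the kubelet restarts the container.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())
	setupLog := ctrl.Log.WithName("susql")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,               // enableLeaderElection=true, as in the log above
		LeaderElectionID: "cac735ee.ibm.com", // the Lease name seen in the events above
	})
	if err != nil {
		setupLog.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// Start() blocks until the context is cancelled or leadership is lost.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "problem running manager") // matches the ERROR line in the log
		os.Exit(1)                                     // non-zero exit => kubelet restarts the container
	}
}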
trent-s changed the title from "Investigate operator restarting re: leader election" to "Investigate operator restart re: leader election" on Aug 20, 2024

trent-s commented Aug 22, 2024

Another example, though this one may be a timer expiration issue...

2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile] Entered Reconcile().
2024-08-21T11:57:19Z LEVEL(-5) susql Memory: Alloc=15.02 MB TotalAlloc=510.59 MB Sys= 40.79 MB NumGC=76
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] Entered aggregating case.
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-g2dsc",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-g2dsc",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-55brf",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-55brf",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-cgz5x",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-cgz5x",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-rf5ts",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-rf5ts",container_namespace="default",mode="idle"})
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Results: '{} => 5255 @[1724241439.898]'
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Container id value is 5255.000000.
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] ActiveContainerIds: map[string]float64{"":5255}
E0821 11:57:20.798929 1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0821 11:57:20.798972 1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-21T11:57:20Z ERROR susql problem running manager {"error": "leader election lost"}
main.main
/workspace/cmd/main.go:195
runtime.main
/usr/local/go/src/runtime/proc.go:267
2024-08-21T11:57:20Z DEBUG events susql-operator-susql-controller-manager-75db56496d-pcxq4_e4fa0af1-fc7d-4043-a541-c46df0209e91 stopped leading {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"313705866"}, "reason": "LeaderElection"}


trent-s commented Aug 26, 2024

Another one, which happened 10-20 minutes after deploying...

E0826 07:18:24.614409 1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0826 07:18:24.614456 1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-26T07:18:24Z ERROR susql problem running manager {"error": "leader election lost"}
main.main
/workspace/cmd/main.go:209
runtime.main
/usr/local/go/src/runtime/proc.go:267
2024-08-26T07:18:24Z INFO Stopping and waiting for non leader election runnables
2024-08-26T07:18:24Z INFO Stopping and waiting for leader election runnables


trent-s commented Sep 4, 2024

Just another one:

E0904 02:35:02.198519       1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/cac735ee.ibm.com": context deadline exceeded
I0904 02:35:02.198573       1 leaderelection.go:285] failed to renew lease openshift-operators/cac735ee.ibm.com: timed out waiting for the condition
2024-09-04T02:35:02Z	INFO	Stopping and waiting for non leader election runnables
2024-09-04T02:35:02Z	ERROR	susql	problem running manager	{"error": "leader election lost"}
main.main
	/workspace/cmd/main.go:255
runtime.main
	/usr/local/go/src/runtime/proc.go:267
2024-09-04T02:35:02Z	INFO	Stopping and waiting for leader election runnables
2024-09-04T02:35:02Z	INFO	Stopping and waiting for caches
2024-09-04T02:35:02Z	INFO	Stopping and waiting for webhooks
2024-09-04T02:35:02Z	INFO	Stopping and waiting for HTTP servers
2024-09-04T02:35:02Z	DEBUG	events	susql-operator-susql-controller-manager-6d8c448658-vvbwj_87771c2b-09b3-406e-9d9c-723f5d92a85e stopped leading	{"type": "Normal", "object": {"kind":"Lease","namespace":"openshift-operators","name":"cac735ee.ibm.com","uid":"7984402b-c797-4f2e-9e4b-a750569f7c24","apiVersion":"coordination.k8s.io/v1","resourceVersion":"355442869"}, "reason": "LeaderElection"}


trent-s commented Sep 9, 2024

We had only implemented the LEADER-ELECT=true case, which enables election of the operator controller leader. Since the false case had not been implemented yet, I implemented it (in PR 128) to mean that the controller uses the "leader for life" approach. There are pros and cons to both approaches, but I am mostly interested in seeing whether the "leader for life" approach resolves the container restarts we are seeing. (This will be available for easy installation via OperatorHub(s) shortly...) A rough sketch of what the two modes mean is below.
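The sketch is illustrative only; the flag name, lock name, and wiring are assumptions, not the exact PR 128 change.

// leaderElect=true  -> lease-based election via controller-runtime (what the logs above show).
// leaderElect=false -> "leader for life" via operator-lib: the lock is taken once at startup
//                      and never renewed, so a slow API server cannot cause "leader election lost".
package main

import (
	"context"
	"flag"
	"os"

	"github.com/operator-framework/operator-lib/leader"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical flag; the real operator may read this setting differently.
	leaderElect := flag.Bool("leader-elect", true, "use lease-based leader election")
	flag.Parse()

	if !*leaderElect {
		// "Leader for life": the lock is released only when the owning pod is deleted.
		if err := leader.Become(context.TODO(), "susql-operator-lock"); err != nil { // lock name is illustrative
			os.Exit(1)
		}
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   *leaderElect,
		LeaderElectionID: "cac735ee.ibm.com",
	})
	if err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}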


trent-s commented Sep 11, 2024

So far, the "leader for life" approach seems to be a useful, user-configurable workaround.

trent-s self-assigned this on Sep 11, 2024

trent-s commented Sep 11, 2024

Apparently this is a known issue with operators in general, and another common mitigation is to use longer leader-election timeout values. Since our design typically has a single long-running manager, I am thinking of making false (i.e., "leader for life") the default value as a solution for now.
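For completeness, the "longer timeouts" mitigation usually means relaxing the election timings on the manager (controller-runtime defaults are a 15s lease duration, 10s renew deadline, and 2s retry period). The values below are examples only, not a tested recommendation.

// Example only: longer leader-election timings give a slow or briefly unreachable
// API server more time before the elector gives up with "leader election lost".
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func managerOptions() ctrl.Options {
	// Controller-runtime defaults are 15s / 10s / 2s; these values are illustrative.
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second
	retryPeriod := 10 * time.Second

	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "cac735ee.ibm.com",
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	}
}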
