Investigate operator restart re: leader election #116

Open
trent-s opened this issue Aug 20, 2024 · 6 comments

trent-s commented Aug 20, 2024

8 restarts in about 24 hours. This does not look like a memory leak.

2024-08-20T02:50:51Z	INFO	susql	SusQL configuration values at runtime
2024-08-20T02:50:51Z	INFO	susql	metricsAddr=127.0.0.1:9999
2024-08-20T02:50:51Z	INFO	susql	enableLeaderElection=true
2024-08-20T02:50:51Z	INFO	susql	probeAddr=:8081
2024-08-20T02:50:51Z	INFO	susql	keplerPrometheusUrl=https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
2024-08-20T02:50:51Z	INFO	susql	keplerMetricName=kepler_container_joules_total
2024-08-20T02:50:51Z	INFO	susql	susqlPrometheusMetricsUrl=http://0.0.0.0:8082
2024-08-20T02:50:51Z	INFO	susql	susqlPrometheusDatabaseUrl=https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
2024-08-20T02:50:51Z	INFO	susql	samplingRate=2
2024-08-20T02:50:51Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:9999"}
2024-08-20T02:50:51Z	INFO	susql	Setting up labelGroupReconciler.
2024-08-20T02:50:51Z	LEVEL(-5)	susql	[SetupWithManager] Initializing Metrics Exporter.
2024-08-20T02:50:51Z	LEVEL(-5)	susql	Entering InitializeMetricsExporter().
2024-08-20T02:50:51Z	LEVEL(-2)	susql	[InitializeMetricsExporter] Serving metrics at '0.0.0.0:8082'...

2024-08-20T02:50:51Z	INFO	susql	Adding healthz check.
2024-08-20T02:50:51Z	INFO	susql	Starting manager.
2024-08-20T02:50:51Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:9999"}
2024-08-20T02:50:51Z	INFO	starting server	{"kind": "pprof", "addr": "127.0.0.1:6060"}
2024-08-20T02:50:51Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0820 02:50:51.299108       1 leaderelection.go:250] attempting to acquire leader lease default/cac735ee.ibm.com...
E0820 02:51:15.238669       1 leaderelection.go:332] error retrieving resource lock default/cac735ee.ibm.com: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
I0820 02:51:36.311677       1 leaderelection.go:260] successfully acquired lease default/cac735ee.ibm.com
2024-08-20T02:51:36Z	DEBUG	events	susql-operator-susql-controller-manager-648c4459f8-2zck9_41b70e33-6516-4299-9a10-8028401c735b became leader	{"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309040020"}, "reason": "LeaderElection"}
2024-08-20T02:51:36Z	INFO	Starting EventSource	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "source": "kind source: *v1.LabelGroup"}
2024-08-20T02:51:36Z	INFO	Starting EventSource	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "source": "kind source: *v1.Pod"}
2024-08-20T02:51:36Z	INFO	Starting Controller	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup"}
2024-08-20T02:51:36Z	INFO	Starting workers	{"controller": "labelgroup", "controllerGroup": "susql.ibm.com", "controllerKind": "LabelGroup", "worker count": 1}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.17 MB  TotalAlloc=49.16 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Unable to get podlist.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] LabelName: gpu
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Namespace: default
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] KubernetesLabels: map[string]string{"susql.label/1":"gpu"}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] podNamesinNamespace: []
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] ctx: &context.valueCtx{Context:(*context.valueCtx)(0xc001c84330), key:controller.reconcileIDKey{}, val:"05f0ca4f-5109-4878-9ca8-31c9d3c09896"}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.19 MB  TotalAlloc=49.17 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 746544.000000 for map[susql_label_1:model-1 susql_label_2:exp-1 susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.64 MB  TotalAlloc=49.62 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 388049.000000 for map[susql_label_1:model-1 susql_label_2:exp-2 susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.73 MB  TotalAlloc=49.72 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-kgkp5",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-vx695",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[SetAggregatedEnergyForLabels] Setting energy 1769013.000000 for map[susql_label_1:model-1 susql_label_2: susql_label_3: susql_label_4: susql_label_5: susql_label_6:].
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile] Entered Reconcile().
2024-08-20T02:51:36Z	LEVEL(-5)	susql	Memory: Alloc=18.86 MB  TotalAlloc=49.84 MB  Sys= 39.85 MB  NumGC=10
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] Entered aggregating case.
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-jxnrd",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-cbjh5",container_namespace="default",mode="idle"})
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[GetMetricValuesForPodNames] Results: ''
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-20T02:51:36Z	LEVEL(-5)	susql	[Reconcile-Aggregating] ActiveContainerIds: map[string]float64{}
E0820 02:51:50.609738       1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0820 02:51:50.609788       1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-20T02:51:50Z	ERROR	susql	problem running manager	{"error": "leader election lost"}
main.main
	/workspace/cmd/main.go:167
runtime.main
	/usr/local/go/src/runtime/proc.go:267
2024-08-20T02:51:50Z	DEBUG	events	susql-operator-susql-controller-manager-648c4459f8-2zck9_41b70e33-6516-4299-9a10-8028401c735b stopped leading	{"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309040038"}, "reason": "LeaderElection"}
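For reference, this is roughly the mechanism behind the restarts: the leader elector fails to renew the Lease, mgr.Start() returns "leader election lost", main exits non-zero, and the kubelet restarts the container. The sketch below assumes the usual controller-runtime wiring and is illustrative only, not the actual contents of /workspace/cmd/main.go.

// Illustrative sketch only (assumed typical controller-runtime wiring, not the real cmd/main.go):
// when the Lease cannot be renewed, mgr.Start() returns "leader election lost",
// main exits non-zero, and the kubelet restarts the container.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())
	setupLog := ctrl.Log.WithName("susql")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,               // enableLeaderElection=true, as in the log above
		LeaderElectionID: "cac735ee.ibm.com", // the Lease name seen in the events above
	})
	if err != nil {
		setupLog.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// Start() blocks until the context is cancelled or leadership is lost.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "problem running manager") // matches the ERROR line in the log
		os.Exit(1)                                     // non-zero exit => kubelet restarts the container
	}
}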
trent-s changed the title from "Investigate operator restarting re: leader election" to "Investigate operator restart re: leader election" on Aug 20, 2024

trent-s commented Aug 22, 2024

Another example, though this one may be a timer expiration issue...

2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile] Entered Reconcile().
2024-08-21T11:57:19Z LEVEL(-5) susql Memory: Alloc=15.02 MB TotalAlloc=510.59 MB Sys= 40.79 MB NumGC=76
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] Entered aggregating case.
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Query: sum(kepler_container_joules_total{pod_name="energy-consumer-job-g2dsc",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="energy-consumer-job-g2dsc",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-1-55brf",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-1-55brf",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-cgz5x",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-cgz5x",container_namespace="default",mode="idle"})+sum(kepler_container_joules_total{pod_name="training-job-2-rf5ts",container_namespace="default",mode="dynamic"})+sum(kepler_container_joules_total{pod_name="training-job-2-rf5ts",container_namespace="default",mode="idle"})
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Results: '{} => 5255 @[1724241439.898]'
2024-08-21T11:57:19Z LEVEL(-5) susql [GetMetricValuesForPodNames] Container id value is 5255.000000.
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] metricValues: map[string]float64{}
2024-08-21T11:57:19Z LEVEL(-5) susql [Reconcile-Aggregating] ActiveContainerIds: map[string]float64{"":5255}
E0821 11:57:20.798929 1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0821 11:57:20.798972 1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-21T11:57:20Z ERROR susql problem running manager {"error": "leader election lost"}
main.main
/workspace/cmd/main.go:195
runtime.main
/usr/local/go/src/runtime/proc.go:267
2024-08-21T11:57:20Z DEBUG events susql-operator-susql-controller-manager-75db56496d-pcxq4_e4fa0af1-fc7d-4043-a541-c46df0209e91 stopped leading {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cac735ee.ibm.com","uid":"e924c11e-9e55-44d3-ac88-306371e82433","apiVersion":"coordination.k8s.io/v1","resourceVersion":"313705866"}, "reason": "LeaderElection"}


trent-s commented Aug 26, 2024

Another one, which happened 10-20 minutes after deploying...

E0826 07:18:24.614409 1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/cac735ee.ibm.com": context deadline exceeded
I0826 07:18:24.614456 1 leaderelection.go:285] failed to renew lease default/cac735ee.ibm.com: timed out waiting for the condition
2024-08-26T07:18:24Z ERROR susql problem running manager {"error": "leader election lost"}
main.main
/workspace/cmd/main.go:209
runtime.main
/usr/local/go/src/runtime/proc.go:267
2024-08-26T07:18:24Z INFO Stopping and waiting for non leader election runnables
2024-08-26T07:18:24Z INFO Stopping and waiting for leader election runnables


trent-s commented Sep 4, 2024

Just another one:

E0904 02:35:02.198519       1 leaderelection.go:369] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/cac735ee.ibm.com": context deadline exceeded
I0904 02:35:02.198573       1 leaderelection.go:285] failed to renew lease openshift-operators/cac735ee.ibm.com: timed out waiting for the condition
2024-09-04T02:35:02Z	INFO	Stopping and waiting for non leader election runnables
2024-09-04T02:35:02Z	ERROR	susql	problem running manager	{"error": "leader election lost"}
main.main
	/workspace/cmd/main.go:255
runtime.main
	/usr/local/go/src/runtime/proc.go:267
2024-09-04T02:35:02Z	INFO	Stopping and waiting for leader election runnables
2024-09-04T02:35:02Z	INFO	Stopping and waiting for caches
2024-09-04T02:35:02Z	INFO	Stopping and waiting for webhooks
2024-09-04T02:35:02Z	INFO	Stopping and waiting for HTTP servers
2024-09-04T02:35:02Z	DEBUG	events	susql-operator-susql-controller-manager-6d8c448658-vvbwj_87771c2b-09b3-406e-9d9c-723f5d92a85e stopped leading	{"type": "Normal", "object": {"kind":"Lease","namespace":"openshift-operators","name":"cac735ee.ibm.com","uid":"7984402b-c797-4f2e-9e4b-a750569f7c24","apiVersion":"coordination.k8s.io/v1","resourceVersion":"355442869"}, "reason": "LeaderElection"}


trent-s commented Sep 9, 2024

We had only implemented the LEADER-ELECT=true case, which enables election of the operator controller leader. Since the false case had not been implemented yet, I implemented it (in PR 128) to mean that the controller uses the "leader for life" approach. There are pros and cons to both approaches, but I am mostly interested in seeing whether the "leader for life" approach resolves the container restarts we are seeing. (This will be available for easy installation via OperatorHub(s) shortly...) A rough sketch of what the two modes mean is below.
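The sketch is illustrative only; the flag name, lock name, and wiring are assumptions, not the exact PR 128 change.

// leaderElect=true  -> lease-based election via controller-runtime (what the logs above show).
// leaderElect=false -> "leader for life" via operator-lib: the lock is taken once at startup
//                      and never renewed, so a slow API server cannot cause "leader election lost".
package main

import (
	"context"
	"flag"
	"os"

	"github.com/operator-framework/operator-lib/leader"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical flag; the real operator may read this setting differently.
	leaderElect := flag.Bool("leader-elect", true, "use lease-based leader election")
	flag.Parse()

	if !*leaderElect {
		// "Leader for life": the lock is released only when the owning pod is deleted.
		if err := leader.Become(context.TODO(), "susql-operator-lock"); err != nil { // lock name is illustrative
			os.Exit(1)
		}
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   *leaderElect,
		LeaderElectionID: "cac735ee.ibm.com",
	})
	if err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}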


trent-s commented Sep 11, 2024

So far, the "leader for life" approach seems to be a useful, user-configurable workaround.

trent-s self-assigned this on Sep 11, 2024

trent-s commented Sep 11, 2024

Apparently this is a known issue with operators in general, and another common mitigation is to use longer leader-election timeout values. Since our design typically has a single long-running manager, I am thinking of making false (i.e., "leader for life") the default value as a solution for now.
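For completeness, the "longer timeouts" mitigation usually means relaxing the election timings on the manager (controller-runtime defaults are a 15s lease duration, 10s renew deadline, and 2s retry period). The values below are examples only, not a tested recommendation.

// Example only: longer leader-election timings give a slow or briefly unreachable
// API server more time before the elector gives up with "leader election lost".
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func managerOptions() ctrl.Options {
	// Controller-runtime defaults are 15s / 10s / 2s; these values are illustrative.
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second
	retryPeriod := 10 * time.Second

	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "cac735ee.ibm.com",
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	}
}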
