Bug: Autoscaler stuck #1627

Open
bernardhalas opened this issue Jan 4, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@bernardhalas
Member

Claudie 0.9.2

Current Behaviour

The autoscaler appears stuck after terraformer finishes; there are no signs of activity in kube-eleven.

Expected Behaviour

4 nodes are added to the cluster.

Steps To Reproduce

A simple nginx deployment was created:
kubectl create deployment nginx --image=nginx

Added resources.requests of cpu: 1 and memory: 1Gi, then scaled up to 6 replicas:
kubectl scale deployment/nginx --replicas=6
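
The resource requests were added along these lines; this is a sketch rather than the exact command used (kubectl set resources is one way to do it):
kubectl set resources deployment/nginx --requests=cpu=1,memory=1Gi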

The terraformer logs show the 4 new nodes being created successfully. The builder log contains only:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder

And kube-eleven shows no indication of the needed scale-up after the initial cluster creation:

2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest

Deleting the builder and kube-eleven pods does not seem to change the state of things.
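
One way to confirm that the new VMs never joined the cluster is to list the nodes of the Claudie-built cluster; a minimal sketch, assuming the kubeconfig has been exported locally (the filename here is hypothetical):
kubectl get nodes --kubeconfig=gcp-cluster-kubeconfig.yaml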

This was executed on the following InputManifest:

apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-example-manifest
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      secretRef:
        name: gcp-secret-1
        namespace: claudie

  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: europe-west1
          zone: europe-west1-c
        count: 1
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206

      - name: compute-1-gcp
        providerSpec:
          name: gcp-1
          region: europe-west3
          zone: europe-west3-a
        count: 2
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

      - name: compute-2-gcp
        providerSpec:
          name: gcp-1
          region: europe-west2
          zone: europe-west2-a
        autoscaler:
          min: 0
          max: 5
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

  kubernetes:
    clusters:
      - name: gcp-cluster
        version: v1.29.0
        network: 192.168.2.0/24
        pools:
          control:
            - control-gcp
          compute:
            - compute-1-gcp
            - compute-2-gcp
@bernardhalas bernardhalas added the bug Something isn't working label Jan 4, 2025
@Despire
Contributor

Despire commented Jan 6, 2025

@bernardhalas Did the builder restart? There is a ~3-hour difference based on the logs:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder
2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest
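
A restart of the builder pod would also show up in its restart count; a quick check, assuming Claudie is deployed in the claudie namespace:
kubectl get pods -n claudie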

@bernardhalas
Member Author

bernardhalas commented Jan 7, 2025

Deleting the builder and kube-eleven pods does not seem to change the state of things.

Yes, the pod was force-restarted, and the messages were the same.
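
A force-restart like this can be done with, for example, the following; the deployment name and namespace are assumptions:
kubectl rollout restart deployment/builder -n claudie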

I tried to reproduce this a few times but couldn't. I saw similar behavior once when the autoscaler was downsizing the nodepool, but that also occurred only once. I'll spend more time on this if the situation allows; otherwise we'll close this as unreproducible.

@Despire
Contributor

Despire commented Jan 7, 2025

@bernardhalas

I assume the following happened: the builder service was restarted, whether by you or by being OOM-killed (#1512).

When this happens, the manifest will not be rescheduled again for another 2 hours, which I think is wrong; an issue was created for this a long time ago (#1316).

It's hard to say without the logs of the crashed builder pod, though.
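
If it reproduces, the logs of the previous (crashed) builder container can usually be recovered with something like the following; the namespace and deployment name are assumptions:
kubectl logs -n claudie deployment/builder --previous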

@bernardhalas
Member Author

The builder was restarted intentionally because the autoscaler was already stuck before that. Apologies for the confusion caused. So the problem occurred first (the VM was created but not added to the cluster), and after ~3 hrs I restarted the builder to see if it would fix the problem. It didn't help.
