Bug: Autoscaler stuck #1627

Open
bernardhalas opened this issue Jan 4, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@bernardhalas
Member

Claudie 0.9.2

Current Behaviour

The autoscaler appears stuck after terraformer finishes; there are no signs of activity in kube-eleven.

Expected Behaviour

4 nodes are added to the cluster.

Steps To Reproduce

A simple nginx deployment was created:
kubectl create deployment nginx --image=nginx

Added resources.requests of cpu: 1 and memory: 1Gi, then scaled up to 6 replicas:
kubectl scale deployment/nginx --replicas=6
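
The resource requests were added along these lines; this is a sketch rather than the exact command used (kubectl set resources is one way to do it):
kubectl set resources deployment/nginx --requests=cpu=1,memory=1Gi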

The terraformer logs show the 4 new nodes being created successfully. The builder log contains only:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder

And kube-eleven shows no indication of the needed scale-up after the initial cluster creation:

2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest

Deleting the builder and kube-eleven pods does not seem to change the state of things.
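
One way to confirm that the new VMs never joined the cluster is to list the nodes of the Claudie-built cluster; a minimal sketch, assuming the kubeconfig has been exported locally (the filename here is hypothetical):
kubectl get nodes --kubeconfig=gcp-cluster-kubeconfig.yaml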

This was executed on the following InputManifest:

apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-example-manifest
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      secretRef:
        name: gcp-secret-1
        namespace: claudie

  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: europe-west1
          zone: europe-west1-c
        count: 1
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206

      - name: compute-1-gcp
        providerSpec:
          name: gcp-1
          region: europe-west3
          zone: europe-west3-a
        count: 2
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

      - name: compute-2-gcp
        providerSpec:
          name: gcp-1
          region: europe-west2
          zone: europe-west2-a
        autoscaler:
          min: 0
          max: 5
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

  kubernetes:
    clusters:
      - name: gcp-cluster
        version: v1.29.0
        network: 192.168.2.0/24
        pools:
          control:
            - control-gcp
          compute:
            - compute-1-gcp
            - compute-2-gcp
@bernardhalas bernardhalas added the bug Something isn't working label Jan 4, 2025
@Despire
Contributor

Despire commented Jan 6, 2025

@bernardhalas Did the builder restart? There is a ~3-hour difference based on the logs:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder
2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest
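
A restart of the builder pod would also show up in its restart count; a quick check, assuming Claudie is deployed in the claudie namespace:
kubectl get pods -n claudie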

@bernardhalas
Member Author

bernardhalas commented Jan 7, 2025

Deleting the builder and kube-eleven pods does not seem to change the state of things.

Yes, the pod was force-restarted, and the messages were the same.
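
A force-restart like this can be done with, for example, the following; the deployment name and namespace are assumptions:
kubectl rollout restart deployment/builder -n claudie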

I tried to reproduce this a few times but couldn't. I saw similar behavior once when the autoscaler was downsizing the nodepool, but that also occurred only once. I'll spend more time on this if the situation allows; otherwise we'll close this as unreproducible.

@Despire
Contributor

Despire commented Jan 7, 2025

@bernardhalas

I assume the following happened: the builder service was restarted, whether by you or by being OOM-killed (#1512).

When this happens, the manifest will not be rescheduled again for another 2 hours, which I think is wrong; an issue was created for this a long time ago (#1316).

It's hard to say without the logs of the crashed builder pod, though.
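
If it reproduces, the logs of the previous (crashed) builder container can usually be recovered with something like the following; the namespace and deployment name are assumptions:
kubectl logs -n claudie deployment/builder --previous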

@bernardhalas
Member Author

The builder was restarted intentionally because the autoscaler was already stuck before that. Apologies for the confusion caused. So the problem occurred first (the VM was created but not added to the cluster), and after ~3 hrs I restarted the builder to see if it would fix the problem. It didn't help.
