CockroachDB doesnt start and init doesnt seem to launch #402

Open
DelaunayAntoine opened this issue Jul 18, 2024 · 11 comments

@DelaunayAntoine

Hello everyone,

I would like to deploy CockroachDB using Helm, but the cluster can't start and this error keeps appearing:
Error I240718 13:09:06.023104 191 server/init.go:405 ⋮ [T1,Vsystem,n?] 37 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry

Can you help me by giving me some hints on how to fix the problem?

Here's the entire log file and the values.yaml file
db.txt
values-cockroach.txt

I'm using cockroach version 24.1.1
The chart 13.0.1

What do you expect to see?

The cockroach cluster launching just fine

What happened

Error I240718 13:09:06.023104 191 server/init.go:405 ⋮ [T1,Vsystem,n?] 37 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry

@lknite

lknite commented Aug 26, 2024

I'm seeing this as well. Did you figure it out?

I see this in the log; it looks like it's trying the wrong URL for the pods:

W240826 16:03:30.251454 142 server/init.go:407 ⋮ [T1,Vsystem,n?] 37  outgoing join rpc to ‹keycloak-cockroachdb-1.keycloak-cockroachdb.keycloak.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup keycloak-cockroachdb-1.keycloak-cockroachdb.keycloak.svc.cluster.local: no such host"›
I240826 16:03:30.258373 142 server/init.go:405 ⋮ [T1,Vsystem,n?] 38  ‹keycloak-cockroachdb-2.keycloak-cockroachdb.keycloak.svc.cluster.local:26257› is itself waiting for init, will retr

In my case it's adding 'keycloak-cockroachdb.' and in your case it's adding 'cockroachdb.', which looks like it shouldn't be there.
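
For context, that extra segment is expected: Kubernetes gives each StatefulSet pod behind a headless Service a DNS name of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. Below is a minimal sketch of how the names in the logs above are assembled. The function name `pod_fqdn` is made up for illustration, and `cluster.local` is assumed as the cluster domain.

```shell
#!/bin/sh
# pod_fqdn STATEFULSET ORDINAL SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Prints the stable DNS name of one StatefulSet pod behind a headless Service.
# (Hypothetical helper, not part of the chart.)
pod_fqdn() (
  sts="$1"; ordinal="$2"; svc="$3"; ns="$4"
  domain="${5:-cluster.local}"   # assumption: default cluster domain
  printf '%s-%s.%s.%s.svc.%s\n' "$sts" "$ordinal" "$svc" "$ns" "$domain"
)

# Release "keycloak" in namespace "keycloak": the StatefulSet and the headless
# Service are both named keycloak-cockroachdb, hence the repeated segment.
pod_fqdn keycloak-cockroachdb 1 keycloak-cockroachdb keycloak
```

Since the name follows the normal pattern, a `no such host` error more likely means the pod was not resolvable at that moment than that the URL is malformed.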

@apavarnitsyn

apavarnitsyn commented Sep 24, 2024

I've got the same problem with the latest chart, 14.0.3.
I suspect the cause is the Helm hook annotations on the init Job template:
https://github.com/cockroachdb/helm-charts/blob/master/cockroachdb/templates/job.init.yaml#L22
The post-install hook can't be triggered because the StatefulSet is not ready.
As a workaround, you can deploy the init Job manifest from the template manually.
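
One way this workaround might look, as an untested sketch: render only the init Job with `helm template --show-only` and apply it with `kubectl`. The function name `render_init_job`, the `HELM` and `CHART` override variables, and the repo alias `cockroachdb/cockroachdb` are assumptions for illustration. The template path is taken from the link above and may differ between chart versions.

```shell
#!/bin/sh
# render_init_job RELEASE NAMESPACE VALUES_FILE
# Renders only the init Job from the chart, using the same values as the release.
# HELM and CHART can be overridden (e.g. HELM=echo for a dry run); both are
# hypothetical knobs added for this sketch.
render_init_job() (
  release="$1"; ns="$2"; values="$3"
  "${HELM:-helm}" template "$release" "${CHART:-cockroachdb/cockroachdb}" \
    --namespace "$ns" \
    --values "$values" \
    --show-only templates/job.init.yaml
)

# Example (requires helm, kubectl, and cluster access):
#   render_init_job cockroachdb cockroachdb values.yaml \
#     | kubectl apply --namespace cockroachdb -f -
```

Applied this way, the Job is an ordinary Job: `kubectl` ignores the `helm.sh/hook` annotations, so it does not wait for the StatefulSet to become ready.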

@udnay

udnay commented Sep 24, 2024

Are either of you able to share your values file? A redacted version is likely fine, just to see what overrides you have set. I have not been able to reproduce this with the default values.

@lknite

lknite commented Nov 9, 2024

@udnay , here ya go:

$ cat Chart.yaml 
apiVersion: v2
name: jellyfin
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
appVersion: "1.0"

dependencies:
- name: jellyfin
  version: 2.1.0
  repository: https://jellyfin.github.io/jellyfin-helm
- name: nats
  version: 1.1.10
  repository: https://nats-io.github.io/k8s/helm/charts/
- name: cockroachdb
  version: 14.0.5
  repository: https://charts.cockroachdb.com

$ cat values.yaml 
nats:

  natsBox:
    enabled: false

Result:

$ k --context prod-admin@prod -n jellyfin get ing,pvc,all
NAME                                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/datadir-jellyfin-cockroachdb-0   Bound    pvc-1098f4ee-2a61-4dc0-944e-d4af39b1e95a   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/datadir-jellyfin-cockroachdb-1   Bound    pvc-6daf0dfa-1f12-432d-9c22-8636433d1c82   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/datadir-jellyfin-cockroachdb-2   Bound    pvc-ff28ab8d-ea1a-46cd-9d07-ef8d6367899f   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/jellyfin-config                  Bound    pvc-de299b6e-a5b4-4926-a8da-e70f93c9fcfa   5Gi        RWO            cephfs         <unset>                 11m
persistentvolumeclaim/jellyfin-media                   Bound    pvc-f921cdea-c8a9-4c0c-a98b-fb46368fa90b   25Gi       RWO            cephfs         <unset>                 11m

NAME                            READY   STATUS    RESTARTS        AGE
pod/jellyfin-6898c4c4bf-m2jl6   1/1     Running   0               11m
pod/jellyfin-cockroachdb-0      0/1     Running   1 (5m3s ago)    11m
pod/jellyfin-cockroachdb-1      0/1     Running   1 (4m27s ago)   11m
pod/jellyfin-cockroachdb-2      0/1     Running   1 (4m25s ago)   11m
pod/jellyfin-nats-0             2/2     Running   0               11m

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
service/jellyfin                      ClusterIP   10.105.39.215   <none>        8096/TCP             11m
service/jellyfin-cockroachdb          ClusterIP   None            <none>        26257/TCP,8080/TCP   11m
service/jellyfin-cockroachdb-public   ClusterIP   10.110.42.45    <none>        26257/TCP,8080/TCP   11m
service/jellyfin-nats                 ClusterIP   10.97.28.92     <none>        4222/TCP             11m
service/jellyfin-nats-headless        ClusterIP   None            <none>        4222/TCP,8222/TCP    11m

Logs of each cockroachdb pod show:

$ k --context prod-admin@prod -n jellyfin logs -f jellyfin-cockroachdb-0 
Defaulted container "db" out of: db, copy-certs (init)
++ hostname
+ exec /cockroach/cockroach start --join=jellyfin-cockroachdb-0.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257,jellyfin-cockroachdb-1.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257,jellyfin-cockroachdb-2.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257 --advertise-host=jellyfin-cockroachdb-0.jellyfin-cockroachdb.jellyfin.svc.cluster.local --certs-dir=/cockroach/cockroach-certs/ --http-port=8080
*
* WARNING: Running a server without --sql-addr, with a combined RPC/SQL listener, is deprecated.
* This feature will be removed in a later version of CockroachDB.
*
*
* INFO: initial startup completed.
* Node will now attempt to join a running cluster, or wait for `cockroach init`.
* Client connections will be accepted after this completes successfully.
* Check the log file(s) for progress. 
*
*
* WARNING: The server appears to be unable to contact the other nodes in the cluster. Please try:
* 
* - starting the other nodes, if you haven't already;
* - double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
* - running the 'cockroach init' command if you are trying to initialize a new cluster.
* 
* If problems persist, please see https://www.cockroachlabs.com/docs/v24.2/cluster-setup-troubleshooting.html.
*

@sebzimmermann

sebzimmermann commented Dec 5, 2024

I have the same issue.

I241205 10:47:05.134177 269 server/init.go:446 ⋮ [T1,Vsystem,n?] 575 ‹cockroachdb-0.cockroachdb.mynamespace.svc.cluster.local:26257› is itself waiting for init, will retry
etc.

@jonasbadstuebner

I think that to reproduce this issue, --wait has to be set on a manual helm install, because it changes how and when the hooks are run:

Note that if the --wait flag is set, the library will wait until all resources are in a ready state and will not run the post-install hook until they are ready

ref: https://helm.sh/docs/topics/charts_hooks/#hooks-and-the-release-lifecycle

With --wait set, a deadlock occurs: the Job/hook waits for the StatefulSet to become ready, but the StatefulSet needs the Job to run the init before it can become ready.

The solution in #195 seems to have been #195 (comment):

with FluxV2 and Helm […] what did the trick was spec.install.disableWait: true

I have not tested this first-hand (yet), but it lines up with what I have read and know about Helm. If people confirm this works, maybe a note somewhere in the chart README would prevent people from getting stuck at this point again.
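
For plain `helm install`, the point above could be enforced with a small guard like the sketch below. The function names `uses_wait` and `helm_install_nowait` and the `HELM` override are hypothetical. The guard also rejects `--atomic`, because Helm 3 turns on `--wait` automatically when `--atomic` is used.

```shell
#!/bin/sh
# uses_wait ARGS...
# Returns 0 if any argument would make Helm wait for resources to become ready.
uses_wait() {
  for arg in "$@"; do
    case "$arg" in
      --wait|--wait=true|--atomic|--atomic=true) return 0 ;;
    esac
  done
  return 1
}

# helm_install_nowait ARGS...
# Runs `helm install ARGS...` unless the arguments would block the init hook.
# HELM can be overridden (e.g. HELM=echo) for a dry run.
helm_install_nowait() {
  if uses_wait "$@"; then
    echo "refusing to install: --wait/--atomic would block the post-install init hook" >&2
    return 1
  fi
  "${HELM:-helm}" install "$@"
}

# Example (requires helm and cluster access):
#   helm_install_nowait cockroachdb cockroachdb/cockroachdb --namespace cockroachdb
```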

@nihr43

nihr43 commented Jan 21, 2025

spec.install.disableWait: true

As in the spec of Job "cockroachdb-init"? I don't even see such a Job created. Just the self signer stuff.
Chart cockroachdb-15.0.5, all default values no overrides.
...
or in values? I see no spec.install in values.yaml master branch

@jonasbadstuebner

Also this issue is a duplicate of #69. (There are so many issues referencing the init job on install...)

My suggestion, even though I didn't try it, would be to combine Helm's .Release.IsInstall with .spec.ttlSecondsAfterFinished for Kubernetes Jobs.
Set the TTL to 0 on install (via Helm templating), otherwise let it be the hook it was meant to be.

That way the Job could be deployed as a simple Job, not a hook, on install, and run as a hook on upgrade (since upgrades don't seem to cause any issues anymore?).

I plan on testing this soon, as I am currently reinstalling a cluster of mine, but if someone else is faster than I am, please try this.

@jonasbadstuebner

spec.install.disableWait: true

As in the spec of Job "cockroachdb-init"? I don't even see such a Job created. Just the self signer stuff. Chart cockroachdb-15.0.5, all default values no overrides. ... or in values? I see no spec.install in values.yaml master branch

This only applies if you use Flux(v2) to deploy cockroachdb. The spec is in reference to the HelmRelease CRD of Flux.

If you install cockroachdb with a helm install command, make sure not to set --wait; then the hook should run.
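
For Flux users, here is a sketch of where that setting lives in a HelmRelease. The function name `render_helmrelease`, the `interval`, and the HelmRepository name are placeholders, and the `apiVersion` may need adjusting for older Flux versions that only serve the beta APIs.

```shell
#!/bin/sh
# render_helmrelease NAME NAMESPACE
# Prints a Flux HelmRelease that installs the chart without waiting for readiness.
# Pipe the output into `kubectl apply -f -` or commit it to the Flux repository.
render_helmrelease() {
  cat <<EOF
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: $1
  namespace: $2
spec:
  interval: 10m
  chart:
    spec:
      chart: cockroachdb
      sourceRef:
        kind: HelmRepository
        name: cockroachdb   # hypothetical HelmRepository for https://charts.cockroachdb.com
  install:
    disableWait: true       # do not wait for readiness, so the post-install init hook can run
EOF
}

render_helmrelease cockroachdb cockroachdb
```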

@nihr43

nihr43 commented Jan 21, 2025

Got it. I'm wrapping this in OpenTofu, which was defaulting to --wait for me. Everything works with wait = false in the .tf file.
thanks

@jonasbadstuebner

For reference, my idea to solve this independently of Helm's --wait flag would be (something like) this:
⚠ Not tested yet (therefore draft PR) ⚠
#450
