[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled) #2735

Open
bluenote10 opened this issue Jan 13, 2025 · 7 comments · May be fixed by #2736
Labels
bug (Something isn't working), stability (Pertains to basic infrastructure stability)

Comments

bluenote10 commented Jan 13, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We are unable to use Ray on Kubernetes because our workers are crashing with out-of-memory errors in the wait-gcs-ready init-container. This results in an infinite backoff loop that keeps re-running the init-container, but it seems it will never succeed, and therefore no workers become available.

For instance, kubectl describe ourclustername-cpu-group-worker-2sbdj reveals:

Init Containers:
  wait-gcs-ready:
    [...]
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 13 Jan 2025 12:17:25 +0100
      Finished:     Mon, 13 Jan 2025 12:18:07 +0100
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi

Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:

corev1.ResourceMemory: resource.MustParse("256Mi"),
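
For context, here is a minimal Go sketch of how such a hardcoded limit ends up on the injected init container. This is an illustration only (abbreviated field set, hypothetical function name), not a verbatim excerpt of the KubeRay operator source:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Sketch only: the operator injects a wait-gcs-ready init container with
// fixed requests and limits, matching the 200m / 256Mi values shown in the
// kubectl describe output above.
func waitGCSReadyInitContainer(image string) corev1.Container {
	return corev1.Container{
		Name:  "wait-gcs-ready",
		Image: image,
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("200m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("200m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
		},
	}
}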

Our assumption is that the pod runs out of memory on this line of the script, which invokes the ray CLI:

if ray health-check --address %s:%s > /dev/null 2>&1; then

To get a rough estimate of the memory usage of that call, one can check with e.g.:

/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"

which reveals a resident set size of around 180 to 190 MB. Accounting for memory used by the rest of the system, 256 Mi may simply not be enough.

Reproduction script

It doesn't really matter, because this is a Kubernetes configuration problem.

But we are basically submitting a simple hello world for testing:

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))

Anything else

How often does the problem occur?

Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:

  • On our production cluster, the failure rate is fortunately close to 0%.
  • On our CI kind cluster, it fails roughly 90% of the time.
  • On some developer machines, it fails nearly 100% of the time.

We do not yet understand why the different environments have such different failure rates.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
bluenote10 added the bug and triage labels on Jan 13, 2025
bluenote10 linked a pull request (#2736) on Jan 13, 2025 that will close this issue
rueian (Contributor) commented Jan 13, 2025

Thank you, @bluenote10. I think your PR works, but we can probably do better by copying the resource requests and limits from the Ray container. Would you like to explore this idea?

rueian (Contributor) commented Jan 13, 2025

cc @kevin85421

bluenote10 (Author) commented

> we can probably do better by copying the resource requests and limits from the Ray container

I was wondering about that as well, but concluded that the memory requirements of the wait-gcs-ready init container are quite different from those of the regular container, right? Essentially, when the main container is set to a large multi-GB limit, the init container would unnecessarily be assigned much more memory than it really needs. It seems to make sense to decouple the requirements of the init container from those of the main container, if I understand it correctly.

rueian (Contributor) commented Jan 13, 2025

> Essentially, when the main container is set to a large multi-GB limit, the init container would unnecessarily be assigned much more memory than it really needs.

That's correct. The wait-gcs-ready init container is definitely lighter than the actual Ray container. But as far as I know, according to https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resource-sharing-within-containers, it is safe to copy the resource requests and limits from the Ray container to the init container, because doing so won't change the pod's effective requests/limits.
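
To illustrate this suggestion, a rough Go sketch of what copying the Ray container's resources onto the init container could look like (hypothetical function name, not actual KubeRay code):

package main

import corev1 "k8s.io/api/core/v1"

// Sketch of the proposed approach: reuse the Ray container's resource
// requests and limits for the wait-gcs-ready init container. Per the
// Kubernetes docs linked above, the pod's effective request/limit is the
// higher of (a) the sum of all app containers and (b) the largest init
// container, so copying the Ray container's resources onto the init
// container does not change the pod's effective requests or limits.
func buildWaitGCSReadyInitContainer(rayContainer corev1.Container) corev1.Container {
	return corev1.Container{
		Name:      "wait-gcs-ready",
		Image:     rayContainer.Image,
		Resources: *rayContainer.Resources.DeepCopy(),
	}
}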

kevin85421 (Member) commented

Hmm, health-check just sends a gRPC request to GCS IIRC. If it really uses that much memory, that would be a bug in Ray.

kevin85421 (Member) commented

@bluenote10 which Ray version do you use and what's your K8s env (e.g. EKS? GKE? K8s version?)?

kevin85421 added the stability label and removed the triage label on Jan 15, 2025
bluenote10 (Author) commented

@kevin85421 We are experiencing this mainly using kind on local developer machines and CI runners. The Kubernetes version is 1.30.6.
