[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled) #2735

Open
bluenote10 opened this issue Jan 13, 2025 · 7 comments · May be fixed by #2736
Labels
bug (Something isn't working), stability (Pertains to basic infrastructure stability)

Comments

bluenote10 commented Jan 13, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We are unable to use Ray on Kubernetes because our workers are crashing with out-of-memory errors in the wait-gcs-ready init-container. This results in an infinite backoff loop that keeps re-running the init-container, but it seems it will never succeed, and therefore no workers become available.

For instance, kubectl describe ourclustername-cpu-group-worker-2sbdj reveals:

Init Containers:
  wait-gcs-ready:
    [...]
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 13 Jan 2025 12:17:25 +0100
      Finished:     Mon, 13 Jan 2025 12:18:07 +0100
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi

Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:

corev1.ResourceMemory: resource.MustParse("256Mi"),
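
For context, here is a minimal Go sketch of how such a hardcoded limit ends up on the injected init container. This is an illustration only (abbreviated field set, hypothetical function name), not a verbatim excerpt of the KubeRay operator source:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Sketch only: the operator injects a wait-gcs-ready init container with
// fixed requests and limits, matching the 200m / 256Mi values shown in the
// kubectl describe output above.
func waitGCSReadyInitContainer(image string) corev1.Container {
	return corev1.Container{
		Name:  "wait-gcs-ready",
		Image: image,
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("200m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("200m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
		},
	}
}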

Our assumption is that the pod runs out of memory on this line of the script, which invokes the ray CLI:

if ray health-check --address %s:%s > /dev/null 2>&1; then

To get a rough estimate of the memory usage of that call, one can check with e.g.:

/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"

which reveals a resident set size of around 180 to 190 MB. Accounting for memory used by the rest of the system, 256 Mi may simply not be enough.

Reproduction script

It doesn't really matter, because this is a Kubernetes configuration problem.

But we are basically submitting a simple hello world for testing:

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))

Anything else

How often does the problem occur?

Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:

  • On our production cluster, the failure rate is fortunately close to 0%.
  • On our CI kind cluster, it fails roughly 90% of the time.
  • On some developer machines, it fails nearly 100% of the time.

We do not yet understand why the different environments have such different failure rates.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
bluenote10 added the bug and triage labels on Jan 13, 2025
bluenote10 linked a pull request (#2736) on Jan 13, 2025 that will close this issue
rueian (Contributor) commented Jan 13, 2025

Thank you, @bluenote10. I think your PR works, but we can probably do better by copying the resource requests and limits from the Ray container. Would you like to explore this idea?

rueian (Contributor) commented Jan 13, 2025

cc @kevin85421

bluenote10 (Author) commented

> we can probably do better by copying the resource requests and limits from the Ray container

I was wondering about that as well, but concluded that the memory requirements of the wait-gcs-ready init container are quite different from those of the regular container, right? Essentially, when the main container is set to a large multi-GB limit, the init container would unnecessarily be assigned much more memory than it really needs. It seems to make sense to decouple the requirements of the init container from those of the main container, if I understand it correctly.

rueian (Contributor) commented Jan 13, 2025

> Essentially, when the main container is set to a large multi-GB limit, the init container would unnecessarily be assigned much more memory than it really needs.

That's correct. The wait-gcs-ready init container is definitely lighter than the actual Ray container. But as far as I know, according to https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resource-sharing-within-containers, it is safe to copy the resource requests and limits from the Ray container to the init container, because doing so won't change the pod's effective requests/limits.
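
To illustrate this suggestion, a rough Go sketch of what copying the Ray container's resources onto the init container could look like (hypothetical function name, not actual KubeRay code):

package main

import corev1 "k8s.io/api/core/v1"

// Sketch of the proposed approach: reuse the Ray container's resource
// requests and limits for the wait-gcs-ready init container. Per the
// Kubernetes docs linked above, the pod's effective request/limit is the
// higher of (a) the sum of all app containers and (b) the largest init
// container, so copying the Ray container's resources onto the init
// container does not change the pod's effective requests or limits.
func buildWaitGCSReadyInitContainer(rayContainer corev1.Container) corev1.Container {
	return corev1.Container{
		Name:      "wait-gcs-ready",
		Image:     rayContainer.Image,
		Resources: *rayContainer.Resources.DeepCopy(),
	}
}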

kevin85421 (Member) commented

Hmm, health-check just sends a gRPC request to GCS IIRC. If it really uses that much memory, that would be a bug in Ray.

kevin85421 (Member) commented

@bluenote10 which Ray version do you use and what's your K8s env (e.g. EKS? GKE? K8s version?)?

kevin85421 added the stability label and removed the triage label on Jan 15, 2025
bluenote10 (Author) commented

@kevin85421 We are experiencing this mainly using kind on local developer machines and CI runners. The Kubernetes version is 1.30.6.
