[Bug] `wait-gcs-ready` init-container going out-of-memory indefinitely (OOMKilled) #2735
Comments
- Thank you, @bluenote10. I think your PR works, but we can probably do better by copying the resource requests and limits from the Ray container. Would you like to explore this idea?
- cc @kevin85421
- I was wondering about that as well, but concluded that the memory requirements of that […]
- That's correct. The […]
- Hmm, […]
- @bluenote10 which Ray version do you use, and what's your K8s env (e.g. EKS? GKE? K8s version)?
- @kevin85421 We are experiencing this mainly using […]
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
We are unable to use Ray on Kubernetes, because our workers are crashing with out-of-memory errors in the `wait-gcs-ready` init-container. This results in an infinite backoff loop trying to re-run the init-container, but it seems like it will never succeed, and therefore no workers are available. A `kubectl describe ourclustername-cpu-group-worker-2sbdj`, for instance, reveals the OOMKilled init-container.

Note that the upper memory limit of 256 Mi is rather low, and it seems to come from here:
kuberay/ray-operator/controllers/ray/common/pod.go, line 222 (commit 9068102)
Our assumption is that the pod goes out-of-memory in this line of the script, which tries to invoke the `ray` CLI:

kuberay/ray-operator/controllers/ray/common/pod.go, line 192 (commit 9068102)
To get a rough estimate of the memory usage of that call, one can measure the peak resident set size of the CLI invocation (see the sketch below), which reveals around 180 to 190 MB. Accounting for memory usage from the rest of the system, 256 Mi may simply not be enough.
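A minimal sketch of such a measurement (the concrete `ray health-check` command and the address used below are assumptions about what the init-container actually runs; the real command is whatever the pod.go line referenced above generates):

```python
import resource
import subprocess

# Run the CLI once. It is expected to fail here because no GCS is reachable;
# the point is only to see how much memory the Python-based CLI needs.
subprocess.run(
    ["ray", "health-check", "--address", "127.0.0.1:6379"],
    check=False,
)

# ru_maxrss is reported in kilobytes on Linux.
peak_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of the ray CLI call: {peak_rss_kb / 1024:.0f} MiB")
```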
Reproduction script
It doesn't really matter, because it is a Kubernetes configuration problem. But we are basically submitting a simple hello world for testing, along the lines of the sketch below.
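An illustrative sketch of such a job (the exact script is not essential):

```python
import ray

# Connect to the cluster; the address is picked up from the environment when
# the script is submitted as a Ray job.
ray.init()

@ray.remote
def hello() -> str:
    return "hello world"

print(ray.get(hello.remote()))
```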
Anything else
How often does the problem occur?
Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:
We do not yet understand why the different environments have such different failure rates.
Are you willing to submit a PR?