[Bug][RayJob] Short-term failure to receive logs should not trigger a failure #2788

Open
1 of 2 tasks
Moonquakes opened this issue Jan 21, 2025 · 0 comments
Labels
bug (Something isn't working), triage

Comments

@Moonquakes

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Currently, if the submitter pod cannot fetch the log information from the Ray cluster through the GET API, it fails immediately. This may not match what users expect: a short-lived failure to fetch logs should be retried continuously rather than treated as a failure.
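
To make the expected behavior concrete, here is a minimal sketch of retrying a log request with exponential backoff. It is not KubeRay or Ray code: `fetch_logs_with_retry`, `TransientLogError`, and `make_flaky_fetch` are hypothetical names, and the attempt count and delays are placeholder values chosen for illustration.

```python
import time
from typing import Callable, Optional


class TransientLogError(Exception):
    """Hypothetical error type standing in for a failed log request."""


def fetch_logs_with_retry(
    fetch: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds or max_attempts calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[TransientLogError] = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientLogError as err:
            last_error = err
            if attempt == max_attempts - 1:
                break  # no retries left, give up
            # Exponential backoff, capped at max_delay seconds.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    assert last_error is not None
    raise last_error


def make_flaky_fetch(failures: int) -> Callable[[], str]:
    """Build a fake fetch that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientLogError("simulated network failure")
        return "job logs"

    return fetch
```

With this shape, two short network blips followed by a successful request return the logs instead of failing the submitter; only a request that keeps failing for every attempt surfaces the error.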

I am also curious why KubeRay still builds the address by concatenating a domain name instead of accessing the cluster directly by IP address. Using the IP address should make the network path more stable, since it removes the DNS lookup step.
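
The two addressing styles being compared can be sketched as follows. This is an illustration only: the function names are hypothetical, the `<service>.<namespace>.svc.<cluster domain>` pattern is the standard Kubernetes Service DNS form and is assumed (not confirmed here) to be what KubeRay concatenates, and 8265 is the default Ray dashboard port.

```python
import ipaddress


def head_service_url(
    service_name: str,
    namespace: str,
    port: int = 8265,
    cluster_domain: str = "cluster.local",
) -> str:
    """Build a URL from a Kubernetes Service DNS name (needs a DNS lookup)."""
    return f"http://{service_name}.{namespace}.svc.{cluster_domain}:{port}"


def head_ip_url(ip: str, port: int = 8265) -> str:
    """Build a URL from a literal IP address (no DNS lookup)."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on a malformed IP
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}:{port}"
```

One trade-off to weigh: a pod IP changes when the pod is recreated, whereas a Service name keeps resolving to the current backend, so the IP form skips DNS at the cost of having to refresh the address.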

Reproduction script

Submit a RayJob, then let the GET request for logs fail because of a network problem.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Moonquakes added the bug and triage labels on Jan 21, 2025