Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[remote runtime] poll runtime info to wait until alive instead of using long timeout #4334

Merged
merged 17 commits into from
Oct 13, 2024

Conversation

xingyaoww
Copy link
Contributor

@xingyaoww xingyaoww commented Oct 11, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improve the reliability of Remote Runtime.


Give a summary of what the PR does, explaining any non-trivial design decisions

This PR use /runtime/<runtime_id> endpoint to poll runtime pod info, and will start action execution once the pod is in 'Running' state. It will throw an error properly if the pod fails and/or is in a "Not Found" state after a fixed amount of retries.

Need to wait until #4325 is merged


Link of any specific issues this addresses

logger.info(
f'Waiting for runtime pod to be active. Current status: {pod_status}'
)
if pod_status == 'Running':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a chance it could be Running but not Ready--in fact I'm pretty sure that's the case

We should maybe put Ready as a new pod_status, and then we wouldn't need to check /alive (since Ready waits for the /alive)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is exactly the reason why i keep the /alive call! I think having a Ready status (use the same probe we had in the cluster) will be the most ideal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime API side now returns Ready pod status when readiness probe gives true - this PR should be ready to merge!

@xingyaoww xingyaoww marked this pull request as ready for review October 13, 2024 17:24
Copy link
Collaborator

@tofarr tofarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍰 Good to get this in

@xingyaoww xingyaoww enabled auto-merge (squash) October 13, 2024 19:54
@xingyaoww xingyaoww merged commit 343cc87 into main Oct 13, 2024
14 checks passed
@xingyaoww xingyaoww deleted the xw/remote-runtime-alive branch October 13, 2024 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants