[remote runtime] poll runtime info to wait until alive instead of using long timeout #4334
openhands/runtime/remote/runtime.py
logger.info(
    f'Waiting for runtime pod to be active. Current status: {pod_status}'
)
if pod_status == 'Running':
I think there's a chance it could be Running but not Ready; in fact, I'm pretty sure that's the case. We should maybe add Ready as a new pod_status, and then we wouldn't need to check /alive (since Ready waits for /alive).
Yep, this is exactly why I kept the /alive call! I think having a Ready status (using the same probe we had in the cluster) would be ideal.
The runtime API side now returns a Ready pod status when the readiness probe succeeds, so this PR should be ready to merge!
Co-authored-by: Robert Brennan <[email protected]>
Co-authored-by: Xingyao Wang <[email protected]>
🍰 Good to get this in
End-user friendly description of the problem this fixes or functionality that this introduces
Improve the reliability of Remote Runtime.
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR uses the /runtime/<runtime_id> endpoint to poll runtime pod info, and starts action execution once the pod is in the 'Running' state. It raises an error if the pod fails and/or is still in a "Not Found" state after a fixed number of retries. Need to wait until #4325 is merged.
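For reviewers, here is a rough sketch of the polling idea, not the code in this PR. It assumes the status lookup is wrapped in a callable (the hypothetical `get_pod_status`, which in the real runtime would query the /runtime/<runtime_id> endpoint), that 'Ready' is the status to wait for as discussed in the review thread, that a 'Failed' status should abort immediately, and that 'Not Found' is retried until the retry budget runs out. The function name, parameter names, and default values are all made up for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    get_pod_status: Callable[[], str],  # hypothetical: returns the current pod status string
    max_retries: int = 60,              # assumed retry budget, not the PR's actual value
    delay_seconds: float = 1.0,         # assumed pause between polls
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the pod status until it is 'Ready'; return the number of polls used."""
    last_status = 'Unknown'
    for attempt in range(1, max_retries + 1):
        last_status = get_pod_status()
        if last_status == 'Ready':
            return attempt
        if last_status == 'Failed':
            # Assumption: a failed pod never recovers, so stop polling right away.
            raise RuntimeError('Runtime pod failed while starting')
        # Any other status ('Pending', 'Running', 'Not Found', ...) is retried.
        sleep(delay_seconds)
    raise TimeoutError(
        f'Runtime pod not ready after {max_retries} polls. Last status: {last_status}'
    )
```

Short polls with a bounded retry count replace one long request timeout, so a pod that fails or never appears is reported quickly. Passing `sleep` in as a parameter is only there so the loop can be exercised in tests without real waiting.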
Link of any specific issues this addresses