[remote runtime] poll runtime info to wait until alive instead of using long timeout #4334

xingyaoww · 2024-10-11T05:05:24Z

End-user friendly description of the problem this fixes or functionality that this introduces

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improve the reliability of Remote Runtime.

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR uses the /runtime/<runtime_id> endpoint to poll runtime pod info and starts action execution once the pod is in the 'Running' state. It raises an error if the pod fails or is still in a "Not Found" state after a fixed number of retries.
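A minimal sketch of the polling behaviour described above, not the code in openhands/runtime/remote/runtime.py. The function name, parameters, retry limits, and the 'Failed' status string are assumptions for illustration; only 'Running' and "Not Found" appear in this description. The status lookup is passed in as a callable so the sketch does not guess at the shape of the /runtime/<runtime_id> response.

```python
import time
from typing import Callable


def wait_until_running(
    get_pod_status: Callable[[], str],  # hypothetical: wraps the /runtime/<runtime_id> call
    max_not_found_retries: int = 5,     # assumed limit, not taken from the PR
    max_polls: int = 60,                # assumed overall cap on polling attempts
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the pod status until it is 'Running', failing fast on bad states."""
    not_found = 0
    for _ in range(max_polls):
        status = get_pod_status()
        if status == 'Running':
            return
        if status == 'Failed':  # assumed status string for a failed pod
            raise RuntimeError('Runtime pod failed to start')
        if status == 'Not Found':
            # Tolerate a bounded number of "Not Found" responses, then give up.
            not_found += 1
            if not_found > max_not_found_retries:
                raise RuntimeError('Runtime pod not found after retries')
        sleep(poll_interval)
    raise TimeoutError('Runtime pod did not reach Running in time')
```

Compared with a single long timeout, a loop like this returns as soon as the pod is up and reports a failed or missing pod after a bounded number of polls.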

This needs to wait until #4325 is merged.

Link of any specific issues this addresses

rbren · 2024-10-11T14:37:36Z

openhands/runtime/remote/runtime.py

+            logger.info(
+                f'Waiting for runtime pod to be active. Current status: {pod_status}'
+            )
+            if pod_status == 'Running':


I think there's a chance it could be Running but not Ready--in fact I'm pretty sure that's the case

We should maybe add Ready as a new pod_status; then we wouldn't need to check /alive (since Ready waits for /alive)

Yep, this is exactly the reason why I kept the /alive call! I think having a Ready status (using the same probe we had in the cluster) would be ideal.

The runtime API side now returns a Ready pod status when the readiness probe returns true, so this PR should be ready to merge!
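To illustrate the distinction discussed in this thread, here is a small sketch of how a client might map a reported pod status to a next step once Ready exists. The `Decision` enum, the `decide` function, and the 'Failed' status string are hypothetical; the thread only names Running, Ready, and /alive.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = 'proceed'  # safe to start executing actions
    WAIT = 'wait'        # keep polling
    FAIL = 'fail'        # stop and raise an error


def decide(pod_status: str) -> Decision:
    """Map a reported pod status to the client's next step (illustrative only)."""
    if pod_status == 'Ready':
        # The readiness probe passed, so a separate /alive check is not needed.
        return Decision.PROCEED
    if pod_status == 'Failed':  # assumed status string
        return Decision.FAIL
    # 'Running' means the pod has started but may not be serving yet.
    return Decision.WAIT
```

Under this mapping a pod that is Running but not yet Ready keeps the client waiting, which is the case raised in the review comment above.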

tofarr

🍰 Good to get this in

enyst and others added 5 commits October 10, 2024 19:19

Revert "chore(deps): bump protobuf from 4.25.5 to 5.28.2 (#4214)"

5bfff6c

This reverts commit e986e78.

update

9562c05

try to pin opentelemetry

41141c9

poll runtime info to _wait_until_alive

2efa42e

add timeout for not found state

40d88c8

xingyaoww requested review from rbren and tofarr October 11, 2024 05:05

sleep for not found

7a2d5e8

rbren reviewed Oct 11, 2024

rbren reviewed Oct 11, 2024

xingyaoww and others added 3 commits October 11, 2024 22:42

Update openhands/runtime/remote/runtime.py

a430200

Co-authored-by: Robert Brennan <[email protected]>

Merge commit '2692c0c8fd98bd8ff1bb88441e749dfbd53437e1' into xw/remot…

86c6139

…e-runtime-alive

check for Ready instead of Running

77fa2cb

xingyaoww mentioned this pull request Oct 13, 2024

[Bug]: Descriptors cannot be created directly. #4356

Closed

1 task

xingyaoww and others added 7 commits October 13, 2024 06:45

pin version of opentelemetry to fix #4356

46121cf

Update pyproject.toml

ce58c55

Co-authored-by: Xingyao Wang <[email protected]>

Merge branch 'main' into enyst/protobuf

3eea3f7

poetry lock

c9ab4d7

poetry lock

3e4a4eb

Revert "pin version of opentelemetry to fix #4356"

90f43a1

This reverts commit 46121cf.

Merge commit '3e4a4eb8a3715a624c8b6b2c6911e0b96833ecb4' into xw/remot…

86af6ab

…e-runtime-alive

xingyaoww marked this pull request as ready for review October 13, 2024 17:24

tofarr approved these changes Oct 13, 2024

Merge branch 'main' into xw/remote-runtime-alive

f7912dc

xingyaoww enabled auto-merge (squash) October 13, 2024 19:54

xingyaoww merged commit 343cc87 into main Oct 13, 2024
14 checks passed

xingyaoww deleted the xw/remote-runtime-alive branch October 13, 2024 20:38
