
Fix closing sessions #6114

Open · tofarr wants to merge 57 commits into main from fix-closing-sessions
Conversation

tofarr (Collaborator) commented Jan 7, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.


Summary of Changes

  • Added user_id tracking to sessions for better user-specific resource management
  • Implemented proper closing of stale sessions to prevent resource leaks
  • Added "agent stopped" event emission for better frontend state management
  • Enhanced recovery mechanism to preserve workspace/files after disconnection
  • Added smart session management for handling multiple conversations

Acceptance Criteria for Multi-conversation Runtime Management

Recovery

  • Start a conversation
  • Disconnect
  • Restart the server
  • Verify workspace/files are preserved

Conversation Limits

  • Start 4 conversations in different tabs
  • First conversation goes to "agent stopped"
  • Sending a new message starts it back up, and another conversation goes to "agent stopped"
  • Verify workspace is totally recovered

Testing Instructions

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2a0de2-nikolaik \
  --name openhands-app-b2a0de2 \
  docker.all-hands.dev/all-hands-ai/openhands:b2a0de2

tofarr marked this pull request as ready for review on January 7, 2025 19:51
Comment on lines -192 to -195

if sid in self._detached_conversations:
    conversation, _ = self._detached_conversations.pop(sid)
    self._active_conversations[sid] = (conversation, 1)
    logger.info(f'Reusing detached conversation {sid}')
    return conversation

Collaborator

why did we lose this?
Collaborator

I guess we just leave _attached_conversations until the whole thing closes? That seems reasonable actually...

tofarr (Collaborator, Author) commented Jan 7, 2025

The concept of stored detached conversations was replaced with a general concept of session staleness. A session is considered stale and subject to close if...

  • It does not have any connections to it.
    AND...
  • It has not had an update within the close_delay (now 15 seconds by default).

Note: I think there may actually have been a bug here before my changes where the stale check was initialized along with the runloop and was not always being hit.
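
For illustration, a minimal sketch of that staleness check (hypothetical names; not the actual conversation-manager code):

import time

def is_stale(connection_count: int, last_update_ts: float, close_delay: float) -> bool:
    # A session is stale (and may be closed) only when BOTH hold:
    #   1. nothing is connected to it, AND
    #   2. it has not been updated within close_delay seconds.
    no_connections = connection_count == 0
    idle_too_long = (time.time() - last_update_ts) > close_delay
    return no_connections and idle_too_long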

Collaborator

I assume you mean 15 minutes? 😅 15 seconds seems unbelievably low, just a quick tab away

Collaborator Author

Correct. It is 15 minutes. (I actually changed this from 15 seconds to 15 minutes on Monday.)

        sids = {sid for sid, _ in items}
        return sids

    async def get_running_agent_loops_in_cluster(
Collaborator

Suggested change:
-    async def get_running_agent_loops_in_cluster(
+    async def get_running_agent_loops_remotely(

this seems like maybe a better name?

Collaborator Author

Done!

openhands/server/session/session.py (resolved review thread)
Comment on lines 238 to 242

logger.info(
    f'Attached conversations: {len(self._active_conversations)}'
)
logger.info(
    f'Detached conversations: {len(self._detached_conversations)}'
)

Collaborator

why remove?


    async def _cleanup_session_later(self, sid: str):
        # Once there have been no connections to a session for a reasonable period, we close it
        try:
            await asyncio.sleep(self.config.sandbox.close_delay)

Collaborator

We need to remove this config right?

tofarr (Collaborator, Author) commented Jan 7, 2025

AFAIK, there are OSS users that use this value - they have a use case where they want a session to persist for 8 hours while there is no connection to it. (As opposed to the 15 seconds we have by default.)

Contributor

Yep, we have been using a long N-hour close_delay to keep our workspaces running even after every browser closes.

With this new PR, is there a better way to achieve the same effect?

tofarr (Collaborator, Author) commented Jan 9, 2025

@diwu-sf - The settings you currently use should be fine - but you may get away with a shorter delay, because the new behavior is that a conversation will be stopped only if all three of the following are true:

  1. It has not been updated in close_delay seconds.
  2. There are no connections to it.
  3. The agent is not in a running state. (This one is new!)

Now that I think about it, one thing that may affect you is that we have introduced a limit of 3 concurrent conversations per user. (So if you already have 3 running and start another, it will kill one of the old ones regardless of the 3 criteria above - this is designed to stop the system crashing due to users trying to start too many concurrent Docker containers.) If this will affect you, we can introduce a config setting for this too.
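
As a rough illustration of that per-user limit (hypothetical names and data structures; not the actual OpenHands code):

MAX_CONCURRENT_CONVERSATIONS = 3  # per-user limit described above

def register_conversation(user_sessions: list[str], new_sid: str, close_session) -> None:
    # user_sessions holds this user's running conversation ids, oldest first (assumed).
    if len(user_sessions) >= MAX_CONCURRENT_CONVERSATIONS:
        # Over the limit: stop the oldest conversation regardless of the
        # three criteria above.
        close_session(user_sessions.pop(0))
    user_sessions.append(new_sid)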

kripper commented Jan 8, 2025

I tested this:

  • I created a conversation/sandbox (worked fine).
  • I restarted the OH server.
  • I joined the conversation URL.

It failed because the container couldn't be started.

Some remarks:

Logs:

18:55:55 - openhands:INFO: docker_runtime.py:147 - [runtime e770430539174979bf2296e8c6d3fde5] Waiting for client to become ready at http://localhost:0...
18:55:55 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
Traceback (most recent call last):
  File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
    await self.runtime.connect()
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
    await call_sync_from_async(self._wait_until_alive)
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
    raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
18:55:55 - openhands:WARNING: agent_session.py:295 - State could not be restored: [Errno 2] No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO:     127.0.0.1:39866 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/vscode-url HTTP/1.1" 404 Not Found
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO:     127.0.0.1:39854 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/list-files HTTP/1.1" 404 Not Found

tofarr (Collaborator, Author) commented Jan 8, 2025

@kripper - I think your issue may actually be unrelated (and it sounds like a config issue), so I'm going to respond on the ticket you opened.

kripper commented Jan 8, 2025

I can reproduce consistently:

It happens only the first time I execute "make run" after rebooting the box.
When I execute "make run" the second time it works fine (and so on).

Maybe something in /tmp?
/tmp is erased after reboot.

rbren (Collaborator) left a comment

Looks like there are leaky file descriptors somewhere in here :/

If you run while true; do lsof -p $(pgrep -f "tracker_fd") | wc -l; sleep 1; done in a terminal while using the app, you can see the number go up continuously. (I suggest setting close_delay to something small to see this)

rbren (Collaborator) commented Jan 8, 2025

Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive


await wait_all(self._close_session(sid) for sid in sid_to_close)

Collaborator

AFAICT this bails out if one of the _close_session calls errors. Maybe we need to handle errors inside of _close_session?

Collaborator Author

wait_all calls all the given coroutines and gathers any exceptions from them, which are then rethrown after all have resolved. (So there should be no need to handle exceptions separately.)

I suppose the one place in _close_session that could benefit from an additional try / except is the Redis code where we publish the session_closing event.

Added, as I suppose it can't hurt!
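
For reference, a minimal sketch of the wait_all behavior described above (an approximation; the real helper may differ):

import asyncio

async def wait_all(coros):
    # Run every coroutine to completion, collecting exceptions instead of
    # failing fast, then rethrow the first gathered exception (if any).
    results = await asyncio.gather(*coros, return_exceptions=True)
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        raise errors[0]
    return results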

kripper commented Jan 8, 2025

I confirm this PR works fine and that #6148 (comment) is unrelated.

tofarr force-pushed the fix-closing-sessions branch from 0463fe3 to b77c8be on January 9, 2025 21:21
tofarr (Collaborator, Author) commented Jan 9, 2025

> Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive

I've reduced the default here to 15 seconds.

@@ -163,9 +163,6 @@ async def close(self) -> None:
filter_hidden=True,
)
)

# unsubscribe from the event stream
self.event_stream.unsubscribe(EventStreamSubscriber.AGENT_CONTROLLER, self.id)
Collaborator

why was this problematic? The controller subscribes in its init and unsubscribes in its close; it seemed to make sense

tofarr (Collaborator, Author) commented Jan 10, 2025

This resulted in a double unsubscribe. Closing the stream also unsubscribes all (and is done before this):

def close(self):
    ...

So we would get a constant message in the logs: Callback not found during unsubscribe
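
A minimal sketch of the double-unsubscribe described above (a hypothetical simplification; not the actual EventStream code):

class EventStream:
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, key, callback):
        self._subscribers[key] = callback

    def unsubscribe(self, key):
        if key not in self._subscribers:
            print(f'Callback not found during unsubscribe: {key}')
            return
        del self._subscribers[key]

    def close(self):
        # Closing the stream already drops every subscriber...
        self._subscribers.clear()

# ...so a controller that calls unsubscribe() again in its own close()
# hits the "Callback not found" branch on every shutdown.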

Collaborator

But stream.close() appears to be called only from agent_session, i.e. when running with a UI. What about running evals or other external scripts via main.py or other CLI clients?

tofarr requested a review from rbren on January 10, 2025 15:21