Fix closing sessions #6114
base: main
Conversation
if sid in self._detached_conversations:
    conversation, _ = self._detached_conversations.pop(sid)
    self._active_conversations[sid] = (conversation, 1)
    logger.info(f'Reusing detached conversation {sid}')
    return conversation
why did we lose this?
I guess we just leave _attached_conversations until the whole thing closes? That seems reasonable actually...
The concept of stored detached conversations was replaced with a general concept of session staleness. A session is considered stale and subject to close if:
- It does not have any connections to it, AND
- It has not had an update within the close_delay (now 15 seconds by default).
Note: I think there may actually have been a bug here before my changes, where the stale check was initialized along with the runloop and was not always being hit.
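For illustration only, a minimal sketch of the staleness rule described above; the parameter names (connection_count, last_updated_at) are assumptions for this sketch, not the actual OpenHands attributes:

```python
import time

def is_stale(connection_count: int, last_updated_at: float, close_delay: float) -> bool:
    # Stale = no connections AND no update within the last close_delay seconds.
    return connection_count == 0 and (time.time() - last_updated_at) > close_delay
```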
I assume you mean 15 minutes? 😅 15 seconds seems unbelievably low, just a quick tab away
Correct. It is 15 minutes. (I actually changed this from 15 seconds to 15 minutes on Monday: close_delay: int = 900)
openhands/server/session/manager.py
Outdated
sids = {sid for sid, _ in items}
return sids

async def get_running_agent_loops_in_cluster(
- async def get_running_agent_loops_in_cluster(
+ async def get_running_agent_loops_remotely(
this seems like maybe a better name?
Done!
openhands/server/session/manager.py
Outdated
logger.info(
    f'Attached conversations: {len(self._active_conversations)}'
)
logger.info(
    f'Detached conversations: {len(self._detached_conversations)}'
)
why remove?
async def _cleanup_session_later(self, sid: str):
    # Once there have been no connections to a session for a reasonable period, we close it
    try:
        await asyncio.sleep(self.config.sandbox.close_delay)
We need to remove this config right?
AFAIK, there are OSS users that use this value - they have a use case where they want a session to persist for 8 hours while there is no connection to it. (As opposed to the 15 seconds we have by default)
Yep, we have been using a long N-hour close_delay to keep our workspaces running even after every browser closes. With this new PR, is there a better way to achieve the same effect?
@diwu-sf - The settings you currently use should be fine - but you may get away with a shorter delay, because the new behavior is that a conversation will be stopped only if all three of the following are true:
- It has not been updated in close_delay seconds.
- There are no connections to it.
- The agent is not in a running state. (This one is new!)
Now that I think about it, one thing that may affect you is that we have introduced a limit of 3 concurrent conversations per user. (So if you already have 3 running and start another, it will kill one of the old ones regardless of the 3 criteria above - this is designed to stop the system from crashing due to users trying to start too many concurrent docker containers.) If this will affect you, we can introduce a config setting for this too.
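To make that limit concrete, here is a rough sketch of evicting the least recently updated conversation once a user exceeds the per-user cap; the names (MAX_CONCURRENT_CONVERSATIONS, _close_session, the running mapping) are assumptions for illustration, not the real manager code:

```python
import asyncio
import time

MAX_CONCURRENT_CONVERSATIONS = 3  # per-user limit mentioned above

async def _close_session(sid: str) -> None:
    # Hypothetical stand-in for the real session shutdown.
    print(f'closing {sid}')

async def start_conversation(user_id: str, sid: str, running: dict) -> None:
    # running maps user_id -> {sid: last_updated_at}.
    user_convos = running.setdefault(user_id, {})
    if len(user_convos) >= MAX_CONCURRENT_CONVERSATIONS:
        # At the cap: close the least recently updated conversation,
        # regardless of the three staleness criteria above.
        oldest_sid = min(user_convos, key=user_convos.get)
        await _close_session(oldest_sid)
        del user_convos[oldest_sid]
    user_convos[sid] = time.time()

# Example: starting a fourth conversation evicts the oldest of the first three.
asyncio.run(start_conversation('user-1', 'sid-4',
                               {'user-1': {'sid-1': 1.0, 'sid-2': 2.0, 'sid-3': 3.0}}))
```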
…nds into fix-closing-sessions
I tested this:
It failed because the container couldn't be started. Some remarks:
Logs:
@kripper - I think your issue may actually be unrelated (and sounds like a config issue), so I'm gonna respond on the ticket you opened
I can reproduce consistently: It happens only the first time I execute "make run" after rebooting the box. Maybe something in
Looks like there are leaky file descriptors somewhere in here :/
If you run while true; do lsof -p $(pgrep -f "tracker_fd") | wc -l; sleep 1; done in a terminal while using the app, you can see the number go up continuously. (I suggest setting close_delay to something small to see this.)
Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive
await wait_all(self._close_session(sid) for sid in sid_to_close)
AFAICT this bails out if one of the _close_session calls errors. Maybe we need to handle errors inside of _close_session?
wait_all calls all the given coroutines and gathers any exceptions from them, which are then rethrown after all have resolved. (So there should be no need to handle exceptions separately.)
I suppose the one place in _close_session that could benefit from having an additional try / except is the redis code where we publish the session_closing event.
Added, as I suppose it can't hurt!
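For reference, a rough sketch of the wait_all semantics described here (run everything to completion, then rethrow any collected error); this is an approximation, not the actual OpenHands utility:

```python
import asyncio

async def wait_all(coros):
    # Run every coroutine to completion, even if some of them fail.
    results = await asyncio.gather(*coros, return_exceptions=True)
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        # Rethrow only after all coroutines have resolved
        # (a real implementation might aggregate the errors instead).
        raise errors[0]
    return results
```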
I confirm this PR works fine and that #6148 (comment) is unrelated.
…nds into fix-closing-sessions
force-pushed from 0463fe3 to b77c8be
I've reduced the default here to 15 seconds
@@ -163,9 +163,6 @@ async def close(self) -> None:
                filter_hidden=True,
            )
        )
-
-        # unsubscribe from the event stream
-        self.event_stream.unsubscribe(EventStreamSubscriber.AGENT_CONTROLLER, self.id)
why was this problematic? The controller subscribes in its init and unsubscribes in its close; it seemed to make sense
This resulted in a double unsubscribe. Closing the stream also unsubscribes all (and is done before this):
OpenHands/openhands/events/stream.py, line 109 in fcfbcb6:
def close(self):
So we would get a constant message in the logs: Callback not found during unsubscribe
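A simplified, hypothetical sketch of the ordering problem being described: closing the stream already unsubscribes every callback, so the controller's later explicit unsubscribe only finds nothing and logs the warning. This is not the real EventStream API, just an illustration of the interaction:

```python
class EventStream:
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, key, callback):
        self._subscribers[key] = callback

    def unsubscribe(self, key):
        if key not in self._subscribers:
            print(f'Callback not found during unsubscribe: {key}')
            return
        del self._subscribers[key]

    def close(self):
        # Closing the stream drops every remaining subscriber first...
        for key in list(self._subscribers):
            self.unsubscribe(key)


stream = EventStream()
stream.subscribe('AGENT_CONTROLLER', lambda event: None)
stream.close()
# ...so a second unsubscribe from the controller's own close() just logs the warning.
stream.unsubscribe('AGENT_CONTROLLER')
```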
But stream.close() appears to be called only from agent_session, so only when running with a UI. What about running evals or other external scripts via main.py or other CLI clients?
End-user friendly description of the problem this fixes or functionality that this introduces
This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.
Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.
Summary of Changes
Acceptance Criteria for Multi-conversation Runtime Management
Recovery
Conversation Limits
Testing Instructions
To run this PR locally, use the following command: