-
-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update backpressure implementation with runtime configurable threshold #4151
base: main
Are you sure you want to change the base?
Conversation
NOTE: i've only done some limited testing and everything seems to work correctly (i.e. no deadlocks) almost all of the time (i have one last edge case bug that i'm still tracking down) but more extensive testing is required to confirm that i didn't miss something. Additionally, i have not run any performance tests as i don't really have a good environment to run those in. Aside from that, this is ready for folks to give some feedback on as i don't expect things to change drastically to resolve that last issue i'm tracking down. |
I'm probably not going to have a chance to perf test this for a couple weeks. Sorry for the slow pace. |
@dipinhora when you have time, can you run the overload program with the existing implementation and give an idea of where memory generally stabilizes across several local runs and then do the same with the new implementation with a few different settings including the one that should mimic the current implementation? I would do it but I'm already not going to have a chance to perf test unless the weekend after the 4th. |
@SeanTAllen no problem... the latest incarnation of this that i just force pushed should have very similar performance characteristics as the current backpressure logic since it reuses almost all of it. |
@dipinhora can you give me different settings that you think would be important to use across however many different variations I test? |
Also, this PR not longer a draft and is ready for a more thorough review. |
@SeanTAllen i did the following to confirm things are working as expected:
The behavior with i don't really have explicit suggestions in terms of number of variations to test but i would recommend using |
Yes, i'll run the overload/fan-in/message-bench/etc in a few different ways (i've already done so with |
|
|
@SeanTAllen i've added comparison runs for |
I can see this is going to take a while to review. |
this needs a rebase btw |
@dipinhora do you think this is a PR where we could find time next week to go over synchronously? |
rebased onto latest
@SeanTAllen i can usually make time on most weekday evenings after 6 pm EST if that works for you (and also between 8 am EST and 9:30 am EST if you prefer morning time).. Weekends are sometimes a bit more flexible if that's better for you. Let me know what day/time would work best for you. |
i don't believe the CI failures are related to the changes in this PR. i'm going to retry them to see if they pass.
|
Not sure what's causing the i've tried setting up a windows environment but haven't been able to reproduce the issue. any ideas? |
Windows tests hanging would be new. What are you seeing in the CI specifically? |
The windows CI issues are showing the following symptoms:
The same issue occurred on this PR (https://github.com/ponylang/ponyc/runs/7102036651) and on the Also, i set up a windows dev environment and i am unable to reproduce this hanging/time ouot issue. |
I apologize for work being very busy right now and not having had time to get to this. |
this appears to be related to something going haywire with the WriteBuf test, no? @kulibali is this a new one? I don't remember seeing it any time recently and if I did in the past, I don't remember.
|
Not always, but yes, the WriteBuf test also tends to time out/fail often when the hang occurs. |
It happens. Ping me whenever you have time. |
At @dipinhora so... I'm looking at what is going on with this test. And it is backpressure adjacent. It calls "backpressure.apply" when the pipe it is writing to isn't available to take all the data. See: https://github.com/ponylang/ponyc/blob/main/packages/process/process_monitor.pony#L422 This is a very very weird looking one. It looks like this has been failing since at least a week ago, that is the oldest one I can see in the history on CirrusCI. |
Yes, i saw that but the same thing works fine on all the other OS's and on my local windows environment. Yes, the first failure was from two weeks ago when the latest commit was pushed (i retried a few time since then). Very annoying that it can't be reproduced locally. |
So I just did a PR against master and the Windows stuff passed just fine. It makes me think the issues you are hitting might be related to the changes. Somehow. |
Prior to this commit, the backpressure implementation would unschedule muted actors immediately after they sent a single message to a muted/overloaded/under pressure actor. While this works reasonably well, it does have some rough edges as noted in ponylang#3382 and ponylang#2980. This commit updates the backpressure implementation that can have its threshold for when to unschedule an actor sending to muted/overloaded/under pressure actors be dynamically controlled via a runtime cli argument. It does this by separating an actor being muted from when an actor is unscheduled where an actor is still muted immediately upon sending a message to a muted/overloaded/under pressure actor but the threshold controls when an actor is unscheduled after it has been muted resulting in a more gradual backpressure system. This threshold based backpressure should result in less stalls and more overall forward progress due to allowing muted actors to make some slow progress prior to being uncheduled. The updated backpressure logic is as follows (at a high level): * If sending a message to a muted/overloaded/under pressure actor, note that the sending actor should mute. * At the end of the behavior, mark the actor as `muted` if it is not already `muted` and save the receiver actor info. * At the end of the behavior, check to see if the actor is over the threshold. * If yes, unschedule actor and mark it as `muted and unscheduled` and finally add the sender and all muters to the scheduler mutemap (as used to happen previously after 1 message) * If no, stop processing messages but don't unschedule the actor * When an actor is no longer overloaded or under pressure, it will tell the schedulers to unmute all senders that were muted and unscheduled as a result of sending messages to it (same as previously) * When an actor runs it checks if it still needs to stay `muted` or not and if not, it tells the schedulers to unmute all senders that were muted and unscheduled as a result of sending messages to it * The actor also optimistically unmutes itself it successfully processes a full batch or all messages in its queue and tells the schedulers to unmute all senders that were muted and unscheduled as a result of sending messages to it The updated logic is functionally equivalent to the old backpressure implentation when the threshold is `1` message. There is now a new command line argument to control the threshold for throttling/muting called `--ponymsgstilmute`. This option is specifically added for power users who want to have more control over how the runtime handles backpressure. Using a value of `1` for `--ponymsgstilmute` will result in actors getting muted after sending a single message to a muted/overloaded/under pressure actor and be functionally equivalent to the previous backpressure implementation that did not employ gradual throttling. The threshold defaults to `10`.
Prior to this commit, an actor with no pending messages being unmuted could end up running concurrently on two scheduler threads due to a race condition. The specific issue is that we `unmute the actor` and then we `schedule the actor`. It is entirely possible that the unmuted actor could get scheduled on a different scheduler thread if another actor sent it a message at the same time as we schedule it to run on the current scheduler thread. This commit changes the logic to check whether the actor has any pending messages or not prior to unmuting it and will only reschedule the actor if the actor actually had pending messages to process already waiting in its queue.
…ntly" This reverts commit dbfbd9a. Reverting this because it's an incomplete fix and a proper fix will require larger changes. Will revisit later.
Prior to this commit, the backpressure implementation would
unschedule muted actors immediately after they sent a single
message to a muted/overloaded/under pressure actor. While this
works reasonably well, it does have some rough edges as noted
in #3382 and
#2980.
This commit updates the backpressure implementation that can
have its threshold for when to unschedule an actor sending
to muted/overloaded/under pressure actors be dynamically
controlled via a runtime cli argument. It does this by
separating an actor being muted from when an actor is
unscheduled where an actor is still muted immediately
upon sending a message to a muted/overloaded/under pressure
actor but the threshold controls when an actor is
unscheduled after it has been muted resulting in a more
gradual backpressure system. This threshold based
backpressure should result in less stalls and more overall
forward progress due to allowing muted actors to make some
slow progress prior to being uncheduled.
The updated backpressure logic is as follows (at a high level):
note that the sending actor should mute.
muted
if itis not already
muted
and save the receiver actor info.the threshold.
muted and unscheduled
and finally add the sender and all muters to the scheduler
mutemap (as used to happen previously after 1 message)
tell the schedulers to unmute all senders that were muted and
unscheduled as a result of sending messages to it
(same as previously)
muted
or notand if not, it tells the schedulers to unmute all senders that were
muted and unscheduled as a result of sending messages to it
processes a full batch or all messages in its queue and tells the
schedulers to unmute all senders that were muted and unscheduled
as a result of sending messages to it
The updated logic is functionally equivalent to the old backpressure
implentation when the threshold is
1
message.There is now a new command line argument to control the threshold
for throttling/muting called
--ponymsgstilmute
. This option isspecifically added for power users who want to have more control
over how the runtime handles backpressure. Using a value of
1
for
--ponymsgstilmute
will result in actors getting muted aftersending a single message to a muted/overloaded/under pressure
actor and be functionally equivalent to the previous backpressure
implementation that did not employ gradual throttling. The
threshold defaults to
10
.