High Idle CPU in DotNetty #4636
This would get rid of the spin-wait in akka.net/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs, lines 391 to 399 in a80ddd7.
I don't have the setup/knowledge to measure the perf effects of this change - can somebody test it? |
I made a branch https://github.com/Zetanova/akka.net/tree/helios-idle-cpu with a commit that changes the DedicatedThreadPool. Please, can somebody run a test and benchmark it, or explain to me how to get Akka.MultiNodeTestRunner.exe started? |
Cc @to11mtm - guess I need to move up the timetable on doing that review |
@Zetanova I’ll give your branch a try - OOF for a couple of days but I’ll get on it |
@Zetanova I'll try to run this through the paces as well in the next few days. :) |
I made a helios-io/DedicatedThreadPool fork https://github.com/Zetanova/DedicatedThreadPool/tree/try-channels The problem is that the benchmark does not count the spin waits / idle CPU.

CURRENT
--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations

WITH CHANNEL
------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ----------
--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations
------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------- |
Even if the default implementation from dotnet/runtime does not fit, increasing and decreasing the thread-workers is easily possible. Even a zero-alive thread-worker scenario could be possible. Here are a few links on the channel topic. From the pro: … Detailed blog post: … |
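To make the channel idea concrete, here is a minimal sketch (not the actual try-channels commit - the type and member names are illustrative) of a dedicated pool whose workers block on a Channel<Action> instead of spin-waiting on a semaphore:

```csharp
// Minimal sketch only - not the actual try-channels commit. An unbounded channel feeds
// a fixed set of dedicated worker threads; blocking reads park the thread instead of
// spin-waiting for new work.
using System;
using System.Threading;
using System.Threading.Channels;

public sealed class ChannelThreadPool : IDisposable
{
    private readonly Channel<Action> _work = Channel.CreateUnbounded<Action>(
        new UnboundedChannelOptions { SingleReader = false, SingleWriter = false });
    private readonly Thread[] _workers;

    public ChannelThreadPool(int threadCount)
    {
        _workers = new Thread[threadCount];
        for (var i = 0; i < threadCount; i++)
        {
            _workers[i] = new Thread(RunWorker) { IsBackground = true, Name = $"pool-{i}" };
            _workers[i].Start();
        }
    }

    public bool QueueUserWorkItem(Action work) => _work.Writer.TryWrite(work);

    private void RunWorker()
    {
        var reader = _work.Reader;
        // WaitToReadAsync completes only when work arrives or the channel is completed,
        // so an idle worker costs no CPU between items.
        while (reader.WaitToReadAsync().AsTask().GetAwaiter().GetResult())
        {
            while (reader.TryRead(out var action))
            {
                try { action(); }
                catch { /* swallow to keep the worker alive; real code should log */ }
            }
        }
    }

    public void Dispose() => _work.Writer.TryComplete();
}
```

The key point is that an idle worker is parked inside the channel's wait rather than burning CPU in a spin loop.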
@Zetanova I ran some tests against the branch in #4594 to see whether this helped/hurt. Background: On my local machine, under RemotePingPong, Streams TCP Transport gets up to 300k messages/sec if everything runs on the normal .NET Threadpool.
I think this could be on the right track. I know that |
@to11mtm thx for the test, because I didn't know how it would continue - Kestrel is using it as well. What's important is to test the idle state of a cluster under Windows and/or Linux. On my on-premise k8s cluster it does not matter that much, but on AWS or Azure it does a lot. I will try now to implement an autoscaler for the DTP. |
@Zetanova I think it's definitely on the right path. If you can auto-scale, that might help too. What I noticed in the profiler is we still have all of these threads waiting for channel reads very frequently; I'm not sure if there's a cleaner way to keep them fed... |
Sorry, one more note... I wonder whether we should peek at Orleans Schedulers for some inspiration? At [one point](https://github.com/dotnet/orleans/pull/3792/files) they were actually using a variation of our Threadpool complete with credited borrowing of UnfairSemaphore. It doesn't look like they use that anymore, so perhaps we can look at how they evolved and take some lessons. |
I'm onboard with implementing good ideas no matter where they come from. The DedicatedThreadPool abstraction was something we created back in... must have been 2013 / 2014. It's ancient. .NET has evolved a lot since then in terms of the types of scheduling primitives it allows. |
I think a major part of the issue with the DedicatedThreadPool is knowing when to grow or shrink the thread count. I suggested a few ways of doing this - one was to put a tracer round in the queue and measure how long it took to make it to the front. The other was to measure the growth in the task queue and allocate threads based on growth trends. Both of these have costs in terms of complexity and raw throughput, but the advantage is that in less busy or sporadically busy systems they're more efficient at conserving CPU utilization. |
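A rough sketch of the tracer-round idea (the names, the enqueue delegate, and the threshold are illustrative assumptions, not Akka.NET code):

```csharp
// Illustrative sketch of the "tracer round" idea. A timestamped no-op item is enqueued
// periodically; the time it spends in the queue approximates scheduling latency and
// drives the grow decision.
using System;
using System.Diagnostics;

public static class TracerRound
{
    // 'enqueue' is whatever the pool uses to schedule work; 'addThread' grows the pool.
    // Both are assumed callbacks for the purpose of this sketch.
    public static void Probe(Action<Action> enqueue, Action addThread, TimeSpan growThreshold)
    {
        var enqueuedAt = Stopwatch.GetTimestamp();
        enqueue(() =>
        {
            var waited = TimeSpan.FromSeconds(
                (Stopwatch.GetTimestamp() - enqueuedAt) / (double)Stopwatch.Frequency);

            // If the tracer sat in the queue too long, the pool is under-provisioned.
            if (waited > growThreshold)
                addThread();
        });
    }
}
```

The queue-growth variant would instead sample the queue length over time and add a thread when the trend keeps rising.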
Looks like the CLR solves this problem via a hill-climbing algorithm that continually tries to optimize the thread count: https://github.com/dotnet/runtime/blob/4dc2ee1b5c0598ca02a69f63d03201129a3bf3f1/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.HillClimbing.cs |
Based on the data from this PR that @to11mtm referenced: dotnet/orleans#6261

An idea: the big problem we've tried to solve by having separate threadpools was ultimately caused by the idea of work queue prioritization - that some work, which is time-sensitive, needs to have a shorter route to being actively worked on than other work. The big obstacle we've run into historically with the default .NET Threadpool was that its work queue can grow quite large, especially with a large number of busy actors.

What if we solved this problem by having two different work queues routing to the same thread pool rather than two different work queues routing to separate thread pools? If we could move the Akka.NET dispatchers onto that model, the problems it could solve:
The downsides are that outside of the Akka.NET dispatchers, anyone can queue work onto the underlying threadpool - so we might see a return of the types of problems we had around Akka.NET 1.0 where time sensitive infrastructure tasks like Akka.Remote / Akka.Persistence time out due to the length of the work queue. I'd be open to experimenting with that approach too and ditching the idea of separate thread pools entirely. |
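A rough sketch of the "two work queues, one shared thread pool" idea (illustrative only - the class and method names are assumptions, not an Akka.NET API): system work and user work go into separate queues, and a single drain pass scheduled on the default .NET ThreadPool always empties the system queue before touching user work.

```csharp
// Sketch of "two queues, one shared pool" - illustrative only.
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class PrioritizedThreadPoolDispatcher
{
    private readonly ConcurrentQueue<Action> _system = new ConcurrentQueue<Action>();
    private readonly ConcurrentQueue<Action> _user = new ConcurrentQueue<Action>();
    private int _scheduled; // 0 = no drain pass queued, 1 = drain pass pending/running

    public void ScheduleSystem(Action work) { _system.Enqueue(work); EnsureDrain(); }
    public void ScheduleUser(Action work) { _user.Enqueue(work); EnsureDrain(); }

    private void EnsureDrain()
    {
        // Only one drain pass is in flight at a time; it runs on the global .NET ThreadPool.
        if (Interlocked.CompareExchange(ref _scheduled, 1, 0) == 0)
            ThreadPool.UnsafeQueueUserWorkItem(_ => Drain(), null);
    }

    private void Drain()
    {
        try
        {
            // Always empty the system (high-priority) queue before touching user work,
            // and take at most one user item per pass so system work is re-checked often.
            while (_system.TryDequeue(out var sys)) sys();
            if (_user.TryDequeue(out var usr)) usr();
        }
        finally
        {
            Volatile.Write(ref _scheduled, 0);
            // Re-schedule if more work arrived while we were draining.
            if (!_system.IsEmpty || !_user.IsEmpty) EnsureDrain();
        }
    }
}
```

Because the drain runs on the global pool, anything else queued there competes with it - which is exactly the downside described above.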
Perhaps then it makes sense to keep the existing one around if this route is taken? That way, if you are unfortunately having to deal with noisy code for whatever reason in your system, you can at least 'pick your poison'. This does fall into the category of 'things that are easier to solve in .NET Core 3.1+'; 3.1+ lets you look at the work queue counts, and at that point we could 'spin up' additional threads if the work queue looks too long. |
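For reference, the queue metrics mentioned here are exposed as ThreadPool.PendingWorkItemCount and ThreadPool.ThreadCount (available since .NET Core 3.0). A small sketch of using them as a grow signal - the threshold and the addWorker callback are illustrative assumptions:

```csharp
// Sketch: using the .NET Core 3.0+ thread pool metrics as a grow signal.
using System;
using System.Threading;

public static class QueuePressure
{
    public static void CheckAndGrow(Action addWorker, long maxPending = 1_000)
    {
        long pending = ThreadPool.PendingWorkItemCount; // global work queue length
        int threads = ThreadPool.ThreadCount;           // current pool thread count

        Console.WriteLine($"pending={pending}, threads={threads}");

        if (pending > maxPending)
            addWorker(); // e.g. inject an extra dedicated worker or bump min threads
    }
}
```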
Yes, the DedicatedThreadPool is not ideal. 2-3 channels to queue work by priority inside a single dispatcher would be the way to go. The queue algo could be very simple, something like:
Maybe channel-3 is not needed and a flag to directly execute the work-item can be used. If an external source queues too much work on the ThreadPool, … I will try to look into the Dispatcher next after DedicatedThreadPool. |
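A sketch of such a simple priority loop (illustrative, not the actual branch): the worker drains the high-priority channel completely, then takes at most one item from the normal channel before re-checking, and parks on the channels when there is nothing to do. It assumes the channels live for the lifetime of the dispatcher and are never completed.

```csharp
// Illustrative priority read loop over two channels; channels are assumed never completed.
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class PriorityWorker
{
    public static async Task RunAsync(
        ChannelReader<Action> highPriority,
        ChannelReader<Action> normal)
    {
        while (true)
        {
            // Drain everything urgent first.
            while (highPriority.TryRead(out var urgent)) urgent();

            // Then at most one normal item before re-checking the urgent channel.
            if (normal.TryRead(out var item)) { item(); continue; }

            // Nothing queued anywhere: park (no spinning) until either channel has work.
            await Task.WhenAny(
                highPriority.WaitToReadAsync().AsTask(),
                normal.WaitToReadAsync().AsTask());
        }
    }
}
```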
@to11mtm pls benchmark my commit again: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu If possible pls form a 5-7 node cluster and look at the idle CPU state. Maybe somebody has time to explain to me how to start the benchmarks and MultiNode tests. |
@Zetanova - Looks like this last set of changes impacted throughput negatively; it looks like either we are spinning up new threads too slowly, or there's some other overhead negatively impacting us as we try to ramp up. What I'm measuring is the Messages/Sec of RemotePingPong on [this branch](https://github.com/to11mtm/akka.net/tree/remote-full-manual-protobuf-deser); if you can build it you should be able to run it easily enough. Edit: It's kinda all over the place with this set of changes, anywhere from 100,000 to 180,000 msg/sec
Unfortunately I don't have a cluster setup handy that I can use for testing this, and I won't have time to set one up for quite some time either :( |
@to11mtm thx for the run. Currently the scheduler checks every 50 work items to reschedule. But the main problem is not to support max throughput, it is to reduce the idle CPU. |
@to11mtm I checked again, found a small error, and made a new commit. The mistake was a small one; else it should be more or less the same as in the first commit without the auto-scaler. It sets up MaxThread from the start and scales down only if there is a very low work count. I could not run RemotePingPong because of some null exception on startup. My CPU runs at 'only' 60%; that's because of Intel Hyper-Threading. |
@Aaronontheweb Could you take a look? |
@Zetanova I haven't been able to get RemotePingPong to run on my machine with these changes yet - it just idles without running |
@Aaronontheweb This is the simplest one and most likely the best performing |
It does work, but Akka is not using the DedicatedThreadPoolTaskScheduler - only the DedicatedThreadPool for the ForkJoinExecutor. |
I'm taking some notes as I go through this - we really have three issues here:
Solutions, in order of least risk to existing Akka.NET users / implementations:
I'm doing some work on item number 2 to assess how feasible that is - since that can descend into yak-shaving pretty quickly. Getting approach number 1 to work is more straightforward, and @Zetanova has already done some good work there. It's just that I consider approach number 2 to be a better long-term solution to this problem, and if it's only marginally more expensive to implement then that's what I'd prefer to do. |
Some benchmark data from some of @Zetanova's PRs on my machine (AMD Ryzen 1st generation).

As a side note: looks like we significantly increased the number of messages written per round. That is going to crush the nuts of the first round of this benchmark due to the way batching is implemented - we can never hit the threshold so long as the number of messages per round / per actor remains low on that first round. But that's a good argument for leaving batching off by default, I suppose.
|
@Aaronontheweb Thx for testing. In the 'helios-idle-cpu-pooled' branch there is only a modification of the DedicatedThreadPoolTaskScheduler; you can use it for your approach 2). If the dispatchers used this TaskScheduler, then WorkItems would be processed in a loop, in parallel up to ProcessorCount, and a pooled thread would be released only after the WorkItems queue is empty. If the .net ThreadPool is not creating threads fast enough, it could be manipulated with … |
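The sentence above trails off; presumably it refers to ThreadPool.SetMinThreads (an assumption). A small example of what that manipulation looks like - the multiplier is illustrative:

```csharp
// Hedged example: if the .NET ThreadPool ramps up too slowly (it injects threads above
// the minimum only gradually), raising the minimum worker count up front makes threads
// available immediately. Values here are illustrative.
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out var workers, out var iocp);
        Console.WriteLine($"min workers={workers}, min IOCP={iocp}");

        // Ask the pool to keep at least 4x the core count ready; SetMinThreads
        // returns false if the requested values are out of range.
        var desired = Environment.ProcessorCount * 4;
        if (!ThreadPool.SetMinThreads(desired, iocp))
            Console.WriteLine("SetMinThreads rejected the requested values");
    }
}
```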
@Zetanova I think you have the right idea with your design thus far. After doing some tire-kicking on approach number 2 - that's a big hairy redesign that won't solve problems for people with idle CPU issues right now. I'm going to suggest that we try approach number 1 and get a fix out immediately so we can improve the Akka.NET experience for users running on 1.3 and 1.4 right now. Implementing approach number 2 will likely need to wait until Akka.NET v1.5. |
@Aaronontheweb I made a simple new commit now. It replaces the ForkJoinExecutor with the TaskSchedulerExecutor. PingPong works well; memory and GC got lower. Even with this change there will most likely be a large decrease in idle CPU. If possible, pls test this one with RemotePingPong too. |
Will do - I'll take a look. I'm working on an idle CPU benchmark for DedicatedThreadPool now - if that works well I'll do one for Akka.NET too |
Working on some specs to actually measure this here: helios-io/DedicatedThreadPool#23 |
So in case you're wondering what I'm doing, here's my approach:
|
I can't even reproduce the idle CPU issues at the moment - so it makes me wonder if the issues showing up in Akka.NET have another side effect (i.e. intermittent load applied by scheduler-driven messaging) that is creating the issue. I'm going to continue to play with this. |
Running an idle Cluster.WebCrawler cluster:
Lighthouse 2 has no connections - it's not included in the cluster. This tells me that there's something other than the DedicatedThreadPool design itself that is responsible for this. Even on a less powerful Intel machine I can't generate much idle CPU using just the DedicatedThreadPool. |
Interesting...
Thought:
Has anything been done to check if this is a resource constraint issue? The HashedWheelTimer and the DotNetty executor will each take one thread of their own, alongside whatever else each DTP winds up doing. |
yeah, that was my thinking too @to11mtm - I think it's a combination of factors. One thing I can do - make an IEventLoopGroup implementation that runs on an Akka.NET dispatcher. |
Looks at everything needed to implement |
@Aaronontheweb The issue only appears in a formed cluster, with or without load. There can be no user-actors on the node. What makes up most of the "idle-cpu" usage is the spin-lock. If there is absolutely no work there are no spin-waits, but the akka scheduler is ticking every 100ms. @Aaronontheweb pls try a cluster with 3-5 nodes: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled |
https://github.com/Aaronontheweb/akka.net/tree/feature/IEventLoopGroup-dispatcher - tried moving the entire DotNetty event loop onto an Akka.NET dispatcher. We're working on multiple parallel attempts to address this. |
I am pretty sure that the idle load comes from a spin-wait on an event-handle, and components like DotNetty tick at <40ms. Case A:
Case B
If the timeout is very low (<30ms), or the signal of a NoOp-Tick comes very frequently (<30ms), … If the timeout is low, the fix would be just to remove the wait on the signal only in "Case B / Point 5". Case B... |
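To illustrate why a short wait timeout (or a frequent NoOp-tick signal) produces steady CPU on an otherwise idle node, here is a small self-contained demo - not DotNetty's actual loop, just the wait pattern being described, with illustrative timings:

```csharp
// Illustration only. A worker waits on an event handle with a 30ms timeout while a
// timer fires a NoOp-tick every 25ms. The thread keeps waking up, finds no real work,
// and goes straight back to waiting - "idle" turns into constant wake-up churn.
using System;
using System.Threading;

class IdleWaitDemo
{
    static readonly AutoResetEvent Signal = new AutoResetEvent(false);

    static void Main()
    {
        var wakeUps = 0;
        var timer = new Timer(_ => Signal.Set(), null, 0, 25); // NoOp-tick every 25ms

        var sw = System.Diagnostics.Stopwatch.StartNew();
        while (sw.ElapsedMilliseconds < 2000)
        {
            // Worker loop: wait for a signal OR a 30ms timeout, whichever comes first.
            Signal.WaitOne(TimeSpan.FromMilliseconds(30));
            wakeUps++; // every wake-up pays scheduling/spin-wait cost even with no work
        }

        timer.Dispose();
        Console.WriteLine($"{wakeUps} wake-ups in 2 seconds with zero real work");
    }
}
```

Every one of those wake-ups is the kind of "expensive non-work" described in the next comment.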
I'm in agreement on the causes here - just working on how to safely reduce the amount of "expensive non-work" occurring without creating additional problems. |
Achieved a 50% reduction in idle CPU here: #4678 (comment) |
I still have the issue with idle nodes more or less like in #4434
Docker, Akka.NET 1.4.11, dotnet 3.1.404, debug and release builds
All 7 nodes are idling and consume 100% CPU (Docker is limited to 3 cores)
The main hot path is still in DotNetty. Message traffic is low and the node is idling.