Kestrel hangs #36274
Comments
First thing I would recommend:
|
Also consider using YARP as a proxy https://github.com/microsoft/reverse-proxy/ 😄 . |
@davidfowl thanks for getting back to me! Are there easy instructions somewhere to install sdk from command-line in the docker container? |
Nginx is battle tested, industry standard, spec compliant and probably supports our needs. I'm sure YARP is nice, but it's still in preview and I doubt it will have everything we need. |
https://docs.microsoft.com/en-us/dotnet/core/diagnostics/diagnostics-in-containers |
@HarelM Can you also please collect |
I'll do my best although I don't have a clue what this command does... |
@adityamandaleeka where should I run |
If you can reproduce the hang (and it does sound like thread starvation) then getting a process dump would be good enough to tell. |
I believe nginx is the problem as it's not easy to configure (it has bad defaults) and we have been hit by the same issue ourselves when doing benchmarking. If you are not able to get a netstat from the host that runs nginx, can you share the nginx configuration you are using (with redacted private information)? |
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment. If it is closed, feel free to comment when you are able to provide the additional information and we will re-investigate. See our Issue Management Policies for more information. |
Please do not close this issue. I will provide the data within the next few days... |
I updated to version 9.9.170 of our image (can be found in dockerhub).
The dump is 6 GB, so I'll have a hard time sending it.
Threadpool seems to be starved I guess...?
There were other IP addresses that seemed too specific to publish, so I removed them from the list; all of them were in the ESTABLISHED state. I can't make much of it beyond my suspicion about the OSMTokenValidator, which I already mentioned above. |
Yep. Classic thread pool starvation:
This is one of the big clues. Then your stacks: |
Thanks for looking at it so fast!! In any case, thanks for the super super super fast response!! |
The following is most of our nginx configuration file content:
There are also some static file configurations and an HTTP-to-HTTPS redirect, which I left out, but in general this is most of it. We only recently added nginx to solve a problem we had with HTTP/2, which it solved well, so the configuration itself is pretty basic. Any help would be appreciated. |
I don't believe this will force clients to use h2, but it will allow them to negotiate it. Hence they might not actually use h2 and fall back to 1.1. However, when proxying to the backend, nginx doesn't use HTTP/1.1 by default, but 1.0. You need to enable HTTP/1.1 manually, otherwise it will be a huge performance issue (a new connection is created for each request).
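As a rough sketch of what enabling HTTP/1.1 toward the backend could look like (the upstream name `website_backend`, the host `website:8080`, and the `keepalive` value are placeholders, not taken from the configuration in this thread):

```nginx
# Hypothetical upstream; replace the host and port with the real backend.
upstream website_backend {
    server website:8080;
    # Keep up to 32 idle connections to the backend open per worker process.
    keepalive 32;
}

server {
    location / {
        proxy_pass http://website_backend;
        # Talk HTTP/1.1 to the backend instead of the 1.0 default.
        proxy_http_version 1.1;
        # Clear the Connection header so "close" is not forwarded
        # and the upstream connection can be reused.
        proxy_set_header Connection "";
    }
}
```

With these directives nginx can reuse connections to Kestrel instead of opening a new one for every proxied request.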
Another thing I noticed in your source code is the use of |
NB: Other code review feedback, the way you call into logging is not exactly right.
should be written
It prevents unnecessary allocations (not really in this example, since it's an Error level and will probably always be written in the end) by not creating the formatted string when the configured minimum log level filters the message out. It also provides better logging when the provider supports structured logging. |
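To illustrate the difference in general terms, here is a minimal sketch. It is not the code from the repository: the message text, the `userId` parameter, and the `LoggingExample` class are made up for the example, and it assumes the `Microsoft.Extensions.Logging.Abstractions` package is available (it is part of the ASP.NET Core shared framework).

```csharp
using System;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

public static class LoggingExample
{
    public static void Report(ILogger logger, string userId, Exception ex)
    {
        // Eager: the string is concatenated before the logger is even called,
        // whether or not the Error level is enabled.
        logger.LogError("Token validation failed for user " + userId + ": " + ex.Message);

        // Template: the message is only formatted if the level is enabled, and
        // {UserId} is passed as a named value to structured logging providers.
        logger.LogError(ex, "Token validation failed for user {UserId}", userId);
    }

    public static void Main()
    {
        // NullLogger discards everything; it is only here to make the sketch runnable.
        Report(NullLogger.Instance, "42", new InvalidOperationException("demo"));
    }
}
```

The second call also passes the exception as its own argument, so providers can record the stack trace instead of just the message.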
Isn't this the stack causing the problems? I don't know how to read the pstacks output, but I'm assuming this is bad 😄.
You can also run |
Yes, this is the OSMAccessTokenValidator I referenced before publishing the dump results - it was my main suspect. Thanks for the review of the config and my code, I'll do the relevant changes, but I'm not sure it fully explains the hang. |
I'm pretty sure it explains the hang 😄. Once all the thread pool starvation is removed, you can start looking in other places. When your thread pool queue is empty, we can find the next reason for the hangs (if there are any). |
Even more important, this code, in a loop: https://github.com/IsraelHikingMap/Site/blob/d9b73f920ef00378604f14febcddf0a71109d55b/IsraelHiking.DataAccess/elevation-cache/ElevationDataStorage.cs#L62 |
I will close this issue since there is nothing that demonstrates an issue in Kestrel directly, but there are many hints that the issues are in your code. Please add more comments if you think you have removed the usage of Task.Run in your code and still need help investigating. |
Aside: This is a potential performance trap: just make sure the allocation is not over 85K (85,000 bytes, the threshold above which objects go on the large object heap). |
We are talking about 15 files here. I.e. 15 threads at most. |
https://devblogs.microsoft.com/dotnet/large-object-heap/
What does 15 files have to do with 15 threads? I missed the context... |
I was referring to this comment about the fact that this is running in a loop, but there are only 15 files there. There are still a few people using our server right now, so I didn't want to bring it down just yet to test my change. I don't mind changing every line of code in my app to follow some guidelines to avoid thread pool starvation and deadlocks, but it seems that I don't know how. I'm not doing anything super complex in my server. My code is a few dlls and still... |
This loop in particular might not be the root cause of the thread pool starvation. But you still need to remove the usage of Task.Run in every code path that is run per request. |
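A minimal sketch of the pattern being discussed, under stated assumptions: `FetchAsync` is a hypothetical stand-in for any I/O-bound call (HTTP request, file read), and none of this is taken from the repository.

```csharp
using System;
using System.Threading.Tasks;

public static class StarvationExample
{
    // Hypothetical I/O-bound operation standing in for an HTTP or file call.
    static async Task<int> FetchAsync(int id)
    {
        await Task.Delay(50);
        return id * 2;
    }

    // Anti-pattern: the calling thread is blocked on .Result while the work it
    // waits for needs other thread pool threads to run. Under load, every
    // request thread ends up blocked and the pool cannot keep up.
    public static int GetBlocking(int id) => Task.Run(() => FetchAsync(id)).Result;

    // Preferred: return the task and let the caller await it, so no thread
    // is held while the I/O is in flight.
    public static Task<int> GetAsync(int id) => FetchAsync(id);

    public static async Task Main()
    {
        int blocking = GetBlocking(1);
        int awaited = await GetAsync(1);
        Console.WriteLine($"{blocking} {awaited}");
    }
}
```

Both methods return the same value; the difference is only in how many threads are tied up while waiting, which is what matters on a per-request code path.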
I hear ya, it's a hard problem to solve. We've done some work in .NET 6 to improve it |
@sebastienros Thanks for the info! I'll look for code paths that are run per request (mostly the stuff that runs per request does CRUD using elastic, so I tend to feel I'm safe there). In any case, I truly appreciate your dedication and quick response! |
Are you looking to do something like this? |
It seems like it's using two |
Well he's the one that wrote ConcurrentDictionary, so I would copy it if I were you 😉 |
Describe the bug
The Kestrel server stops responding to requests a few seconds after it starts and becomes publicly available through nginx.
To Reproduce
I'm not sure how to create a minimal repro, as this only happens to me in the production environment where there are some users, but basically:
Exceptions (if any)
No exceptions as far as I can see in the logs.
Further technical details
More info
The following is the dockerfile I use:
https://github.com/IsraelHikingMap/Site/blob/9ec891d567ad219d15f957020bbabf3b1db47695/Dockerfile
The startup is fairly standard:
https://github.com/IsraelHikingMap/Site/blob/9ec891d567ad219d15f957020bbabf3b1db47695/IsraelHiking.Web/Startup.cs
I tried for an hour yesterday to make the container stay responsive to outside requests but couldn't make it work properly and couldn't update the site's version.
The docker-compose that we use for our website is the following:
When doing the following the site hangs:
docker-compose pull israelhikingmap/website
docker-compose stop nginx
docker-compose up -d website
docker-compose up nginx
If there are logs I can add or something I can do on the container itself please let me know as this problem is critical now as I can't update the version on my site.
We've seen some outages lately and we've tried all kinds of fixes and changes to the site but now I just can't upgrade to the next version which is problematic.
If you have instructions on what needs to be done on the container itself from command line it would be great as the container only has the runtime version of .net core and not the SDK with all the tools etc.
It feels somewhat similar to the following issue which was closed a long time ago:
aspnet/KestrelHttpServer#1267
@halter73 I've seen that you were active back then, if you have some insights it would be great.
I'd be happy to provide any information that will help solving this issue.
Also worth mentioning that we've used Kestrel as front facing without nginx and it had some hangs, but I think it's worse now with nginx as front facing. Having said that, we needed proxy support for a tile server we have, and the aspnet proxy does not work well with HTTP/2, which we wanted.
More information on that can be found here:
twitchax/AspNetCore.Proxy#78
I'm probably missing a lot of details that led to the current situation, but I'll be happy to share any information that might be helpful...