
Kestrel hangs #36274

Closed
HarelM opened this issue Sep 8, 2021 · 33 comments
Labels
area-networking (Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions), feature-kestrel, Needs: Attention 👋 (This issue needs the attention of a contributor, typically because the OP has provided an update.)

Comments

@HarelM commented Sep 8, 2021

Describe the bug

The Kestrel server stops responding to requests a few seconds after it starts and becomes publicly available through nginx.

To Reproduce

I'm not sure how to create a minimal repro, as this only happens in my production environment where there are real users. But basically:

  1. Run an ASP.NET Core Kestrel server on a Linux machine inside Docker
  2. The container works fine; on the machine where it runs, everything behaves as expected
  3. Start an nginx service to expose the website publicly, and the server stops handling requests

Exceptions (if any)

No exceptions as far as I can see in the logs.

Further technical details

Host (useful for support):
  Version: 5.0.9
  Commit:  208e377a53

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 5.0.9 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 5.0.9 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET runtimes or SDKs:
  https://aka.ms/dotnet-download

More info

The following is the Dockerfile I use:
https://github.com/IsraelHikingMap/Site/blob/9ec891d567ad219d15f957020bbabf3b1db47695/Dockerfile
The startup is fairly standard:
https://github.com/IsraelHikingMap/Site/blob/9ec891d567ad219d15f957020bbabf3b1db47695/IsraelHiking.Web/Startup.cs

I tried for an hour yesterday to keep the container responsive to outside requests, but couldn't make it work properly and couldn't update the site's version.
The docker-compose file we use for our website is the following:

version: '3.4'
services:

  website:
    image: israelhikingmap/website:9.9.160
    env_file: .env
    environment:
      ASPNETCORE_ENVIRONMENT: Production
    volumes:
      - ./appsettings.json:/israelhiking/appsettings.json
      - some more folders
    ports:
      - 8080:80
    logging:
      options:
        max-size: "1g"
        max-file: "3"
    restart: "unless-stopped"

  nginx:
    image: "nginxinc/nginx-unprivileged"
    networks:
      - maptiles_tiles_conn
      - default
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./nginx:/etc/nginx/conf.d
      - ${HOME}/${CERT_DIR}:/etc/nginx/ssl
      - some more folders
    logging:
      options:
        max-size: "256m"
        max-file: "3"
    restart: "unless-stopped"

 more-irrelevant-services

When I do the following, the site hangs:

  1. Change the version to 9.9.166 (a version with minimal code changes)
  2. Save the changes
  3. docker-compose pull israelhikingmap/website
  4. docker-compose stop nginx
  5. docker-compose up -d website
  6. When using the website internally on the machine that runs Docker, it works fine and reports healthy to the health API
  7. docker-compose up nginx
  8. Wait a few seconds and the health API stops responding...

If there are logs I can add, or something I can do on the container itself, please let me know, as this problem is critical now: I can't update the version on my site.
We've seen some outages lately and we've tried all kinds of fixes and changes to the site, but now I just can't upgrade to the next version, which is problematic.
If you have instructions for what to run on the container itself from the command line, that would be great, as the container only has the .NET Core runtime and not the SDK with all the tools.

It feels somewhat similar to the following issue which was closed a long time ago:
aspnet/KestrelHttpServer#1267
@halter73 I've seen that you were active back then, if you have some insights it would be great.

I'd be happy to provide any information that will help solving this issue.
Also worth mentioning: we previously used Kestrel as the front-facing server without nginx and it had some hangs, but I think it's worse now with nginx in front. Having said that, we needed proxy support for a tile server we have, and the ASP.NET proxy does not work well with HTTP/2, which we wanted.
More information on that can be found here:
twitchax/AspNetCore.Proxy#78
I'm probably missing a lot of details that led to the current situation, but I'll be happy to share any information that might be helpful...

@davidfowl (Member)

First thing I would recommend:

@davidfowl (Member)

Also consider using YARP as a proxy https://github.com/microsoft/reverse-proxy/ 😄 .

@HarelM (Author) commented Sep 8, 2021

@davidfowl thanks for getting back to me! Are there easy instructions somewhere for installing the SDK from the command line in the Docker container?

@HarelM (Author) commented Sep 8, 2021

Nginx is battle-tested, industry-standard, spec-compliant, and probably supports our needs. I'm sure YARP is nice, but it's still in preview and I doubt it has everything we need.
The first thing it needs is good documentation, which it lacks judging from what I just read...
Good luck battle-testing it in-house first :-) and posing and answering SO questions and improving the documentation site :-)

@davidfowl (Member)

@davidfowl thanks for getting back to me! Are there easy instructions somewhere for installing the SDK from the command line in the Docker container?

https://docs.microsoft.com/en-us/dotnet/core/diagnostics/diagnostics-in-containers

@adityamandaleeka adityamandaleeka added the Needs: Author Feedback The author of this issue needs to respond in order for us to continue investigating this issue. label Sep 8, 2021
@adityamandaleeka (Member)

@HarelM Can you also please collect netstat -vtan output the next time this happens?

@HarelM (Author) commented Sep 8, 2021

I'll do my best, although I don't have a clue what this command does...
Also, I tried to think of a reason my code could get stuck in a deadlock and came across this issue:
AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet#468
My token authentication code can be found here:
https://github.com/IsraelHikingMap/Site/blob/751a30b9c88e3e6dfb52d11605663c45dcc1deac/IsraelHiking.Web/OsmAccessTokenValidator.cs#L56
Which calls an outside service to get the user details to store in a cache.
This code screams deadlock...
When I open nginx to the world, multiple users can authenticate at the same time, and this might trigger the deadlock issue; I honestly don't know, but I also can't think of a good way to solve it, since I need to be able to get the user details asynchronously...
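To illustrate the shape of the problem (a minimal sketch with made-up names and a made-up endpoint - not the actual validator code): the token validator interface is synchronous, so the async call to the external service ends up being blocked on:

using System.IdentityModel.Tokens.Jwt;
using System.Net.Http;
using System.Security.Claims;
using Microsoft.IdentityModel.Tokens;

// Minimal sketch of the sync-over-async shape described above (hypothetical names and
// endpoint). ISecurityTokenValidator.ValidateToken is synchronous, so the async call to
// the external user-details service ends up being blocked on with .Result, holding a
// thread-pool thread for the whole HTTP round trip. Many concurrent logins therefore
// pin many threads at once.
public class OsmAccessTokenValidatorSketch : ISecurityTokenValidator
{
    private readonly HttpClient _client = new HttpClient();

    public bool CanValidateToken => true;
    public int MaximumTokenSizeInBytes { get; set; } = TokenValidationParameters.DefaultMaximumTokenSizeInBytes;
    public bool CanReadToken(string securityToken) => true;

    public ClaimsPrincipal ValidateToken(string securityToken,
        TokenValidationParameters validationParameters, out SecurityToken validatedToken)
    {
        // Sync-over-async: .Result blocks this request's thread-pool thread until the
        // remote call completes, instead of freeing it while the I/O is in flight.
        var userDetails = _client
            .GetStringAsync("https://example.org/api/user/details?token=" + securityToken)
            .Result;

        validatedToken = new JwtSecurityToken();
        return new ClaimsPrincipal(new ClaimsIdentity(
            new[] { new Claim(ClaimTypes.Name, userDetails) }, "osm"));
    }
}

The real fix would be to get off this synchronous path, or at least make sure concurrent requests don't each block a thread while waiting for the same user details.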
I'll practice the given instructions on my dev environment before I bring the prod server down and harm my users' trust :-)
It might take me a few days to get the relevant data.
Thanks for the quick response guys! I truly appreciate it.

@ghost ghost added Needs: Attention 👋 This issue needs the attention of a contributor, typically because the OP has provided an update. and removed Needs: Author Feedback The author of this issue needs to respond in order for us to continue investigating this issue. labels Sep 8, 2021
@HarelM (Author) commented Sep 8, 2021

@adityamandaleeka where should I run netstat? Inside the container or on the host machine? When I try to run it inside the container it says bash: netstat: command not found; should I apt-get it?

@davidfowl (Member)

If you can reproduce the hang (and it does sound like thread starvation) then getting a process dump would be good enough to tell.

@adityamandaleeka adityamandaleeka added Needs: Author Feedback The author of this issue needs to respond in order for us to continue investigating this issue. and removed Needs: Attention 👋 This issue needs the attention of a contributor, typically because the OP has provided an update. labels Sep 10, 2021
@sebastienros (Member)

I believe nginx is the problem, as it's not easy to configure (it has bad defaults), and we have been hit by the same issue ourselves when benchmarking. If you are not able to get a netstat from the host that runs nginx, can you share the nginx configuration you are using (with private information redacted)?

@ghost commented Sep 14, 2021

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment. If it is closed, feel free to comment when you are able to provide the additional information and we will re-investigate.

See our Issue Management Policies for more information.

@HarelM (Author) commented Sep 15, 2021

Please do not close this issue. I will provide the data within the next few days...

@ghost ghost added Needs: Attention 👋 This issue needs the attention of a contributor, typically because the OP has provided an update. and removed Needs: Author Feedback The author of this issue needs to respond in order for us to continue investigating this issue. Status: No Recent Activity labels Sep 15, 2021
@HarelM (Author) commented Sep 15, 2021

I updated to version 9.9.170 of our image (it can be found on Docker Hub).

/tools/dotnet-counters monitor --name dotnet
[System.Runtime]
    % Time in GC since last GC (%)                                 0    
    Allocation Rate (B / 1 sec)                              596,264    
    CPU Usage (%)                                                  0    
    Exception Count (Count / 1 sec)                                0    
    GC Fragmentation (%)                                          48.245
    GC Heap Size (MB)                                            451    
    Gen 0 GC Count (Count / 1 sec)                                 0    
    Gen 0 Size (B)                                               192    
    Gen 1 GC Count (Count / 1 sec)                                 0    
    Gen 1 Size (B)                                        18,396,080    
    Gen 2 GC Count (Count / 1 sec)                                 0    
    Gen 2 Size (B)                                        10,898,832    
    IL Bytes Jitted (B)                                    1,110,432    
    LOH Size (B)                                         603,194,128    
    Monitor Lock Contention Count (Count / 1 sec)                  0    
    Number of Active Timers                                        8    
    Number of Assemblies Loaded                                  216    
    Number of Methods Jitted                                  14,871    
    POH (Pinned Object Heap) Size (B)                      2,866,456    
    ThreadPool Completed Work Item Count (Count / 1 sec)           0    
    ThreadPool Queue Length                                    1,057    
    ThreadPool Thread Count                                      457    
    Working Set (MB)                                             897

The dump is 6 GB, so I'll have a hard time sending it.
Having said that, when I run pstacks while analyzing the dump I get the following:

> pstacks                                                                                                                                                                                    

________________________________________________
 ~~~~ 1
    1 System.Threading.ManualResetEventSlim.Wait(Int32, CancellationToken)
    1 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, CancellationToken)
    1 System.Threading.Tasks.Task.InternalWaitCore(Int32, CancellationToken)
    1 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task)
    1 System.Runtime.CompilerServices.TaskAwaiter.GetResult()
    1 Microsoft.AspNetCore.Hosting.WebHostExtensions.Run(IWebHost)
    1 IsraelHiking.Web.Program.Main(String[])


________________________________________________
      ~~~~ 5f,6d,90,95...
        50 System.Threading.ManualResetEventSlim.Wait(Int32, CancellationToken)
        50 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, CancellationToken)
        50 System.Threading.Tasks.Task.InternalWaitCore(Int32, CancellationToken)
        50 System.Threading.Tasks.Task<System.__Canon>.GetResultCore(Boolean)
  610 IsraelHiking.Web.OsmAccessTokenValidator.ValidateToken(String, TokenValidationParameters, SecurityToken ByRef)
  610 Microsoft.AspNetCore.Authentication.JwtBearer.JwtBearerHandler+<HandleAuthenticateAsync>d__6.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<HandleAuthenticateAsync>d__6 ByRef)
  610 Microsoft.AspNetCore.Authentication.JwtBearer.JwtBearerHandler.HandleAuthenticateAsync()
  610 Microsoft.AspNetCore.Authentication.AuthenticationHandler<System.__Canon>+<AuthenticateAsync>d__48.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(__Canon> ByRef)
  610 Microsoft.AspNetCore.Authentication.AuthenticationHandler<System.__Canon>.AuthenticateAsync()
  610 Microsoft.AspNetCore.Authentication.AuthenticationService+<AuthenticateAsync>d__14.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<AuthenticateAsync>d__14 ByRef)
  610 Microsoft.AspNetCore.Authentication.AuthenticationService.AuthenticateAsync(HttpContext, String)
  610 Microsoft.AspNetCore.Authentication.AuthenticationMiddleware+<Invoke>d__6.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<Invoke>d__6 ByRef)
  610 Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext)
  610 Microsoft.AspNetCore.Cors.Infrastructure.CorsMiddleware.Invoke(HttpContext, ICorsPolicyProvider)
  610 Microsoft.AspNetCore.ResponseCompression.ResponseCompressionMiddleware+<Invoke>d__3.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<Invoke>d__3 ByRef)
  610 Microsoft.AspNetCore.ResponseCompression.ResponseCompressionMiddleware.Invoke(HttpContext)
  610 Microsoft.AspNetCore.HostFiltering.HostFilteringMiddleware.Invoke(HttpContext)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequests>d__223<System.__Canon>.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<ProcessRequests>d__223<__Canon> ByRef)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests(IHttpApplication<__Canon>)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequestsAsync>d__222<System.__Canon>.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<ProcessRequestsAsync>d__222<__Canon> ByRef)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequestsAsync(IHttpApplication<__Canon>)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.HttpConnection+<ProcessRequestsAsync>d__12<System.__Canon>.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(<ProcessRequestsAsync>d__12<__Canon> ByRef)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.HttpConnection.ProcessRequestsAsync(IHttpApplication<__Canon>)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.KestrelConnection<System.__Canon>+<ExecuteAsync>d__6.MoveNext()
  610 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(__Canon> ByRef)
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.KestrelConnection<System.__Canon>.ExecuteAsync()
  610 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.KestrelConnection<System.__Canon>.Execute()
  610 System.Threading.ThreadPoolWorkQueue.Dispatch()


________________________________________________
      ~~~~ 35
         1 (dynamicClass).IL_STUB_PInvoke(IntPtr, SocketEvent*, Int32*)
         1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
         1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)
           ~~~~ 3c
              1 System.Threading.Thread.Sleep(TimeSpan)
              1 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.TimerLoop()
              1 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat+<>c.ctor>b__8_0(Object)
           ~~~~ 18,1b,20
              3 (dynamicClass).IL_STUB_PInvoke(SafeHandle, Byte*, Int32)
              3 System.IO.FileSystemWatcher+RunningInstance.TryReadEvent(NotifyEvent ByRef)
              3 System.IO.FileSystemWatcher+RunningInstance.ProcessEvents()
              3 System.IO.FileSystemWatcher+RunningInstance+<>c.<Start>b__14_0(Object)
              3 System.Threading.Tasks.Task.InnerInvoke()
              3 System.Threading.Tasks.Task+<>c.cctor>b__277_0(Object)
              3 System.Threading.ExecutionContext.RunInternal(ExecutionContext, ContextCallback, Object)
              3 System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task ByRef, Thread)
              3 System.Threading.Tasks.Task.ExecuteEntryUnsafe(Thread)
              3 System.Threading.Tasks.ThreadPoolTaskScheduler+<>c.cctor>b__10_0(Object)
         4 System.Threading.ThreadHelper.ThreadStart_Context(Object)
         4 System.Threading.ExecutionContext.RunInternal(ExecutionContext, ContextCallback, Object)
    5 System.Threading.ThreadHelper.ThreadStart(Object)


________________________________________________
      ~~~~ 21
         1 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, CancellationToken)
         1 System.Threading.SemaphoreSlim.Wait(Int32, CancellationToken)
         1 System.Collections.Concurrent.BlockingCollection<Microsoft.Extensions.Logging.Console.LogMessageEntry>.TryTakeWithNoTimeValidation(LogMessageEntry ByRef, Int32, CancellationToken, CancellationTokenSource)
         1 System.Collections.Concurrent.BlockingCollection<Microsoft.Extensions.Logging.Console.LogMessageEntry>+<GetConsumingEnumerable>d__68.MoveNext()
         1 Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.ProcessLogQueue()
         1 System.Threading.ThreadHelper.ThreadStart_Context(Object)
         1 System.Threading.ExecutionContext.RunInternal(ExecutionContext, ContextCallback, Object)
      ~~~~ 4e0
         1 System.Threading.WaitHandle.WaitOneNoCheck(Int32)
         1 System.Diagnostics.Tracing.CounterGroup.PollForValues()
    2 System.Threading.ThreadHelper.ThreadStart()


==> 618 threads with 4 roots

Threadpool seems to be starved I guess...?

> threadpool                                                                                                                                                                                 
CPU utilization: 9 %
Worker Thread: Total: 610 Running: 610 Idle: 0 MaxLimit: 32767 MinLimit: 8
Work Request in Queue: 1
    Unknown Function: 00007F115AB82960  Context: 00005563E4447B60
--------------------------------------
Number of Timers: 3
--------------------------------------
Completion Port Thread:Total: 0 Free: 0 MaxFree: 16 CurrentLimit: 0 MaxLimit: 1000 MinLimit: 8
netstat -vtan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:8090            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:17500           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8989            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:445             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:41855           0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:17600         0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:41153           0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:17603         0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:34697           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:139             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:36655           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:9200            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49169           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:5939          0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8088            0.0.0.0:*               LISTEN     
tcp6       0      0 :::8090                 :::*                    LISTEN     
tcp6       0      0 :::443                  :::*                    LISTEN     
tcp6       0      0 :::17500                :::*                    LISTEN     
tcp6       0      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 :::3389                 :::*                    LISTEN     
tcp6       0      0 :::445                  :::*                    LISTEN     
tcp6       0      0 :::2049                 :::*                    LISTEN     
tcp6       0      0 :::39361                :::*                    LISTEN     
tcp6       0      0 :::33443                :::*                    LISTEN     
tcp6       0      0 :::48547                :::*                    LISTEN     
tcp6       0      0 :::60165                :::*                    LISTEN     
tcp6       0      0 :::139                  :::*                    LISTEN     
tcp6       0      0 :::111                  :::*                    LISTEN     
tcp6       0      0 :::80                   :::*                    LISTEN     
tcp6       0      0 :::9200                 :::*                    LISTEN     
tcp6       0      0 :::49169                :::*                    LISTEN     
tcp6       0      0 :::8081                 :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
tcp6       0      0 ::1:3350                :::*                    LISTEN     
tcp6       0      0 ::1:631                 :::*                    LISTEN     
tcp6       0      0 :::8088                 :::*                    LISTEN     

There were other entries with specific IP addresses, which I removed from the list (it seemed like I shouldn't publish them); all of them were in the ESTABLISHED state.

I can't make much out of it, besides my suspicion about the OsmAccessTokenValidator which I already mentioned above.
Let me know what else I can do or try.

@davidfowl (Member)

Threadpool seems to be starved I guess...?

Yep.

Classic thread pool starvation:

    ThreadPool Queue Length                                    1,057    
    ThreadPool Thread Count                                      457    

This is one of the big clues.
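As a rough illustration (a sketch, not output from the dump above), the same two signals can be read in-process via the ThreadPool properties available since .NET Core 3.0; a near-zero pending count is healthy, while a queue that keeps growing alongside the thread count while CPU stays low is the starvation signature:

using System;
using System.Threading;

// Rough sketch: periodically print the same signals dotnet-counters showed above.
// A growing PendingWorkItemCount together with a climbing ThreadCount (while CPU stays
// low) is the classic thread pool starvation signature.
class ThreadPoolProbe
{
    static void Main()
    {
        for (var i = 0; i < 10; i++)
        {
            Console.WriteLine(
                $"Threads: {ThreadPool.ThreadCount}, " +
                $"Pending: {ThreadPool.PendingWorkItemCount}, " +
                $"Completed: {ThreadPool.CompletedWorkItemCount}");
            Thread.Sleep(1000);
        }
    }
}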

Then your stacks:

@HarelM (Author) commented Sep 15, 2021

Thanks for looking at it so fast!!
While this is probably the case, I'm not sure I understand what causes it.
Let's assume that there's a thread handling every request a user makes and that this thread is stuck at the OSM token validator code (this is what I understand from the dump).
I checked Google Analytics before trying to stop the server, and it showed about 3 active users, which makes sense as I started playing with the server at 1 AM (this server mainly serves travelers in Israel and is not used much outside Israel).
There seem to be 400+ threads; how come all of them are stuck?
And also, how do I properly solve this?
I've receive a partial answer here:
AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet#468
But my main question is how to avoid validating the token for the same user at almost the same time without using lock?

In any case, thanks for the super super super fast response!!

@HarelM (Author) commented Sep 16, 2021

The following is most of our nginx configuration file content:

# HTTPS
server {
    listen          443 ssl http2;
    ssl_certificate path-to-crt-file;
    ssl_certificate_key path-to-key-file;

    client_max_body_size 0;

    error_page 404 500 502 503 504 = /resource-not-found.html;

# CORS
    proxy_hide_header   'Access-Control-Allow-Origin';
    add_header          'Access-Control-Allow-Origin' '*';

# Dynamic contents
    location = /        { proxy_pass http://website/; }
}

There are some static-file configurations and an HTTP-to-HTTPS redirect which I left out, but in general this is most of it. We only recently added nginx due to an HTTP/2 problem we needed to solve, and nginx solved it well, so the configuration itself is pretty basic.
Overall the site is a single server running all the Docker containers with docker-compose, without scaling, as the traffic is manageable.

Any help would be appreciated.

@sebastienros (Member)

listen 443 ssl http2;

I don't believe this will force clients to use h2, but it will allow negotiating it, so they might not actually use h2 and may fall back to 1.1. However, when proxying, nginx doesn't use HTTP/1.1 by default but HTTP/1.0. You need to enable HTTP/1.1 manually, otherwise it becomes a huge performance issue (a connection is created for each request).

proxy_http_version 1.1;

Source: https://docs.microsoft.com/en-us/aspnet/core/host-and-deploy/linux-nginx?view=aspnetcore-5.0#configure-nginx

Another thing I noticed in your source code is the use of Task.Run to "fake" an async call. By doing this you are tying up extra thread-pool threads, which would explain the thread pool exhaustion and hangs under load. Here is the culprit: https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.DataAccess/OsmRepository.cs#L27 though I haven't looked at the rest of the code. In that specific case, if nothing inside the method is async, just return Task.FromResult(elements); that won't spawn another thread.
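To make the suggestion concrete, here is a minimal sketch of the two shapes (hypothetical repository and method names, not the actual OsmRepository code):

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of the pattern being discussed, not the actual repository code.
public class ElementsRepositorySketch
{
    // Anti-pattern: wrapping synchronous work in Task.Run just to return a Task.
    // The synchronous work still runs on a thread-pool thread, so nothing is freed up;
    // it only adds scheduling overhead and competes with request processing.
    public Task<List<string>> GetElementsFakeAsync()
    {
        return Task.Run(() =>
        {
            var elements = LoadElementsSynchronously();
            return elements;
        });
    }

    // Preferred when the work is inherently synchronous: do it inline and wrap the
    // result. No extra thread-pool work item is queued.
    public Task<List<string>> GetElements()
    {
        var elements = LoadElementsSynchronously();
        return Task.FromResult(elements);
    }

    private List<string> LoadElementsSynchronously() => new List<string> { "way/1", "node/2" };
}

If the underlying work is really I/O, the better long-term fix is to use that I/O's async API and return its Task directly.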

@sebastienros (Member)

NB: other code review feedback - the way you call into logging is not quite right.

_logger.LogError($"All Retries failed while trying to get data from {language}.wikipedia");

should be written

_logger.LogError("All Retries failed while trying to get data from {Language}.wikipedia", language);

It prevents unnecessary allocations by not creating the formatted string when the log level isn't enabled (not that it matters much in this example, since it's the Error level and will probably always be enabled in the end). It also produces better output when the provider supports structured logging.

@davidfowl (Member)

Isn't this stack causing the problems? I don't know how to read the pstacks output, but I'm assuming this is bad 😄.

      ~~~~ 5f,6d,90,95...
        50 System.Threading.ManualResetEventSlim.Wait(Int32, CancellationToken)
        50 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, CancellationToken)
        50 System.Threading.Tasks.Task.InternalWaitCore(Int32, CancellationToken)
        50 System.Threading.Tasks.Task<System.__Canon>.GetResultCore(Boolean)
  610 IsraelHiking.Web.OsmAccessTokenValidator.ValidateToken(String, TokenValidationParameters, SecurityToken ByRef)

You can also run clrstack -all and see if it shows similar results.

@HarelM (Author) commented Sep 16, 2021

Yes, this is the OsmAccessTokenValidator I referenced before publishing the dump results - it was my main suspect.
I have just pushed a commit to a branch to hopefully solve this, I'll try it later tonight.
This is the relevant commit:
IsraelHikingMap/Site@01ceb42

Thanks for the review of the config and my code; I'll make the relevant changes, but I'm not sure it fully explains the hang.
I'll probably know better tonight...

@davidfowl (Member)

I'm pretty sure it explains the hang 😄. Once all the thread pool starvation is removed, you can start looking at other places. When your thread pool queue is empty, we can find the next reason for the hangs (if there are any).

@sebastienros (Member)

I will close this issue since nothing demonstrates an issue in Kestrel directly, but there are many hints that the problems are in your code. Please add more comments once you think you have removed the usage of Task.Run from your code and still need help investigating.

@davidfowl (Member)

@HarelM (Author) commented Sep 16, 2021

We are talking about 15 files here, i.e. 15 threads at most.
The files are relatively large - about 10 MB each - why is this a performance trap?
I wanted the service to load fast, so I split this part into several tasks, which obviously loaded the service a lot faster. I'd be happy to know what I'm missing here...

@davidfowl (Member)

The files are relatively large - about 10 MB each - why is this a performance trap?

https://devblogs.microsoft.com/dotnet/large-object-heap/
https://docs.microsoft.com/en-us/aspnet/core/performance/memory?view=aspnetcore-5.0#large-objects-heap

We are talking about 15 files here, i.e. 15 threads at most.

What do 15 files have to do with 15 threads? I missed the context...

@HarelM (Author) commented Sep 16, 2021

Even more important, this code, in a loop: https://github.com/IsraelHikingMap/Site/blob/d9b73f920ef00378604f14febcddf0a71109d55b/IsraelHiking.DataAccess/elevation-cache/ElevationDataStorage.cs#L62

I was referring to this comment about the fact that this runs in a loop - but there are only 15 files there.
Regarding the large object heap, I'm not sure there's a problem here - the files are loaded into memory once in this singleton class and stay there for the rest of the server's lifetime; they are not disposed, in theory. As far as I know this is fair play...

There are still a few people using our server right now so I didn't want to bring it down just yet to test my change.
Meanwhile I added app.UseBlockingDetection(); to see if I can spot things to fix.
Unfortunately it flagged a lot of places I have no clue how to solve correctly, including reading a zip file, the new code I wrote to overcome the lock I used in the past, etc.
I've been using .NET for the last 10+ years; I know all about async/await and the possibility of deadlocks and thread pool starvation, and still it seems I can't write "good" code.
This feels like the opposite of the pit of success...

I don't mind changing every line of code in my app to follow some guidelines to avoid thread pool starvation and deadlocks, but it seems that I don't know how.
@davidfowl I saw your tweet about the fact that you get a mail every so often and the problem is, in most cases, thread pool starvation. I think it's a red flag about the usability of the things you guys develop and how easy it is to use them incorrectly.

I'm not doing anything super complex in my server. My code is a few DLLs and still...
Any help would be appreciated, I'm out of ideas how to address this situation...

@sebastienros (Member)

This loop in particular might not be the root cause of the thread pool starvation. But you still need to remove the usage of Task.Run in every code path that is run per request.

@davidfowl (Member)

@davidfowl I saw your tweet about the fact that you get a mail every so often and the problem is, in most cases, thread pool starvation. I think it's a red flag about the usability of the things you guys develop and how easy it is to use them incorrectly.

I hear ya, it's a hard problem to solve. We've done some work in .NET 6 to improve it.

@HarelM (Author) commented Sep 16, 2021

@sebastienros Thanks for the info! I'll look for code paths that run per request (most of what runs per request does CRUD using Elasticsearch, so I tend to feel I'm safe there).
I've just updated the version on the server, and the fix I made to the OSM token validator seems to have fixed the hang.
I'm still not sure the code there is good enough (besides the stuff I need to clean up); if you guys could have a look at the linked commit (IsraelHikingMap/Site@01ceb42) it would be super helpful.
The relevant issue, which is still not fixed and which I think is the root cause of the problem here, is the following:
AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet#468
If I had had an async method to begin with, and some warning about using lock, it might have made my life easier.

In any case, I truly appreciate your dedication and quick response!

@davidfowl (Member)

Are you looking to do something like this?

@HarelM (Author) commented Sep 19, 2021

It seems like it's using two locks and a non-concurrent dictionary. To me it seems like a weird solution... It might be the right one, but I wouldn't have thought to write it this way after reading all the warnings about using lock and non-concurrent data structures... :-/

@davidfowl (Member)

Well he's the one that wrote ConcurrentDictionary, so I would copy it if I were you 😉
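For anyone landing here later, one common shape for this kind of per-key async cache (a hedged sketch only - not the code linked above, and the names are made up) is ConcurrentDictionary with Lazy<Task<T>>, so concurrent requests for the same token await a single in-flight call instead of serializing on a lock:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hedged sketch of a per-key async cache: concurrent callers for the same key share one
// in-flight Task instead of blocking each other with a lock. Names are illustrative.
public class UserDetailsCacheSketch
{
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _cache =
        new ConcurrentDictionary<string, Lazy<Task<string>>>();

    private readonly Func<string, Task<string>> _fetchUserDetailsAsync;

    public UserDetailsCacheSketch(Func<string, Task<string>> fetchUserDetailsAsync)
    {
        _fetchUserDetailsAsync = fetchUserDetailsAsync;
    }

    public Task<string> GetUserDetailsAsync(string token)
    {
        // GetOrAdd can race and create more than one Lazy, but only the stored one is
        // returned to every caller, and ExecutionAndPublication ensures its factory
        // (the remote call) runs exactly once per key.
        var lazy = _cache.GetOrAdd(
            token,
            key => new Lazy<Task<string>>(
                () => _fetchUserDetailsAsync(key),
                LazyThreadSafetyMode.ExecutionAndPublication));
        return lazy.Value;
    }
}

A production version would also need expiry and eviction of faulted entries, which this sketch omits.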

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 2, 2023