Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

6.0.24 request routing causes massive performance regression #2579

Open
2 of 7 tasks
johnmadden-sp opened this issue Dec 17, 2024 · 15 comments · May be fixed by #2590
Open
2 of 7 tasks

6.0.24 request routing causes massive performance regression #2579

johnmadden-sp opened this issue Dec 17, 2024 · 15 comments · May be fixed by #2590
Assignees

Comments

@johnmadden-sp
Copy link

johnmadden-sp commented Dec 17, 2024

Issue report

Are you sure this is a bug in Passenger?

Not necessarily a bug but the implementation change -- we think -- caused a severe increase in latency leading to outages. The 6.0.24 changelog mentions: We changed the way we route requests. Instead of picking the least-busy process, we now first prioritize new processes first. Reverting to 6.0.23 fixes the problem.

Please try with the newest version of Passenger to avoid issues that have already been fixed
New issue in 6.0.24

Question 1: What is the problem?

  • What is the expected behavior? No change in performance from 6.0.23
  • What is the actual behavior? Unknown performance degradation leads to huge latency issues and outages. With a timeout of 2 minutes, passengers are being restarted automatically, leading to increased CPU use on startup. The concept of routing requests to new passengers is the culprit, I think the issue is these haven't been warmed up yet, leading to a lot of latency and that passenger being killed off again. The proc manager kills the child passenger, sending a 502 to nginx, etc.
  • How can we reproduce it? Deploy new [Rails] containers on 6.0.24 and send traffic to them. We did nothing special here, the impact was immediate and obvious. These are just docker containers, no rolling restarts. I think we should instead provide a restart configuration option or something to allow the user to pick the algorithm.

Question 2: Passenger version and integration mode:

  • Enterprise, nginx, libnginx-mod-http-passenger-enterprise

Question 3: OS or Linux distro, platform (including version):

  • Docker image phusion/passenger-ruby32:3.0.7 amd64.

Question 4: Passenger installation method:

Your answer:

  • RubyGems + Gemfile
  • RubyGems, no Gemfile
  • Phusion APT repo
  • Phusion YUM repo
  • OS X Homebrew
  • source tarball
  • Other, please specify:

phusion-supplied docker images, but with an apt-get install libnginx-mod-http-passenger-enterprise the 6.0.24 upgrade takes place.

Question 5: Your app's programming language (including any version managers) and framework (including versions):

  • Ruby 3.2.x, Rails 7.x

Question 6: Are you using a PaaS and/or containerization? If so which one?

  • We're building containers off of phusion/passenger-ruby32:3.0.7
@CamJN
Copy link
Member

CamJN commented Dec 17, 2024

No rolling restarts are involved so the difference between 2.0.23 and 2.0.24 is that older processes are prioritized over less busy processes. The newness bias is only involved when rolling restarting (ie processes from both before and after a restart was begun).

So what you are actually seeing is that older more-warmed-up processes are responding more slowly than less-busy processes, which is surprising, so it's possible that it could be due to another change in 6.0.24, or it might just be that the warmed-up processes are slower, we'll have to gather more data.

Can you provide a bit more information:

Are you using multi-threading?
How many app-processes are you having passenger create?
Are you using Passenger's pre-start option?
Can you capture the latencies of the first several requests to a container with 2.0.23 and 2.0.24?

Can you be more detailed with regards to this bit:

With a timeout of 2 minutes, Passengers are being restarted automatically, […] The proc manager kills the child Passenger, sending a 502 to Nginx, etc.

Specifically, is the Passenger process being killed by a watchdog or are the Passenger-managed app processes being killed by Passenger?

@CamJN
Copy link
Member

CamJN commented Dec 17, 2024

Just a note for myself: I can diff these tags with the following to get the most suspect changes: git diff release-6.0.23 release-6.0.24 -- src ':!src/cxx_supportlib/vendor-modified/boost'

@johnmadden-sp
Copy link
Author

  • Yes to multi-threading
  • 12 procs (in this app group)
  • Further config:
passenger_root /usr/lib/ruby/vendor_ruby/phusion_passenger/locations.ini;
passenger_ruby /usr/bin/ruby;
passenger_instance_registry_dir /var/passenger-instreg;
passenger_core_file_descriptor_ulimit 16384;
passenger_start_timeout 180;
passenger_pre_start http://localhost/websocket;
passenger_show_version_in_header off;
passenger_intercept_errors on;
passenger_rolling_restarts on;
passenger_pool_idle_time 3600;
passenger_max_request_queue_size 1000;
passenger_max_request_time 120;
passenger_concurrency_model thread;
passenger_max_pool_size 15;
passenger_thread_count 10;
passenger_request_queue_overflow_status_code 504;
  • Latencies... No, there'd be now way of capturing the first few. But after 30 minutes of runtime,
    p95 on 6.0.23 ~390ms
    p95 on 6.0.24 ~1800ms
  • As for what's doing the killing,
InternalUtils.cpp:80 ]: [Client 14-3168] Sending 502 response: application did not send a complete response
CheckoutSession.cpp:449 ]: [Client 3-3129] Maximum request time of 120 seconds reached, killing process 851

@CamJN
Copy link
Member

CamJN commented Dec 17, 2024

Ok so the oldest process is likely being loaded with 10 requests before the next process is chosen, and then Passenger is killing the app process because some requests aren't done in 120s. @FooBarWidget thoughts?

@johnmadden-sp
Copy link
Author

We did indeed see high thread use and busyness. (I assumed this was more tied to latency and thus just regular use-the-capacity.) Can we maybe round-robin or otherwise better distribute to new passengers, or maybe only start sending to new ones once a threshold of them are ready? (But again we aren't doing rolling restarts, this is happening in new containers on deployment.)

@CamJN
Copy link
Member

CamJN commented Dec 24, 2024

Can you tell us how you decided on the 12 processes and 10 threads configuration?

@CamJN
Copy link
Member

CamJN commented Dec 27, 2024

I'm having trouble reproducing this issue. I'm able to load nginx+Passenger Enterprise set to 12 procs and 10 threads with 120 concurrent requests and 50,000 total requests and the worst time I get is 1.6 seconds to finish processing a request+reply, with a mean of 208ms and a median of 185ms. And that's with a 6 core machine, so it's actually a bit over-loaded in that configuration.

@joelngwt
Copy link

joelngwt commented Jan 2, 2025

I've recently been testing an upgrade from 6.0.18 to 6.0.24, and have also noticed an increase in request times. I'm not sure if this is the exact issue (I haven't got around to trying out 6.0.23 yet), but there has been no other changes other than updating Passenger and nginx (nginx stayed at 1.18, only applied security updates).

I have Locust load testing results below. In 6.0.24, the requests per second has dropped, it has introduced a lot of jitter in the response times, and the average p95 has increased. No change in the median response times.

New Project

Edit: I'm not using Enterprise, if that matters. I just noticed the changelog for the routing changes has an [Enterprise] tag on it.

@CamJN
Copy link
Member

CamJN commented Jan 2, 2025

The swap of priority between oldest-process and least-busy-process when routing is also the case in Passenger OSS, the Enterprise bit was the prioritization of new processes during a rolling restart which isn't relevant to this issue.

Are you also using the docker image like @johnmadden-sp?

@joelngwt
Copy link

joelngwt commented Jan 3, 2025

Not using Docker, I'm running it on Ubuntu 22 on EC2 instances. Used apt to install (https://www.phusionpassenger.com/docs/tutorials/deploy_to_production/installations/oss/aws/ruby/nginx/)

Other details:

  • Ruby 3.2.5 with YJIT
  • Rails 7.1.4
  • passenger_pre_start is used
  • passenger_min_instances 15;
  • passenger_max_pool_size 30;

@CamJN
Copy link
Member

CamJN commented Jan 7, 2025

Actually, come to think of it, @joelngwt if you're using OSS Passenger then the concurrency per process is set to 1 by default as you cannot use multithreading, so the new routing would behave exactly the same as before right @FooBarWidget?

@johnmadden-sp
Copy link
Author

@CamJN 12 procs as the result of cpu count and threads to get better concurrency with i/o waits. (I can't go into a great level of detail publicly of course.)

@CamJN
Copy link
Member

CamJN commented Jan 13, 2025

@johnmadden-sp are you using sticky sessions by any chance?

@CamJN
Copy link
Member

CamJN commented Jan 13, 2025

@FooBarWidget I had a thought, since joelngwt should be seeing the same process selection as 6.0.23 due to OSS not having multithreading, could it be that I messed up the changes to SessionManagement.cpp? Could you please take a look and think about if I got the logic right there? And if the changes there are fine, can you suggest an approach for getting the rolling-restart state in the session management methods so I can limit the new routing to during RRs?

@johnmadden-sp
Copy link
Author

@CamJN no, no sticky sessions.

@CamJN CamJN linked a pull request Jan 20, 2025 that will close this issue
@CamJN CamJN linked a pull request Jan 20, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants