Discovery API fails silently after high traffic (200 req / second) #1258
Comments
As per @LtChae, this error is likely caused by the worker pool in Prestige not being large enough to accommodate the number of requests.
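For reference, if the worker-pool theory holds, one knob worth checking is the HTTP connection pool underneath the Presto client: hackney (the client under HTTPoison) typically surfaces a saturated pool as a checkout_timeout error. Below is a minimal sketch of giving that traffic a larger, dedicated pool. The pool name, sizes, and the assumption that requests can be routed through it via HTTPoison options are all hypothetical, not the project's actual wiring.

```elixir
# Hypothetical sketch: give Presto HTTP traffic its own, larger hackney pool.
# Pool name and sizes are placeholders, not the project's real configuration.
defmodule PrestoPoolSketch do
  def start do
    children = [
      :hackney_pool.child_spec(:presto_pool, timeout: 15_000, max_connections: 200)
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  # Route requests through the named pool (HTTPoison passes this down to hackney),
  # so heavy query traffic waits on a bigger pool instead of hitting checkout_timeout.
  def get(url) do
    HTTPoison.get(url, [], hackney: [pool: :presto_pool])
  end
end
```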
After several further attempts to reproduce this, we were unable to crash the API as described. It would crash, but not silently: it would start up a new API pod within seconds, with minimal downtime. @christinepoydence, are you alright with me closing this until it is reproduced again? We can reopen at that time.
When making 50 requests to the API, after scaling its deployment to 4 instances, we received the same error (shown below and originally mentioned above) on one of the four pods. The pod was not reported as unhealthy, so we had to find which one it was, restart it, and attempt the test again. After the second attempt of the test, the same thing occurred, so it seems to be more consistently reproducible now. Is this a problem like Tim said, where we need to scale the Prestige workers, and the API pods are actually fine?
From that error, it looks like it may be timing out trying to get access to Presto. Try running the test with the Presto console up and watch to see if Presto struggles with the load.
@LtChae - we did this and didn't see any issues in the Presto console. It seemed to only receive 45 out of the 50 requests that we sent, but it handled those just fine.
Describe the bug
When stress testing discovery-api, if it becomes overwhelmed, its Presto connection will break and not automatically restore. The API becomes non-functional and does not know to restart itself.
To Reproduce
Steps to reproduce the behavior:
Stress test the endpoint /api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query with 200 users. (May take multiple attempts or a slight bump to the 200 number; see the load-generation sketch below.)
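A rough way to generate that load, sketched in Elixir (HTTPoison as the HTTP client and the base URL are assumptions; any tool that holds 200 concurrent requests open would do):

```elixir
# Hypothetical load generator: 200 concurrent GETs against the query endpoint.
# The base URL is a placeholder for whatever environment is under test.
url =
  "https://discovery-api.example.com" <>
    "/api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query"

1..200
|> Task.async_stream(
  fn _i -> HTTPoison.get(url, [], recv_timeout: 60_000) end,
  max_concurrency: 200,
  timeout: :infinity
)
|> Enum.frequencies_by(fn
  {:ok, {:ok, %HTTPoison.Response{status_code: code}}} -> code
  {:ok, {:error, %HTTPoison.Error{reason: reason}}} -> reason
  {:exit, reason} -> {:task_exit, reason}
end)
|> IO.inspect(label: "responses by status / error")
```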
Expected behavior
If the API reaches the broken Presto checkout_timeout state, that's fine, but it should at least be detected somehow, possibly in the health endpoint. This way the pod can restore itself to a functional state again.
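A minimal sketch of the kind of detection described above: the health endpoint runs a cheap probe query and reports unhealthy whenever it cannot reach Presto, so a Kubernetes liveness probe can recycle the stuck pod. The module name and the run_probe/1 placeholder are hypothetical; a real implementation would go through the project's Presto client (Prestige).

```elixir
defmodule DiscoveryApi.PrestoHealthCheck do
  # Hypothetical module; the real check would live wherever the health endpoint does.

  @probe_statement "SELECT 1"

  # True only when a trivial query round-trips to Presto. A pod stuck in the
  # checkout_timeout state answers false, so a liveness probe can restart it
  # instead of leaving it silently broken.
  def healthy?(query_fun \\ &run_probe/1) do
    case query_fun.(@probe_statement) do
      {:ok, _rows} -> true
      {:error, _reason} -> false
    end
  end

  # Placeholder for the real Presto call (e.g. through Prestige). The important
  # part is that pool exhaustion surfaces quickly as {:error, _} rather than
  # hanging the health endpoint, so keep the client-side timeout short.
  defp run_probe(statement) do
    {:error, {:not_implemented, statement}}
  end
end
```

The existing health endpoint could then return a non-200 status whenever healthy?/0 is false, which is enough for the liveness probe to restart the pod.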
Additional context
Originally discovered by @bmitchinson and @christinepoydence