Discovery API fails silently after high traffic (200 req / second) #1258
Comments
As per @LtChae, this error is likely caused by the worker pool in Prestige not being large enough to accommodate the number of requests.
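For reference, if the worker-pool theory holds, one knob worth checking is the HTTP connection pool underneath the Presto client: hackney (the client under HTTPoison) typically surfaces a saturated pool as a checkout_timeout error. Below is a minimal sketch of giving that traffic a larger, dedicated pool. The pool name, sizes, and the assumption that requests can be routed through it via HTTPoison options are all hypothetical, not the project's actual wiring.

```elixir
# Hypothetical sketch: give Presto HTTP traffic its own, larger hackney pool.
# Pool name and sizes are placeholders, not the project's real configuration.
defmodule PrestoPoolSketch do
  def start do
    children = [
      :hackney_pool.child_spec(:presto_pool, timeout: 15_000, max_connections: 200)
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  # Route requests through the named pool (HTTPoison passes this down to hackney),
  # so heavy query traffic waits on a bigger pool instead of hitting checkout_timeout.
  def get(url) do
    HTTPoison.get(url, [], hackney: [pool: :presto_pool])
  end
end
```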
After several further attempts to reproduce this, we were unable to crash the API as described. It would crash, but not silently: it would start up a new API pod within seconds, with minimal downtime. @christinepoydence, are you alright with me closing this until it is reproduced again? We can reopen at that time.
When making 50 requests to the API, after scaling its deployment to 4 instances, we received the same error (shown below and originally mentioned above) on one of the four pods. The pod was not reported as unhealthy, so we had to find which one it was, restart it, and attempt the test again. After the second attempt of the test, the same thing occurred, so it seems to be more consistently reproducible now. Is this a problem like Tim said, where we need to scale the Prestige workers, and the API pods are actually fine?
From that error, it looks like it may be timing out trying to get access to Presto. Try running the test with the Presto console up and watch to see if Presto struggles with the load.
@LtChae - we did this and didn't see any issues in the Presto console. It seemed to only receive 45 out of the 50 requests that we sent, but it handled those just fine.
Describe the bug
When stress testing discovery-api, if it becomes overwhelmed, its Presto connection will break and not automatically restore. The API becomes non-functional and does not know to restart itself.
To Reproduce
Steps to reproduce the behavior:
Stress test the endpoint /api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query with 200 users. (May take multiple attempts or a slight bump to the 200 number; see the load-generation sketch below.)
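A rough way to generate that load, sketched in Elixir (HTTPoison as the HTTP client and the base URL are assumptions; any tool that holds 200 concurrent requests open would do):

```elixir
# Hypothetical load generator: 200 concurrent GETs against the query endpoint.
# The base URL is a placeholder for whatever environment is under test.
url =
  "https://discovery-api.example.com" <>
    "/api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query"

1..200
|> Task.async_stream(
  fn _i -> HTTPoison.get(url, [], recv_timeout: 60_000) end,
  max_concurrency: 200,
  timeout: :infinity
)
|> Enum.frequencies_by(fn
  {:ok, {:ok, %HTTPoison.Response{status_code: code}}} -> code
  {:ok, {:error, %HTTPoison.Error{reason: reason}}} -> reason
  {:exit, reason} -> {:task_exit, reason}
end)
|> IO.inspect(label: "responses by status / error")
```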
Expected behavior
If the API reaches the broken Presto checkout_timeout state, that's fine, but it should at least be detected somehow, possibly in the health endpoint. This way the pod can restore itself to a functional state again.
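A minimal sketch of the kind of detection described above: the health endpoint runs a cheap probe query and reports unhealthy whenever it cannot reach Presto, so a Kubernetes liveness probe can recycle the stuck pod. The module name and the run_probe/1 placeholder are hypothetical; a real implementation would go through the project's Presto client (Prestige).

```elixir
defmodule DiscoveryApi.PrestoHealthCheck do
  # Hypothetical module; the real check would live wherever the health endpoint does.

  @probe_statement "SELECT 1"

  # True only when a trivial query round-trips to Presto. A pod stuck in the
  # checkout_timeout state answers false, so a liveness probe can restart it
  # instead of leaving it silently broken.
  def healthy?(query_fun \\ &run_probe/1) do
    case query_fun.(@probe_statement) do
      {:ok, _rows} -> true
      {:error, _reason} -> false
    end
  end

  # Placeholder for the real Presto call (e.g. through Prestige). The important
  # part is that pool exhaustion surfaces quickly as {:error, _} rather than
  # hanging the health endpoint, so keep the client-side timeout short.
  defp run_probe(statement) do
    {:error, {:not_implemented, statement}}
  end
end
```

The existing health endpoint could then return a non-200 status whenever healthy?/0 is false, which is enough for the liveness probe to restart the pod.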
Additional context
Originally discovered by @bmitchinson and @christinepoydence