[server] Fixed storage node read quota usage ratio spikes #1256

Open · xunyin8 wants to merge 1 commit into main

Conversation

xunyin8 (Contributor) commented Oct 23, 2024

[server] Fixed storage node read quota usage ratio spikes

  1. We observed unusually high quota usage ratio spikes (7x) without any rejected requests. This is mathematically impossible if the spikes were caused by an actual KPS spike, since our default sampling window is 30s and the token bucket capacity multiplier is only 5x. It is only possible when the node responsibility (the divisor) changed while we were still using the "old" KPS to calculate the usage ratio (see the sketch after this list). The fix is to re-initialize the KPS metric that is sampled at a 30s interval, so we avoid the situation where node responsibility is reduced and the most recent KPS is low, but the KPS used as the dividend remains high while the divisor has become much smaller, producing a spike in the usage ratio.

  2. Another issue was that node responsibility could be updated whenever the partition assignment changed, regardless of whether the change was actually for the serving/current version. This is bad because it not only causes the usage ratio to fluctuate in unexpected ways but also makes it inaccurate. The fix is to only update the store-level stats' node responsibility if the change is on the serving/current version. Otherwise we track it in a separate map and apply the update later, when we check for a current version change in handleStoreChanged.
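
To make the arithmetic in point 1 concrete, here is a minimal sketch of the usage-ratio calculation described above. All class and method names are illustrative, not the actual Venice code:

```java
// Illustrative sketch only; names are hypothetical, not the actual Venice code.
public final class UsageRatioSketch {
  // usage ratio = observed KPS / (node responsibility share * store read quota in KPS)
  static double usageRatio(double observedKps, double responsibilityShare, double storeQuotaKps) {
    return observedKps / (responsibilityShare * storeQuotaKps);
  }

  public static void main(String[] args) {
    double storeQuotaKps = 1000;
    // Node serves 50% of the store and observes 500 KPS: ratio = 1.0, as expected.
    System.out.println(usageRatio(500, 0.5, storeQuotaKps));
    // Responsibility drops to 10% after a re-assignment, but the 30s-sampled KPS
    // still reports the stale 500 KPS: the ratio jumps to 5.0 with no real traffic spike.
    System.out.println(usageRatio(500, 0.1, storeQuotaKps));
  }
}
```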

How was this PR tested?

New unit tests and existing integration tests

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

sushantmane (Contributor) commented:
Can you add some info about what the fix is for the problems found?

xunyin8 (Contributor, Author) commented Oct 23, 2024

> Can you add some info about what the fix is for the problems found?

Sure.
For 1, the fix is to re-initialize the KPS metric that is sampled at a 30s interval. This way we avoid the situation where node responsibility is reduced and the most recent KPS is low, but the KPS used as the dividend remains high while the divisor has become much smaller, producing a spike in the usage ratio.
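
A minimal sketch of what re-initializing the 30s-sampled KPS metric could look like; the class and method names here are hypothetical, not the actual Venice implementation:

```java
// Hypothetical sketch; not the actual Venice metric implementation.
class SampledKpsMetric {
  private static final long WINDOW_MS = 30_000; // 30s sampling window
  private long windowStartMs;
  private long keysInWindow;
  private double lastSampledKps;

  synchronized void record(long keyCount, long nowMs) {
    if (nowMs - windowStartMs >= WINDOW_MS) {
      // Close the old window and publish its rate before starting a new one.
      lastSampledKps = keysInWindow * 1000.0 / (nowMs - windowStartMs);
      windowStartMs = nowMs;
      keysInWindow = 0;
    }
    keysInWindow += keyCount;
  }

  // Called when node responsibility changes: drop the stale sample so the next
  // usage-ratio calculation does not pair an old, high KPS with a smaller divisor.
  synchronized void reinitialize(long nowMs) {
    windowStartMs = nowMs;
    keysInWindow = 0;
    lastSampledKps = 0;
  }

  synchronized double getKps() {
    return lastSampledKps;
  }
}
```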

For 2, we will only update the store-level stats' node responsibility if the change is on the serving/current version. Otherwise we track it in a separate map and apply the update later, when we check for a current version change in handleStoreChanged.
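
A sketch of that deferral; apart from handleStoreChanged (named above), every identifier is hypothetical:

```java
// Illustrative sketch; apart from handleStoreChanged (named above), all
// identifiers are hypothetical, not the actual Venice code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StoreReadQuotaStats {
  private volatile int currentVersion;
  private volatile int nodeResponsibility; // replicas this node serves for the current version
  // Responsibility computed for non-current versions, applied only on a version swap.
  private final Map<Integer, Integer> pendingNodeResponsibility = new ConcurrentHashMap<>();

  void onPartitionAssignmentChange(int version, int newResponsibility) {
    if (version == currentVersion) {
      nodeResponsibility = newResponsibility; // serving version: apply immediately
    } else {
      pendingNodeResponsibility.put(version, newResponsibility); // defer until the swap
    }
  }

  void handleStoreChanged(int newCurrentVersion) {
    if (newCurrentVersion != currentVersion) {
      currentVersion = newCurrentVersion;
      Integer pending = pendingNodeResponsibility.remove(newCurrentVersion);
      if (pending != null) {
        nodeResponsibility = pending; // apply the deferred update on the version swap
      }
    }
  }
}
```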

Taking the recommendation from Gaojie instead; I will update the commit message / PR description with the fixes for the problems found.

1. We observed unusually high quota usage ratio spikes (7x) without any rejected requests. This is mathematically impossible if the spikes were caused by an actual KPS spike, since our default sampling window is 30s and the token bucket capacity multiplier is only 5x. It is possible because we currently can have a mismatch between the KPS and the node responsibility of different versions (current, backup and future). I.e., depending on the timing, we could be calculating the ratio using the KPS of current + backup and dividing it by the node responsibility of a future version whose replicas are only partially assigned, resulting in a huge spike in the usage ratio.

2. To solve this we will monitor the KPS and QPS based on the corresponding versions. A version swap is not atomic when many routers and fast clients update their current-version metadata separately, so it is expected that for short periods of time we receive traffic for both the current and the backup version. We will track the requested quota stats separately and use the current version's stats to calculate the usage ratio. We can also alert on the backup version's requested stats, since they are expected to drop to 0 if all the fast clients and routers are updating their metadata correctly.

3. Quota rejection will continue to be enforced at the version level but emit stats at the store level. We could convert it to versioned as well, but currently we don't see the need to do so.
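
A rough sketch of the per-version tracking described in points 1–3; all identifiers are hypothetical, and window bookkeeping is simplified for brevity:

```java
// Hypothetical sketch of per-version read-quota stats; not the actual Venice code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

class VersionedReadQuotaStats {
  private final Map<Integer, LongAdder> keysServedPerVersion = new ConcurrentHashMap<>();

  void recordKeys(int version, long keyCount) {
    keysServedPerVersion.computeIfAbsent(version, v -> new LongAdder()).add(keyCount);
  }

  // The ratio is computed only from the current version's KPS and responsibility,
  // so backup/future-version traffic can no longer skew the dividend or the divisor.
  double usageRatio(int currentVersion, double responsibilityShare, double storeQuotaKps,
      double windowSeconds) {
    return versionKps(currentVersion, windowSeconds) / (responsibilityShare * storeQuotaKps);
  }

  // Backup-version KPS should drop to 0 once all routers and fast clients refresh
  // their metadata, so it is a useful alerting signal for stale clients.
  double versionKps(int version, double windowSeconds) {
    LongAdder adder = keysServedPerVersion.get(version);
    return adder == null ? 0 : adder.sum() / windowSeconds;
  }
}
```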