[server] Fixed storage node read quota usage ratio spikes #1256

Open · xunyin8 wants to merge 1 commit into main

Conversation

xunyin8 (Contributor) commented Oct 23, 2024

[server] Fixed storage node read quota usage ratio spikes

  1. We observed unusually high quota usage ratio spikes (7x) without any rejected requests. This is mathematically impossible if the spikes were caused by an actual KPS spike, since our default sampling window is 30s and the token bucket capacity multiplier is only 5x. It is only possible when the node responsibility (the divisor) changed while we were still using the "old" KPS to calculate the usage ratio (see the sketch after this list). The fix is to re-initialize the KPS metric that is sampled at a 30s interval, so we avoid the situation where node responsibility is reduced and the most recent KPS is low, but the KPS used as the dividend remains high while the divisor has become much smaller, producing a spike in the usage ratio.

  2. Another issue was that node responsibility could be updated whenever the partition assignment changed, regardless of whether the change was actually for the serving/current version. This is bad because it not only causes the usage ratio to fluctuate in unexpected ways but also makes it inaccurate. The fix is to only update the store-level stats' node responsibility if the change is on the serving/current version. Otherwise we track it in a separate map and apply the update later, when we check for a current version change in handleStoreChanged.
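
To make the arithmetic in point 1 concrete, here is a minimal sketch of the usage-ratio calculation described above. All class and method names are illustrative, not the actual Venice code:

```java
// Illustrative sketch only; names are hypothetical, not the actual Venice code.
public final class UsageRatioSketch {
  // usage ratio = observed KPS / (node responsibility share * store read quota in KPS)
  static double usageRatio(double observedKps, double responsibilityShare, double storeQuotaKps) {
    return observedKps / (responsibilityShare * storeQuotaKps);
  }

  public static void main(String[] args) {
    double storeQuotaKps = 1000;
    // Node serves 50% of the store and observes 500 KPS: ratio = 1.0, as expected.
    System.out.println(usageRatio(500, 0.5, storeQuotaKps));
    // Responsibility drops to 10% after a re-assignment, but the 30s-sampled KPS
    // still reports the stale 500 KPS: the ratio jumps to 5.0 with no real traffic spike.
    System.out.println(usageRatio(500, 0.1, storeQuotaKps));
  }
}
```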

How was this PR tested?

New unit tests and existing integration tests

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

sushantmane (Contributor) commented:
Can you add some info about what the fix is for the problems found?

xunyin8 (Contributor, Author) commented Oct 23, 2024

> Can you add some info about what the fix is for the problems found?

Sure.
For 1, the fix is to re-initialize the KPS metric that is sampled at a 30s interval. This way we avoid the situation where node responsibility is reduced and the most recent KPS is low, but the KPS used as the dividend remains high while the divisor has become much smaller, producing a spike in the usage ratio.
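
A minimal sketch of what re-initializing the 30s-sampled KPS metric could look like; the class and method names here are hypothetical, not the actual Venice implementation:

```java
// Hypothetical sketch; not the actual Venice metric implementation.
class SampledKpsMetric {
  private static final long WINDOW_MS = 30_000; // 30s sampling window
  private long windowStartMs;
  private long keysInWindow;
  private double lastSampledKps;

  synchronized void record(long keyCount, long nowMs) {
    if (nowMs - windowStartMs >= WINDOW_MS) {
      // Close the old window and publish its rate before starting a new one.
      lastSampledKps = keysInWindow * 1000.0 / (nowMs - windowStartMs);
      windowStartMs = nowMs;
      keysInWindow = 0;
    }
    keysInWindow += keyCount;
  }

  // Called when node responsibility changes: drop the stale sample so the next
  // usage-ratio calculation does not pair an old, high KPS with a smaller divisor.
  synchronized void reinitialize(long nowMs) {
    windowStartMs = nowMs;
    keysInWindow = 0;
    lastSampledKps = 0;
  }

  synchronized double getKps() {
    return lastSampledKps;
  }
}
```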

For 2, we will only update the store-level stats' node responsibility if the change is on the serving/current version. Otherwise we track it in a separate map and apply the update later, when we check for a current version change in handleStoreChanged.
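
A sketch of that deferral; apart from handleStoreChanged (named above), every identifier is hypothetical:

```java
// Illustrative sketch; apart from handleStoreChanged (named above), all
// identifiers are hypothetical, not the actual Venice code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StoreReadQuotaStats {
  private volatile int currentVersion;
  private volatile int nodeResponsibility; // replicas this node serves for the current version
  // Responsibility computed for non-current versions, applied only on a version swap.
  private final Map<Integer, Integer> pendingNodeResponsibility = new ConcurrentHashMap<>();

  void onPartitionAssignmentChange(int version, int newResponsibility) {
    if (version == currentVersion) {
      nodeResponsibility = newResponsibility; // serving version: apply immediately
    } else {
      pendingNodeResponsibility.put(version, newResponsibility); // defer until the swap
    }
  }

  void handleStoreChanged(int newCurrentVersion) {
    if (newCurrentVersion != currentVersion) {
      currentVersion = newCurrentVersion;
      Integer pending = pendingNodeResponsibility.remove(newCurrentVersion);
      if (pending != null) {
        nodeResponsibility = pending; // apply the deferred update on the version swap
      }
    }
  }
}
```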

Taking the recommendation from Gaojie instead; I will update the commit message / PR description with the fixes for the problems found.

1. We observed unusually high quota usage ratio spikes (7x) without any rejected requests. This is mathematically impossible if the spikes were caused by an actual KPS spike, since our default sampling window is 30s and the token bucket capacity multiplier is only 5x. It is possible because we currently can have a mismatch between the KPS and the node responsibility of different versions (current, backup and future). I.e., depending on the timing, we could be calculating the ratio using the KPS of current + backup and dividing it by the node responsibility of a future version whose replicas are only partially assigned, resulting in a huge spike in the usage ratio.

2. To solve this we will monitor the KPS and QPS based on the corresponding versions. A version swap is not atomic when many routers and fast clients update their current-version metadata separately, so it is expected that for short periods of time we receive traffic for both the current and the backup version. We will track the requested quota stats separately and use the current version's stats to calculate the usage ratio. We can also alert on the backup version's requested stats, since they are expected to drop to 0 if all the fast clients and routers are updating their metadata correctly.

3. Quota rejection will continue to be enforced at the version level but emit stats at the store level. We could convert it to versioned as well, but currently we don't see the need to do so.
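
A rough sketch of the per-version tracking described in points 1–3; all identifiers are hypothetical, and window bookkeeping is simplified for brevity:

```java
// Hypothetical sketch of per-version read-quota stats; not the actual Venice code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

class VersionedReadQuotaStats {
  private final Map<Integer, LongAdder> keysServedPerVersion = new ConcurrentHashMap<>();

  void recordKeys(int version, long keyCount) {
    keysServedPerVersion.computeIfAbsent(version, v -> new LongAdder()).add(keyCount);
  }

  // The ratio is computed only from the current version's KPS and responsibility,
  // so backup/future-version traffic can no longer skew the dividend or the divisor.
  double usageRatio(int currentVersion, double responsibilityShare, double storeQuotaKps,
      double windowSeconds) {
    return versionKps(currentVersion, windowSeconds) / (responsibilityShare * storeQuotaKps);
  }

  // Backup-version KPS should drop to 0 once all routers and fast clients refresh
  // their metadata, so it is a useful alerting signal for stale clients.
  double versionKps(int version, double windowSeconds) {
    LongAdder adder = keysServedPerVersion.get(version);
    return adder == null ? 0 : adder.sum() / windowSeconds;
  }
}
```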