[server] Fixed storage node read quota usage ratio spikes #1256
+394 −146
We observed unusually high quota usage ratio spikes (7x) without any rejected requests. This is mathematically impossible if the spikes were caused by an actual KPS spike, since our default sampling window is 30s and the token bucket capacity multiplier is only 5x. It is only possible when the node responsibility (the divisor) changed while we were still using the "old" KPS to calculate the usage ratio. The fix is to re-initialize the KPS metric, which is sampled at a 30s interval, whenever node responsibility changes. This way we avoid the situation where the node responsibility is reduced and the most recent KPS is low, but the KPS used as the dividend remains high while the divisor is much smaller, resulting in a spike in the usage ratio.
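The arithmetic above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the class and method names (`QuotaUsageMonitor`, `sample`, `update_responsibility`) are hypothetical, and the numbers are chosen only to reproduce the 7x scenario described.

```python
# Hypothetical sketch: usage ratio = KPS sampled over a 30s window,
# divided by the node's quota responsibility.
class QuotaUsageMonitor:
    def __init__(self, responsibility):
        self.responsibility = responsibility
        self.sampled_kps = 0.0

    def sample(self, keys_read, window_seconds=30):
        # KPS sampled at the (default) 30s interval.
        self.sampled_kps = keys_read / window_seconds

    def usage_ratio(self):
        return self.sampled_kps / self.responsibility

    def update_responsibility(self, new_responsibility):
        # The fix: re-initialize the sampled KPS whenever responsibility
        # changes, so a stale (high) dividend is never divided by a
        # freshly reduced divisor.
        self.responsibility = new_responsibility
        self.sampled_kps = 0.0


monitor = QuotaUsageMonitor(responsibility=1000)
monitor.sample(keys_read=21000)  # 700 KPS against a 1000-key responsibility
assert abs(monitor.usage_ratio() - 0.7) < 1e-9

# Without the fix: the stale 700 KPS divided by a responsibility that
# shrank to 100 reports a 7x usage ratio, with no traffic spike at all.
stale_ratio = monitor.sampled_kps / 100
assert abs(stale_ratio - 7.0) < 1e-9

# With the fix: the sample is reset, so the ratio restarts from zero
# and reflects only traffic observed under the new responsibility.
monitor.update_responsibility(100)
assert monitor.usage_ratio() == 0.0
```

The reset trades a brief under-report (one sampling window of zero) for never over-reporting, which is the right trade-off here since the over-report is what falsely suggested throttling was imminent.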
Another issue was that node responsibility could be updated whenever a partition assignment changed, regardless of whether the change was actually for the serving/current version. This is bad: it not only makes the usage ratio fluctuate in unexpected ways but also makes it inaccurate. The fix is to update the store-level stats node responsibility only if the change is on the serving/current version. Otherwise we track it in a separate map and apply the update later when we check for a current version change.
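The deferred-update logic can be sketched as follows. Again the names (`StoreQuotaStats`, `on_partition_assignment`, `on_current_version_change`) are hypothetical stand-ins for the real handlers; the point is only the control flow: non-serving versions park their responsibility in a pending map, which is drained when that version becomes current.

```python
# Hypothetical sketch: responsibility changes for non-serving versions
# are parked and only applied once that version becomes the current one.
class StoreQuotaStats:
    def __init__(self, serving_version, responsibility):
        self.serving_version = serving_version
        self.responsibility = responsibility
        self.pending = {}  # version -> responsibility not yet applied

    def on_partition_assignment(self, version, responsibility):
        if version == self.serving_version:
            # Change is on the serving version: apply immediately.
            self.responsibility = responsibility
        else:
            # Future (or backup) version: park it instead of letting it
            # fluctuate the serving version's usage ratio.
            self.pending[version] = responsibility

    def on_current_version_change(self, new_version):
        self.serving_version = new_version
        if new_version in self.pending:
            self.responsibility = self.pending.pop(new_version)


stats = StoreQuotaStats(serving_version=3, responsibility=1000)
stats.on_partition_assignment(version=4, responsibility=250)
assert stats.responsibility == 1000  # future version left serving stats alone
stats.on_current_version_change(4)
assert stats.responsibility == 250   # applied once v4 became current
```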
How was this PR tested?
New unit and existing integration tests
Does this PR introduce any user-facing changes?