Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries #9410

Open · bboreham wants to merge 1 commit into base: main

Conversation

bboreham (Contributor)

What this PR does

Suggested runbook text.

@bboreham requested review from tacole02 and a team as code owners on September 25, 2024 at 14:25
@pstibrany (Member) left a comment

Thank you, LGTM from a technical point of view.

@bboreham force-pushed the clarify-MimirIngesterReachingSeriesLimit branch from b54c094 to 043aa3d on September 25, 2024 at 15:22
@tacole02 (Contributor) left a comment

Looks good! A few minor suggestions.

@@ -41,7 +41,15 @@ If nothing obvious from the above, check for increased load:

### MimirIngesterReachingSeriesLimit

This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed.
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.

Suggested change
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester reaches that limit.

The threshold is set at 80%, to give some chance to react before the limit is reached.

Suggested change
The threshold is set at 80%, to give some chance to react before the limit is reached.
The threshold is set at 80% to give the chance to react before the limit is reached.
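For illustration, this is roughly how such a threshold check might look as a Prometheus-style alerting rule, built from the `cortex_ingester_memory_series` and `cortex_ingester_instance_limits` metrics that Mimir ingesters expose. This is a minimal sketch; the exact expression and `for` duration used by the real mimir-mixin alert may differ.

```yaml
groups:
  - name: mimir-ingester-series-limit
    rules:
      - alert: MimirIngesterReachingSeriesLimit
        # Fire when in-memory series reach 80% of the configured per-ingester
        # max_series limit, and only on ingesters where the limit is set (> 0).
        expr: |
          (
            cortex_ingester_memory_series
              / ignoring(limit)
            cortex_ingester_instance_limits{limit="max_series"}
          ) > 0.8
          and ignoring(limit)
          cortex_ingester_instance_limits{limit="max_series"} > 0
        for: 3h   # illustrative; leaves time to react before the hard limit
        labels:
          severity: warning
```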

Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.

Suggested change
Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.
After the limit is reached, writes to the ingester fail for new series. Appending samples to existing ones continues to succeed.

Can we avoid using "writes" as a noun? Could we say "write requests" or something else, if it's more accurate. We could then also say "Appending samples to existing requests continue to succeed" for greater clarity.

We avoid using "will" in the docs.

Note that the error responses sent back to the sender are classed as "server error" (5xx), which should result in a retry by the sender.

classed = classified?
Also, could we say "server errors" for agreement?

While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.

Suggested change
While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.
While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
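As an illustration of the queueing behaviour described here: with a Prometheus-style sender it is the remote-write queue that absorbs the backlog while batches are retried. A minimal sketch of the relevant sender-side settings follows; the endpoint URL and the values are illustrative only.

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # illustrative endpoint
    queue_config:
      # Each shard buffers samples read from the WAL. While the ingester
      # responds with 5xx, the same batch is retried with backoff and newer
      # samples queue up behind it; no data is dropped as long as the
      # condition clears before the sender's buffers are exhausted.
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 2000
      min_backoff: 30ms
      max_backoff: 5s
```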


If the condition is cleared in a short time, service can be restored with no data loss.

Suggested change
If the condition is cleared in a short time, service can be restored with no data loss.
If the condition is cleared in a short time, service is restored with no data loss.

This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.

Suggested change
This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.
This is different to what happens when the `max_global_series_per_user` limit is exceeded, which is considered a "client error" (4xx). In this case, excess data is discarded.
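For reference, a minimal sketch of where the two limits being contrasted are typically set in a Mimir configuration file, assuming the standard `ingester.instance_limits` and `limits` YAML blocks; the values are illustrative only.

```yaml
ingester:
  instance_limits:
    # Per-ingester hard cap on in-memory series. When reached, pushes that
    # would create new series fail with a 5xx and are retried by the sender.
    max_series: 1500000

limits:
  # Per-tenant limit enforced across the cluster. Exceeding it returns a
  # 4xx and the excess series are discarded rather than retried.
  max_global_series_per_user: 150000
```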
