Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries #9410

Open · bboreham wants to merge 1 commit into base: main

Conversation

bboreham (Contributor)

What this PR does

Suggested runbook text.

@bboreham requested review from tacole02 and a team as code owners on September 25, 2024 at 14:25
@pstibrany (Member) left a comment

Thank you, LGTM from a technical point of view.

@bboreham force-pushed the clarify-MimirIngesterReachingSeriesLimit branch from b54c094 to 043aa3d on September 25, 2024 at 15:22
@tacole02 (Contributor) left a comment

Looks good! A few minor suggestions.

@@ -41,7 +41,15 @@ If nothing obvious from the above, check for increased load:

### MimirIngesterReachingSeriesLimit

This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed.
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.

Suggested change
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester reaches that limit.

The threshold is set at 80%, to give some chance to react before the limit is reached.

Suggested change
The threshold is set at 80%, to give some chance to react before the limit is reached.
The threshold is set at 80% to give the chance to react before the limit is reached.
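For illustration, this is roughly how such a threshold check might look as a Prometheus-style alerting rule, built from the `cortex_ingester_memory_series` and `cortex_ingester_instance_limits` metrics that Mimir ingesters expose. This is a minimal sketch; the exact expression and `for` duration used by the real mimir-mixin alert may differ.

```yaml
groups:
  - name: mimir-ingester-series-limit
    rules:
      - alert: MimirIngesterReachingSeriesLimit
        # Fire when in-memory series reach 80% of the configured per-ingester
        # max_series limit, and only on ingesters where the limit is set (> 0).
        expr: |
          (
            cortex_ingester_memory_series
              / ignoring(limit)
            cortex_ingester_instance_limits{limit="max_series"}
          ) > 0.8
          and ignoring(limit)
          cortex_ingester_instance_limits{limit="max_series"} > 0
        for: 3h   # illustrative; leaves time to react before the hard limit
        labels:
          severity: warning
```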

Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.

Suggested change
Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.
After the limit is reached, writes to the ingester fail for new series. Appending samples to existing ones continues to succeed.

Can we avoid using "writes" as a noun? Could we say "write requests" or something else, if it's more accurate. We could then also say "Appending samples to existing requests continue to succeed" for greater clarity.

We avoid using "will" in the docs.

Note that the error responses sent back to the sender are classed as "server error" (5xx), which should result in a retry by the sender.

classed = classified?
Also, could we say "server errors" for agreement?

While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.

Suggested change
While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.
While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
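As an illustration of the queueing behaviour described here: with a Prometheus-style sender it is the remote-write queue that absorbs the backlog while batches are retried. A minimal sketch of the relevant sender-side settings follows; the endpoint URL and the values are illustrative only.

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # illustrative endpoint
    queue_config:
      # Each shard buffers samples read from the WAL. While the ingester
      # responds with 5xx, the same batch is retried with backoff and newer
      # samples queue up behind it; no data is dropped as long as the
      # condition clears before the sender's buffers are exhausted.
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 2000
      min_backoff: 30ms
      max_backoff: 5s
```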


If the condition is cleared in a short time, service can be restored with no data loss.

Suggested change
If the condition is cleared in a short time, service can be restored with no data loss.
If the condition is cleared in a short time, service is restored with no data loss.

This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.

Suggested change
This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.
This is different to what happens when the `max_global_series_per_user` limit is exceeded, which is considered a "client error" (4xx). In this case, excess data is discarded.
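For reference, a minimal sketch of where the two limits being contrasted are typically set in a Mimir configuration file, assuming the standard `ingester.instance_limits` and `limits` YAML blocks; the values are illustrative only.

```yaml
ingester:
  instance_limits:
    # Per-ingester hard cap on in-memory series. When reached, pushes that
    # would create new series fail with a 5xx and are retried by the sender.
    max_series: 1500000

limits:
  # Per-tenant limit enforced across the cluster. Exceeding it returns a
  # 4xx and the excess series are discarded rather than retried.
  max_global_series_per_user: 150000
```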
