Add graceful shutdown to loki.write, draining the WAL #5804
Conversation
Doc change is 👍
Force-pushed from 3514f0f to 6c33067
@@ -90,19 +91,22 @@ func TestUnmarshallWalAttrributes(t *testing.T) {
	MaxSegmentAge:    wal.DefaultMaxSegmentAge,
	MinReadFrequency: wal.DefaultWatchConfig.MinReadFrequency,
	MaxReadFrequency: wal.DefaultWatchConfig.MaxReadFrequency,
	DrainTimeout:     wal.DefaultWatchConfig.DrainTimeout,
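The DrainTimeout default asserted above bounds how long the watcher keeps consuming segments during shutdown. As a rough illustration only (the watcher type and caughtUp check below are hypothetical stand-ins, not this PR's actual API), a drain loop driven by that timeout could look like:

```go
package walsketch

import (
	"context"
	"time"
)

// watcher is a stand-in for the WAL watcher; caughtUp reports whether every
// written segment has been read and shipped.
type watcher interface {
	caughtUp() bool
}

// drain keeps polling until the watcher has consumed everything it can, or the
// configured drain timeout expires, whichever comes first.
func drain(ctx context.Context, w watcher, drainTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(ctx, drainTimeout)
	defer cancel()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // timed out: give up on the remaining segments
		case <-ticker.C:
			if w.caughtUp() {
				return // WAL fully drained, shutdown can proceed
			}
		}
	}
}
```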
Can we have one more config, for example MaxSegmentSize, which would delete segments once the WAL reaches a given disk-space size? It would be useful in situations where the following error is thrown:
msg=failed to write entry component=loki.write.remote err=write /tmp/agent/loki.write.remote/wal/00000569: no space left on device.
In my opinion, the device that backs the WAL running out of space is a smell that something else is going on, between:
- the agent scraping more than it can send
- the WAL volume being too small

Also, one should already be able to make the WAL delete all segments more rapidly with max_segment_age, which makes sense since the WAL should be thought of more as keeping the last X minutes of data rather than X bytes of data.

Can you share a bit more about your use case, to understand why the WAL volume is reaching its capacity?
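To make the age-based point above concrete, here is a hedged sketch of what "keep the last X minutes of data" amounts to. This is not the agent's actual reclaiming code; it keys off plain file modification times rather than segment metadata:

```go
package walsketch

import (
	"os"
	"path/filepath"
	"time"
)

// deleteOldSegments removes WAL segment files older than maxAge, so the WAL
// holds roughly the last maxAge worth of data rather than a fixed byte budget.
func deleteOldSegments(dir string, maxAge time.Duration) error {
	cutoff := time.Now().Add(-maxAge)
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.ModTime().Before(cutoff) {
			if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```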
Force-pushed from f6ea92c to 5cde5fc
@mattdurham all comments should be resolved
LGTM
* drain test and routing
* refactoring watcher to use state
* drain working
* fine-tuned test case
* added short timeout test
* prompt exit test
* prompt exit passing
* refactoring watcher
* some comments
* map river configs
* add docs
* splitting apart Stop and Drain
* minimize manager stop time
PR Description
This PR is a follow-up to some discussions in #5770.

The idea of this PR is to allow loki.write to do a graceful shutdown when the component is stopped. This graceful shutdown attempts to drain the WAL, making the watcher consume all of the segments it can before a timeout occurs. Note that two timeouts manage the WAL-enabled loki.write component: the WAL drain timeout and the queue client drain timeout.

Also, the Watcher is refactored a bit to simplify the "possible states" in which it operates.
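As a rough sketch of that two-stage shutdown (the type and method names below are hypothetical, not the PR's exact API):

```go
package walsketch

import "time"

// drainer is anything that can be asked to flush outstanding work within a deadline.
type drainer interface {
	Drain(timeout time.Duration)
}

// stop shuts the component down in two stages: first the WAL watcher gets
// walDrainTimeout to catch up on written segments, then the queue client gets
// its own window to flush queued and in-flight entries before closing.
func stop(walWatcher, queueClient drainer, walDrainTimeout, clientDrainTimeout time.Duration) {
	walWatcher.Drain(walDrainTimeout)
	queueClient.Drain(clientDrainTimeout)
}
```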
Which issue(s) this PR fixes
Related to https://github.com/grafana/cloud-onboarding/issues/5407
Notes to the Reviewer
PR Checklist