Add graceful shutdown to loki.write, draining the WAL #5804
Conversation
Doc change is 👍
Force-pushed from 3514f0f to 6c33067
@@ -90,19 +91,22 @@ func TestUnmarshallWalAttrributes(t *testing.T) {
	MaxSegmentAge:    wal.DefaultMaxSegmentAge,
	MinReadFrequency: wal.DefaultWatchConfig.MinReadFrequency,
	MaxReadFrequency: wal.DefaultWatchConfig.MaxReadFrequency,
	DrainTimeout:     wal.DefaultWatchConfig.DrainTimeout,
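The DrainTimeout default asserted above bounds how long the watcher keeps consuming segments during shutdown. As a rough illustration only (the watcher type and caughtUp check below are hypothetical stand-ins, not this PR's actual API), a drain loop driven by that timeout could look like:

```go
package walsketch

import (
	"context"
	"time"
)

// watcher is a stand-in for the WAL watcher; caughtUp reports whether every
// written segment has been read and shipped.
type watcher interface {
	caughtUp() bool
}

// drain keeps polling until the watcher has consumed everything it can, or the
// configured drain timeout expires, whichever comes first.
func drain(ctx context.Context, w watcher, drainTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(ctx, drainTimeout)
	defer cancel()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // timed out: give up on the remaining segments
		case <-ticker.C:
			if w.caughtUp() {
				return // WAL fully drained, shutdown can proceed
			}
		}
	}
}
```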
Can we have one more config, for example MaxSegmentSize, which would delete segments once the WAL reaches a given disk-space size? It would be useful in situations where the following error is thrown:
msg=failed to write entry component=loki.write.remote err=write /tmp/agent/loki.write.remote/wal/00000569: no space left on device.
In my opinion, the device that backs the WAL running out of space is a smell that something else is going on, between:
- the agent scraping more than it can send
- the WAL volume being too small

Also, one should already be able to make the WAL delete all segments more rapidly with max_segment_age, which makes sense since the WAL should be thought of more as keeping the last X minutes of data rather than X bytes of data.

Can you share a bit more about your use case, to understand why the WAL volume is reaching its capacity?
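To make the age-based point above concrete, here is a hedged sketch of what "keep the last X minutes of data" amounts to. This is not the agent's actual reclaiming code; it keys off plain file modification times rather than segment metadata:

```go
package walsketch

import (
	"os"
	"path/filepath"
	"time"
)

// deleteOldSegments removes WAL segment files older than maxAge, so the WAL
// holds roughly the last maxAge worth of data rather than a fixed byte budget.
func deleteOldSegments(dir string, maxAge time.Duration) error {
	cutoff := time.Now().Add(-maxAge)
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.ModTime().Before(cutoff) {
			if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```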
Force-pushed from f6ea92c to 5cde5fc
@mattdurham all comments should be resolved
LGTM
* drain test and routing
* refactoring watcher to use state
* drain working
* fine-tuned test case
* added short timeout test
* prompt exit test
* prompt exit passing
* refactoring watcher
* some comments
* map river configs
* add docs
* splitting apart Stop and Drain
* minimize manager stop time
PR Description
This PR is a follow-up to some discussions in #5770.

The idea of this PR is to allow loki.write to do a graceful shutdown when the component is stopped. This graceful shutdown attempts to drain the WAL, making the watcher consume all of the segments it can before a timeout occurs. Note that two timeouts manage the WAL-enabled loki.write component: the WAL drain timeout and the queue client drain timeout.

Also, the Watcher is refactored a bit to simplify the "possible states" in which it operates.
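As a rough sketch of that two-stage shutdown (the type and method names below are hypothetical, not the PR's exact API):

```go
package walsketch

import "time"

// drainer is anything that can be asked to flush outstanding work within a deadline.
type drainer interface {
	Drain(timeout time.Duration)
}

// stop shuts the component down in two stages: first the WAL watcher gets
// walDrainTimeout to catch up on written segments, then the queue client gets
// its own window to flush queued and in-flight entries before closing.
func stop(walWatcher, queueClient drainer, walDrainTimeout, clientDrainTimeout time.Duration) {
	walWatcher.Drain(walDrainTimeout)
	queueClient.Drain(clientDrainTimeout)
}
```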
Which issue(s) this PR fixes
Related to https://github.com/grafana/cloud-onboarding/issues/5407
Notes to the Reviewer
PR Checklist