
Add readyness API to net.server #5770

Closed · wants to merge 1 commit

Conversation

@thepalbi (Contributor)

PR Description

This PR adds an opt-in readiness API to common/net/server.go. The idea is that if one hosts a set of agents in something like Kubernetes, where readiness probes can be defined, this allows toggling the readiness state. This is useful for manually draining a pod, for example when using loki.source components that expose a network server. Assuming the agents are hosted in a StatefulSet, the draining procedure would be:

  1. Find the highest-numbered pod.
  2. Toggle the readiness state with the PUT /server/toggle_ready endpoint (a sketch of such an endpoint is shown below).
  3. Wait, using metrics, for the desired drained state (for example, using WAL metrics for loki.source and loki.write).
  4. Downscale the StatefulSet so that this pod can be evicted.
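
For illustration, a minimal sketch of what such an opt-in readiness toggle could look like on top of a gorilla/mux router. Only the PUT /server/toggle_ready path comes from this PR's description; the /server/ready probe path, the readyState type, and the function name are assumptions made for the sketch:

package server

import (
	"net/http"
	"sync/atomic"

	"github.com/gorilla/mux"
)

// readyState holds the toggleable readiness flag (hypothetical type for this sketch).
type readyState struct {
	ready atomic.Bool
}

// registerReadinessAPI wires a toggle endpoint and a probe endpoint onto the
// server's router.
func registerReadinessAPI(router *mux.Router, rs *readyState) {
	rs.ready.Store(true) // start ready

	// PUT /server/toggle_ready flips the readiness flag (step 2 of the drain procedure).
	router.Methods(http.MethodPut).Path("/server/toggle_ready").HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			rs.ready.Store(!rs.ready.Load()) // good enough for a sketch; not an atomic flip
			w.WriteHeader(http.StatusOK)
		})

	// GET /server/ready is what a Kubernetes readinessProbe would poll.
	router.Methods(http.MethodGet).Path("/server/ready").HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			if rs.ready.Load() {
				w.WriteHeader(http.StatusOK)
				return
			}
			w.WriteHeader(http.StatusServiceUnavailable)
		})
}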

Which issue(s) this PR fixes

Part of https://github.com/grafana/cloud-onboarding/issues/5407

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thepalbi (Contributor, Author)

@grafana/grafana-agent-signals-maintainers what do you think of this? Even though the procedure is manual, this will allow one to effectively drain the WAL from agent pods deployed in a StatefulSet, instead of just shutting them down without knowing whether there's still data in the WAL.

@@ -58,6 +63,28 @@ func NewTargetServer(logger log.Logger, metricsNamespace string, reg prometheus.
	return ts, nil
}

func (ts *TargetServer) registerManagementAPI(router *mux.Router) {
Review comment from a Collaborator:

Since we have some form of config management this may need a different name to be easily differentiated.

@mattdurham (Collaborator)

This feels like a global solution to a very specific fix. My intuition is that this state change should be pushed down to components so they can take individual action as well. For instance, any component that gets logs by polling, or any metrics scraper, will continue running while the readiness probe is false. This should also hook into clustering, since ideally this removes the pod from the cluster. @rfratto and @tpaschalis for second opinions.

@rfratto (Member) commented Nov 15, 2023

I also have some reservations about this, since I don't fully understand the use case yet and whether we need a new concept to solve that use case.

Why isn't the normal lifecycle of the pod sufficient, where a pod is terminated and it finishes any work that it needs to do on shutdown before fully exiting?

@thepalbi (Contributor, Author)

@rfratto so the use case is the following. Let's say you host a bunch of agents in a StatefulSet with a PVC for the WAL. All of them have a config like the following:

loki.source.awsfirehose "receiver" {
  http {
    listen_address = "0.0.0.0"
    listen_port = 8080
  }
  forward_to = [loki.write.cell.receiver]
  use_incoming_timestamp = false
}
loki.write "cell" {
  endpoint {
    max_backoff_period = "7s"
    url = "loki.com"
  }
  wal {
    enabled = true
  }
}

At some point we notice the agents are receiving a lot of traffic, so we scale up to N replicas. Later, when the surge has finished and traffic is back to normal, one would like to scale down the StatefulSet. Scaling down the StatefulSet means one of the agents inside it will be evicted, and its PVC lost. Since the PVC contains the WAL, we need some mechanism to perform a graceful shutdown, which looks as follows:

  1. The user decides to scale down the StatefulSet.
  2. The user decreases the number of replicas in the StatefulSet.
  3. A "shutdown with drain" signal is sent to the agent being evicted (in a StatefulSet that is the replica with the highest ordinal).
  4. The agent first shuts down the loki.source.* components, so no more data is received.
  5. The agent tells the WAL watcher to burn through the WAL.
  6. When step 5 is done (or a timeout expires), the agent exits.
  7. The StatefulSet now has one less replica, and the data that was in the WAL has been drained.

This PR implements an endpoint so the drain signal in step 3 can be sent manually, marking the agent as un-ready so the Kubernetes cluster stops routing more traffic to it. I think ideally we'd want this procedure to be automatic, with configurable timeouts, triggered by the signal that stops the agent. Any ideas/recommendations?
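
A rough sketch of what that automatic variant might look like: on SIGTERM, mark the server un-ready, then poll until the drain completes or a timeout expires. The function names here (markUnready, drained) are illustrative stand-ins, not existing agent APIs:

package server

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// drainOnShutdown blocks until SIGTERM, marks the server un-ready so
// Kubernetes stops routing new traffic to the pod, then waits for drained()
// to report an empty WAL (or for the timeout to expire) before returning.
func drainOnShutdown(markUnready func(), drained func() bool, timeout time.Duration) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs

	markUnready() // equivalent of hitting PUT /server/toggle_ready manually

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // timeout expired: give up and let the process exit anyway
		case <-ticker.C:
			if drained() { // e.g. WAL metrics report no pending data
				return
			}
		}
	}
}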

@rfratto (Member) commented Nov 15, 2023

If I understand your use case correctly, that behavior can already be achieved today without introducing new probes: have the Run method of your component flush data before returning after the context passed to Run is canceled.

When the agent shuts down today, the Flow controller will terminate all running components. The Flow controller will wait for all components to finish shutting down before finally exiting the process. Kubernetes has a grace period where it waits for a pod to exit gracefully before force killing it; you can tune that setting on the pod if you need more time to flush data.
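
As a rough illustration of that suggestion, a component's Run can finish draining its buffer after the controller cancels its context. The Component type and its fields below are stand-ins for the sketch, not the real loki.write implementation:

package example

import "context"

// Component is a stand-in for a Flow component that buffers data (e.g. a WAL).
type Component struct {
	incoming chan string
	buffer   []string
}

// Run processes entries until the Flow controller cancels ctx on shutdown,
// then flushes whatever is still buffered before returning; the controller
// (and the Kubernetes termination grace period) waits for Run to return.
func (c *Component) Run(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return c.flush()
		case entry := <-c.incoming:
			c.buffer = append(c.buffer, entry)
		}
	}
}

// flush would ship the remaining entries downstream; here it just drops them
// to keep the sketch self-contained.
func (c *Component) flush() error {
	c.buffer = nil
	return nil
}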

@thepalbi (Contributor, Author)

> If I understand your use case correctly, that behavior can already be achieved today without introducing new probes: have the Run method of your component flush data before returning after the context passed to Run is canceled.
>
> When the agent shuts down today, the Flow controller will terminate all running components. The Flow controller will wait for all components to finish shutting down before finally exiting the process. Kubernetes has a grace period where it waits for a pod to exit gracefully before force killing it; you can tune that setting on the pod if you need more time to flush data.

Mmm interesting, I'll explore that option. The only thing is, say in the example above I have loki.source.awsfirehose -> loki.write.cell. How does the controller determine the components' shutdown order? Is it all at once, or does it follow the DAG?

@rfratto (Member) commented Nov 16, 2023

> Mmm interesting, I'll explore that option. The only thing is, say in the example above I have loki.source.awsfirehose -> loki.write.cell. How does the controller determine the components' shutdown order? Is it all at once, or does it follow the DAG?

Currently, the controller will terminate all the components at the same time:

// Close stops the Scheduler and returns after all running goroutines have
// exited.
func (s *Scheduler) Close() error {
	s.cancel()
	s.running.Wait()
	return nil
}

@thepalbi (Contributor, Author)

Closing in favour of #5804

@thepalbi closed this on Nov 17, 2023
@github-actions bot added the frozen-due-to-age label (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) on Feb 21, 2024
@github-actions bot locked as resolved and limited conversation to collaborators on Feb 21, 2024