Support pausing and resuming consumers #4966

Closed
ripienaar opened this issue Jan 17, 2024 · 15 comments

@ripienaar
Contributor

ripienaar commented Jan 17, 2024

Proposed change

Introduce an API on $JS.API.CONSUMER.PAUSE.*.* that takes as request:

type JSApiConsumerPauseRequest struct {
    PauseUntil *time.Time `json:"pause_until,omitempty"`
}

The consumer will set itself in a paused state but continue to handle acks for in-flight messages. No further messages will be delivered after this point; apart from deliveries being inhibited, the consumer functions as usual.

If a pause deadline is given, a timer will auto-resume the consumer once it passes. If no time or a time in the past is given, a paused consumer will resume.
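As a rough illustration of the request described above, the sketch below builds the subject and JSON payload for a pause and for a resume. The helper names (`pauseSubject`, `encodePause`, `wouldPause`) and the stream and consumer names are hypothetical and not part of the proposal; only the request struct and the subject pattern come from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// exampleNow is a fixed clock reading so the example is deterministic.
var exampleNow = time.Date(2024, 1, 17, 12, 0, 0, 0, time.UTC)

// JSApiConsumerPauseRequest mirrors the request type proposed above.
type JSApiConsumerPauseRequest struct {
	PauseUntil *time.Time `json:"pause_until,omitempty"`
}

// pauseSubject is a hypothetical helper that fills in the two wildcard
// tokens of $JS.API.CONSUMER.PAUSE.*.* with a stream and consumer name.
func pauseSubject(stream, consumer string) string {
	return fmt.Sprintf("$JS.API.CONSUMER.PAUSE.%s.%s", stream, consumer)
}

// encodePause is a hypothetical helper that renders the request body.
// A nil time is omitted from the JSON, which the proposal treats as a resume.
func encodePause(until *time.Time) ([]byte, error) {
	return json.Marshal(JSApiConsumerPauseRequest{PauseUntil: until})
}

// wouldPause reports whether the request leaves the consumer paused at
// time now: only a deadline strictly in the future pauses it.
func wouldPause(req JSApiConsumerPauseRequest, now time.Time) bool {
	return req.PauseUntil != nil && req.PauseUntil.After(now)
}

func main() {
	until := exampleNow.Add(time.Hour)
	body, err := encodePause(&until)
	if err != nil {
		panic(err)
	}
	fmt.Println(pauseSubject("ORDERS", "worker"), string(body))

	resume, err := encodePause(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(pauseSubject("ORDERS", "worker"), string(resume))
}
```

Sending the payload would be an ordinary NATS request on that subject; the client call is left out here to keep the sketch free of dependencies.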

Consumer info includes 2 new fields:

  Paused bool `json:"paused,omitempty"`
  PauseRemaining time.Duration `json:"pause_remaining,omitempty"`
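One way these two fields could be derived from a stored pause deadline is sketched below. The `pauseInfo` type and the `pauseInfoAt` function are hypothetical names used for illustration; the server's actual consumer info type carries many more fields.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed clock reading so the example is deterministic.
var exampleNow = time.Date(2024, 1, 17, 12, 0, 0, 0, time.UTC)

// pauseInfo holds only the two fields proposed above (hypothetical type).
type pauseInfo struct {
	Paused         bool          `json:"paused,omitempty"`
	PauseRemaining time.Duration `json:"pause_remaining,omitempty"`
}

// pauseInfoAt derives the info fields from a pause deadline. A zero or
// past deadline means the consumer is not paused and both fields stay zero.
func pauseInfoAt(pauseUntil, now time.Time) pauseInfo {
	if !pauseUntil.After(now) {
		return pauseInfo{}
	}
	return pauseInfo{Paused: true, PauseRemaining: pauseUntil.Sub(now)}
}

func main() {
	fmt.Println(pauseInfoAt(exampleNow.Add(10*time.Minute), exampleNow))
	fmt.Println(pauseInfoAt(exampleNow.Add(-time.Minute), exampleNow))
}
```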

The paused state and time would need to be persisted to the raft layer so that server restarts do not unpause paused consumers. This is done using the consumer configuration, which has a new value:

PauseUntil time.Time `json:"pause_until,omitempty"`

When given at create time this creates a paused consumer. It is not updatable at runtime using a configuration update; essentially the only way to change this setting post-create is with the PAUSE API.
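The split between configuration updates and the PAUSE API could look roughly like the sketch below. It assumes a configuration update silently keeps the stored deadline (rejecting the update would be another valid reading of "not updatable"); `consumerConfig`, `applyConfigUpdate` and `applyPause` are hypothetical names, not server code.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed clock reading so the example is deterministic.
var exampleNow = time.Date(2024, 1, 17, 12, 0, 0, 0, time.UTC)

// consumerConfig is a heavily trimmed, hypothetical stand-in for the
// real consumer configuration.
type consumerConfig struct {
	Durable     string
	Description string
	PauseUntil  time.Time
}

// applyConfigUpdate applies a regular configuration update. Assumption:
// the stored pause deadline always wins, so an application re-sending its
// config at startup cannot pause or unpause the consumer.
func applyConfigUpdate(current, requested consumerConfig) consumerConfig {
	requested.PauseUntil = current.PauseUntil
	return requested
}

// applyPause models the PAUSE API, the only path that changes the deadline.
func applyPause(current consumerConfig, until time.Time) consumerConfig {
	current.PauseUntil = until
	return current
}

func main() {
	cfg := consumerConfig{Durable: "worker"}
	cfg = applyPause(cfg, exampleNow.Add(time.Hour))

	// A restarting application re-sends its config without a pause deadline.
	cfg = applyConfigUpdate(cfg, consumerConfig{Durable: "worker", Description: "v2"})
	fmt.Println(cfg.Description, cfg.PauseUntil.Equal(exampleNow.Add(time.Hour)))
}
```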

Advisories for pause and unpause to be added on io.nats.jetstream.advisory.v1.consumer_pause with pertinent info.

Use case

It is difficult to schedule maintenance on central resources on a large distributed system where 100s or 1000s of clients are accessing data in a stream.

We would like to be able to pause a Consumer such that it appears healthy but just doesn't deliver any messages.

During the pause, maintenance can happen and resources accessed by clients will not be under constant pressure. Later the consumer can be unpaused and work will continue.

This would happen without impacting running clients - other than they would see pending messages in stream info but not get any deliveries.

This would apply to push and pull consumers.

Contribution

No response

@ripienaar ripienaar added the proposal Enhancement idea or proposal label Jan 17, 2024
@bruth bruth added this to the 2.11.0 milestone Jan 17, 2024
@derekcollison derekcollison self-assigned this Jan 17, 2024
@derekcollison
Member

Should delay just be a parseable string? "1s", "2h"? If we can't parse we return an error.

Do we want to have maximum and minimums or start simple and add in limits as needed?

@ripienaar
Contributor Author

We don’t have other cases of such strings in the API, and it’s also a bit Go-centric, so Duration seems best. Let UIs handle it as they wish, be it strings like that in the CLI or some kind of picker on the web.

Let’s start simple.

@derekcollison
Member

OK, but if we use time.Duration then it's nanos, not millis. But I hear you on consistency.

@ripienaar
Contributor Author

Indeed - nanos. Will fix.
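For reference, Go's `encoding/json` writes a `time.Duration` as a plain integer count of nanoseconds, which is what the exchange above refers to. A minimal sketch, using a hypothetical `remainingInfo` type that carries only the `pause_remaining` field:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// remainingInfo is a hypothetical type with just the duration field.
type remainingInfo struct {
	PauseRemaining time.Duration `json:"pause_remaining,omitempty"`
}

// encodeRemaining returns the JSON form of the given duration, or an
// empty string if encoding fails.
func encodeRemaining(d time.Duration) string {
	b, err := json.Marshal(remainingInfo{PauseRemaining: d})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// 1.5 seconds is encoded as 1500000000 nanoseconds.
	fmt.Println(encodeRemaining(1500 * time.Millisecond))
	// A zero duration is dropped by omitempty.
	fmt.Println(encodeRemaining(0))
}
```

Non-Go clients reading this field would need to divide by 1,000,000 to get milliseconds.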

@derekcollison
Member

@neilalexander and @Jarema could you work with @ripienaar and this writeup and schedule this work?

@Jarema
Member

Jarema commented Jan 23, 2024

@derekcollison this has been scheduled to start on the 5th of February, with a plan to finish before the 16th of February. @neilalexander will be working on it.

@Jarema
Member

Jarema commented Feb 6, 2024

@ripienaar @neilalexander Can I ask for an update of the final design after recent discussions?

@ripienaar
Contributor Author

From my perspective I think the pause/resume APIs are still the right direction. Details of how we actually implement that in a way that's not massive plumbing in the server are for @neilalexander to comment on.

@derekcollison
Member

I vote it should just be part of the consumer config, with no new API endpoints.

@ripienaar
Contributor Author

At this point I'd say let's just not add this feature. We can go back and find requirements.

As it stands, the few requirements we do have will not be met without these extra APIs, so let's just close the issue and move on.

@derekcollison
Member

I thought it would be easier but not impossible. You are saying they would require securing just that functionality vs. general update, yes? And without general callouts we only have new APIs to secure individually, is that correct?

@ripienaar
Contributor Author

Yes, I think there is a need to cater for 2 distinct users - operational needs and configuration needs. Often configuration may not be changed without approvals by change advisory boards etc.

Doing maintenance should not require a configuration change.

Those doing maintenance should not need to be authorized to do a configuration change.

@ripienaar
Contributor Author

Capturing a discussion that keeps coming up around this one:

Question: Should the paused until configuration be updatable as configuration?
Answer: We have the pattern where updates to consumer configs are idempotent, and as a result applications often set their configuration at startup. We added the action to help distinguish a bit; it's problematic though, as that is not something one can do authz against today.

Given this pattern, the question is who owns this property. If an administrator sets the pause state to x and the app starting up sets it to start-paused or unpaused, how is the system to distinguish between a normal app making the API call to create a paused/unpaused consumer and an admin asking the consumer to be paused?

I don't think the API has the context of who is calling it and for what reason, and it would be undesirable to allow an unexpected config update by a starting worker to unpause a consumer.

It's essential that the responsibilities of creation and administration be separate here. It could be created paused, but an administrator must be able to unpause it and know that if that creation is run again it will not again be paused. Or, if an administrator overrides the pause from 1 hour to 10 minutes, that a service startup does not set it back to 1 hour.

I can't think of a way to capture this distinction (except maybe (ab)using the action property? But see the authz comments about roles and responsibilities). Happy to hear if there's a design solution that both allows this property to be updated as config and retains the ownership of who has responsibility for its management.

@ripienaar
Contributor Author

Related server PR #5066

@Jarema
Member

Jarema commented Feb 28, 2024

Server PR has been merged 🎉
Closing the issue.
This feature will be part of release 2.11

@Jarema Jarema closed this as completed Feb 28, 2024