Handle SIGTERM gracefully #48

Open
zerebubuth opened this issue Nov 8, 2016 · 4 comments

@zerebubuth
Member

Upon receiving SIGTERM, tileserver should:

  1. Start responding to health check requests with an error.
  2. Wait a configurable grace period, or until all outstanding requests have finished.
  3. Shut down.

This lets tileserver work with ELB connection draining or HAProxy, so that it can terminate without dropping any requests. If shutdowns and upgrades are also staggered so that only part of the cluster is down at any one time, then no requests are lost.
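
The three steps above could be sketched roughly as follows. This is only an illustration using the Python standard library, not tileserver's actual code: the class `GracefulShutdown`, its method names, and the 30 second default are all hypothetical, and it assumes the server calls `request_started()` / `request_finished()` around each request.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Hypothetical helper: tracks in-flight requests and a shutdown flag."""

    def __init__(self, grace_period_seconds=30.0):
        self.grace_period_seconds = grace_period_seconds
        self.shutting_down = False
        self._in_flight = 0
        self._lock = threading.Lock()

    def install(self):
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def request_shutdown(self, signum=None, frame=None):
        # Only flip a flag here, so the handler never blocks on a lock.
        self.shutting_down = True

    def health_status(self):
        # Step 1: fail the health check once shutdown has been requested.
        return 503 if self.shutting_down else 200

    def request_started(self):
        with self._lock:
            self._in_flight += 1

    def request_finished(self):
        with self._lock:
            self._in_flight -= 1

    def in_flight(self):
        with self._lock:
            return self._in_flight

    def drain(self, poll_interval=0.05):
        """Step 2: wait until no requests are in flight or the grace period ends.

        Returns True if all outstanding requests finished in time.
        """
        deadline = time.monotonic() + self.grace_period_seconds
        while time.monotonic() < deadline:
            if self.in_flight() == 0:
                return True
            time.sleep(poll_interval)
        return self.in_flight() == 0
```

The health check endpoint would return `health_status()`, and once `shutting_down` is set the main loop would call `drain()` and then exit (step 3). Note that returning as soon as the in-flight count reaches zero follows the wording above; a deployment might instead prefer to always wait long enough for the load balancer to notice the failing health check.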

@rmarianski
Member

This is a good idea. On the one hand we can avoid this by rolling in new instances, but on the other, it's much easier to just run a deploy command in OpsWorks.

It might be worth pushing this problem out of tileserver and into an OpsWorks tool that rolls in the deploy for us. That way it's solved for any service in an OpsWorks layer. Or maybe there's a way to avoid rolling in deploys while still handling this mostly outside the actual process. I wonder if we can unregister the instance from the ELB, wait until it's unregistered, and then re-register it once it's restarted. I'm assuming that the wait step handles the connection draining for us, and that OpsWorks wouldn't fight us by trying to re-register the instance in the interim because it's still in the layer.
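
As a rough sketch of that unregister / wait / re-register idea: the function below takes a classic ELB client such as the one returned by `boto3.client("elb")` and uses its `deregister_instances_from_load_balancer`, `register_instances_with_load_balancer`, and the `instance_deregistered` / `instance_in_service` waiters. The function name `restart_behind_elb` and the `restart_service` callback are hypothetical, and whether the deregistration wait really covers connection draining, or whether OpsWorks re-registers the instance in the meantime, are the same open assumptions as above.

```python
def restart_behind_elb(elb, load_balancer_name, instance_id, restart_service):
    """Take one instance out of the ELB, restart it, and put it back.

    `elb` is assumed to be a classic ELB client, e.g. boto3.client("elb").
    `restart_service` is a hypothetical callable that restarts tileserver.
    """
    target = {
        "LoadBalancerName": load_balancer_name,
        "Instances": [{"InstanceId": instance_id}],
    }

    # Ask the ELB to stop routing new requests to this instance.
    elb.deregister_instances_from_load_balancer(**target)

    # Block until the ELB reports the instance as deregistered.
    elb.get_waiter("instance_deregistered").wait(**target)

    # The instance should no longer receive traffic, so restart it.
    restart_service()

    # Put the instance back and wait until it passes health checks.
    elb.register_instances_with_load_balancer(**target)
    elb.get_waiter("instance_in_service").wait(**target)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and running it against one instance at a time gives the staggered restart described earlier.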

@zerebubuth
Member Author

I think both mechanisms would be good to have.

Rolling the deploy requires outside tooling, which is great for anything that is compatible with it. But I wouldn't be confident that it covers 100% of all cases that the service could be stopped. Handling SIGTERM internally is then a safety net for those (hopefully rare) cases where tileserver is stopped outside of a rolling deploy.

@rmarianski
Member

> But I wouldn't be confident that it covers 100% of all cases that the service could be stopped.

Just curious, what kind of cases would this be?
