[SRVKS-1171] [RELEASE-1.12] Graceful shutdown #130

skonto · 2024-04-29T11:43:09Z

Backport of "Gracefully drain connections when stopping the gateway" (knative-extensions/net-kourier#1203).

@skonto

…sions#1203)

* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image
* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)
* Drain listeners with appropriate endpoint
* Simpler drain + sleep
* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds
* Use a perl script (no need to open the admin HTTP interface!)
* Use bash instead of perl in preStop hook
* Review @skonto comments: use socket address for admin cluster
* [WIP] add graceful shutdown test and tweak CI to just run that test
* [WIP] Fix gracefulshutdown_test.go
* [WIP] try to fix race condition and lint
* [WIP] use initialTimeout + debug
* [WIP] fix gracefulshutdown_test.go logic
* [WIP] refacto and add some comments to clarify
* [WIP] fix lint
* [WIP] reintroduce kind-e2e-upgrade.yaml
* [WIP] add test case when request takes a little longer than the drain time
* [WIP] fix compilation issue
* [WIP] FIx compilation issue (again)
* [WIP] hopefully fix data race
* [WIP] refacto and hopefully fix race condition (use sync.Map)
* [WIP] fix compilation issue
* [WIP] Handle EOF
* [WIP] check gateway pod has been removed + manual debugging
* [WIP] debugging
* [WIP] more debugging
* [WIP] more debugging
* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF
* [WIP] remove debugging related stuff
* Revert all unnecessary changes made for testing
* Revert unnecessary change (livenessProbe)
* Scale to 1 replica
* Typo
* Run gracefulshutdown test first (speed up feedback loop)
* Add a comment for terminationGracePeriodSeconds
* Don't update deployment twice
  Patch env and terminationGracePeriodSeconds at the same time
* Fix bad patch
* Run gracefulshutdown test at the end
  - avoids conflicts with other tests
  - change gracefulshutdown test to delete all gateway pods
* Fix gracefulshutdown test
* Fix gracefulshutdown test
* Lint

skonto · 2024-04-29T11:44:06Z

/hold

I will test this downstream on the S-O side.

codecov-commenter · 2024-04-29T11:48:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.31%. Comparing base (d7abd37) to head (c228fcc).

Additional details and impacted files

@@                Coverage Diff                @@
##           release-v1.12     #130      +/-   ##
=================================================
+ Coverage          60.63%   62.31%   +1.67%     
=================================================
  Files                 24       24              
  Lines               2002     1632     -370     
=================================================
- Hits                1214     1017     -197     
+ Misses               726      553     -173     
  Partials              62       62
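The percentages in the table follow from hits divided by total lines, where lines = hits + misses + partials (1214 + 726 + 62 = 2002 for the base, 1017 + 553 + 62 = 1632 for the head). A minimal sketch of that arithmetic follows. It truncates to two decimals, which is an assumption inferred from the figures above: rounding would give 60.64% and 62.32% instead. The function names are hypothetical.

```go
package main

import "fmt"

// truncatedBasisPoints returns hits/lines in hundredths of a percent,
// truncated by integer division (assumed truncation, see lead-in).
func truncatedBasisPoints(hits, lines int) int {
	return hits * 10000 / lines
}

// formatPercent renders hundredths of a percent as "NN.NN%".
func formatPercent(bp int) string {
	return fmt.Sprintf("%d.%02d%%", bp/100, bp%100)
}

func main() {
	base := truncatedBasisPoints(1214, 2002) // release-v1.12 before the PR
	head := truncatedBasisPoints(1017, 1632) // PR head
	fmt.Println(formatPercent(base), formatPercent(head)) // prints "60.63% 62.31%"
}
```

Note that coverage rose even though hits fell by 197, because total lines fell by 370.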


ReToCode

/lgtm
/approve

openshift-ci · 2024-04-29T12:10:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ReToCode, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ReToCode,skonto]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ReToCode · 2024-04-29T12:11:01Z

/cherry-pick release-v1.14

openshift-cherrypick-robot · 2024-04-29T12:11:04Z

@ReToCode: once the present PR merges, I will cherry-pick it on top of release-v1.14 in a new PR and assign it to you.

In response to this:

/cherry-pick release-v1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

skonto · 2024-04-29T12:12:06Z

openshift/release/artifacts/net-kourier.yaml

@@ -526,7 +542,9 @@ spec:
              memory: 200Mi
            limits:
              cpu: "1"
-              memory: 500Mi
+              memory: 800Mi


Bringing this in from upstream. @ReToCode, do we need to update any docs if we bring this in?

ReToCode · 2024-04-29

I don't think so. We have not documented limits, only what typical usage looks like with a certain number of services. Maybe add a release note to raise awareness?

skonto · 2024-04-30T07:03:42Z

/unhold

Tests passed.

openshift-cherrypick-robot · 2024-04-30T07:06:54Z

@ReToCode: #130 failed to apply on top of branch "release-v1.14":

Applying: Gracefully drain connections when stopping the gateway (#1203)
Using index info to reconstruct a base tree...
M	config/300-gateway.yaml
Falling back to patching base and 3-way merge...
Auto-merging config/300-gateway.yaml
CONFLICT (content): Merge conflict in config/300-gateway.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Gracefully drain connections when stopping the gateway (#1203)
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@skonto

* [SRVKS-1171] [RELEASE-1.12] Graceful shutdown (#130)
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)
* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image
* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)
* Drain listeners with appropriate endpoint
* Simpler drain + sleep
* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds
* Use a perl script (no need to open the admin HTTP interface!)
* Use bash instead of perl in preStop hook
* Review @skonto comments: use socket address for admin cluster
* [WIP] add graceful shutdown test and tweak CI to just run that test
* [WIP] Fix gracefulshutdown_test.go
* [WIP] try to fix race condition and lint
* [WIP] use initialTimeout + debug
* [WIP] fix gracefulshutdown_test.go logic
* [WIP] refacto and add some comments to clarify
* [WIP] fix lint
* [WIP] reintroduce kind-e2e-upgrade.yaml
* [WIP] add test case when request takes a little longer than the drain time
* [WIP] fix compilation issue
* [WIP] FIx compilation issue (again)
* [WIP] hopefully fix data race
* [WIP] refacto and hopefully fix race condition (use sync.Map)
* [WIP] fix compilation issue
* [WIP] Handle EOF
* [WIP] check gateway pod has been removed + manual debugging
* [WIP] debugging
* [WIP] more debugging
* [WIP] more debugging
* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF
* [WIP] remove debugging related stuff
* Revert all unnecessary changes made for testing
* Revert unnecessary change (livenessProbe)
* Scale to 1 replica
* Typo
* Run gracefulshutdown test first (speed up feedback loop)
* Add a comment for terminationGracePeriodSeconds
* Don't update deployment twice
  Patch env and terminationGracePeriodSeconds at the same time
* Fix bad patch
* Run gracefulshutdown test at the end
  - avoids conflicts with other tests
  - change gracefulshutdown test to delete all gateway pods
* Fix gracefulshutdown test
* Fix gracefulshutdown test
* Lint
* run hack/update-deps.sh
* update openshift files

---------

Co-authored-by: norbjd <[email protected]>

* update deps

---------

Co-authored-by: norbjd <[email protected]>

norbjd and others added 2 commits April 29, 2024 14:36

run hack/update-deps.sh

a1e5ec2

openshift-ci bot requested review from alanfx and ReToCode April 29, 2024 11:43

openshift-ci bot added the approved label Apr 29, 2024

openshift-ci bot added the do-not-merge/hold label Apr 29, 2024

skonto changed the title ~~[SRVKS-1171] [RELEASE-1.12] Graceful shutdown 1.12~~ [SRVKS-1171] [RELEASE-1.12] Graceful shutdown Apr 29, 2024

update openshift files

c228fcc

ReToCode approved these changes Apr 29, 2024

openshift-ci bot assigned ReToCode Apr 29, 2024

openshift-ci bot added the lgtm label Apr 29, 2024

skonto commented Apr 29, 2024

skonto mentioned this pull request Apr 29, 2024

[wip] Test Kourier graceful shutdown openshift-knative/serverless-operator#2623

Closed

openshift-ci bot removed the do-not-merge/hold label Apr 30, 2024

openshift-merge-bot bot merged commit 553b97a into openshift-knative:release-v1.12 Apr 30, 2024
18 checks passed

This was referenced May 14, 2024

Bump Serving manifests to 1.14 openshift-knative/serverless-operator#2656

Merged

Graceful shutdown 1.14 #134

Merged
