DRIVERS-2991 killport: revert to using SIGKILL #512
.evergreen/start-orchestration.sh
Outdated
@@ -67,17 +67,21 @@ killport() {

  if [[ "${OSTYPE:?}" == cygwin || "${OSTYPE:?}" == msys ]]; then
    for pid in $(netstat -ano | grep ":$port .* LISTENING" | awk '{print $5}' | tr -d '[:space:]'); do
      taskkill /F /T /PID "$pid" || true
`/F` already force kills, so we shouldn't need to wait here.
This PR proposes using `kill` instead of `taskkill` to extend graceful termination + timeout->kill to Windows distros as well. Please re-review.
Should we revert this now as well?
I am in favor of keeping this change as a simplification + consistency improvement given this script is known to be Bash-executed. The less distro-specific patterns we need to depend on, the better.
I'm concerned about this one. `taskkill /F /T` kills the process and all of its child processes, which seems safer than just `kill -SIGKILL`. I don't want to make Windows less stable after this change, so I'd prefer we keep using `taskkill`.
Done.
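The child-process concern raised above can be reproduced on a POSIX system. The following is a minimal sketch, not taken from the PR: it starts a parent shell with one child, sends SIGKILL to the parent only, and checks whether the child is still alive. The variable names (`pidfile`, `parent`, `child`, `child_survived`) are illustrative, and the sketch does not exercise `taskkill` itself.

```shell
#!/usr/bin/env bash
# Illustrative only: SIGKILL sent to a parent PID is not forwarded to its children.
pidfile="$(mktemp)"

# The parent shell starts a child `sleep`, records the child's PID, then waits on it.
bash -c 'sleep 30 & echo "$!" > "$1"; wait' _ "$pidfile" &
parent=$!

# Wait until the parent has written the child's PID.
while [ ! -s "$pidfile" ]; do sleep 0.1; done
child="$(cat "$pidfile")"

# SIGKILL the parent only, then reap it.
kill -9 "$parent"
wait "$parent" 2>/dev/null || true

# `kill -0` sends no signal; it only checks that the PID still exists.
if kill -0 "$child" 2>/dev/null; then
  child_survived=yes
else
  child_survived=no
fi
echo "child survived: $child_survived"

# Clean up the orphaned child and the temporary file.
kill -9 "$child" 2>/dev/null || true
rm -f "$pidfile"
```

Because the signal targets a single PID, anything the killed process spawned keeps running unless it is signalled separately, which is the gap that `taskkill /T` covers on Windows.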
.evergreen/start-orchestration.sh
Outdated
    done
  elif [ -x "$(command -v lsof)" ]; then
    for pid in $(lsof -t "-i:$port" || true); do
      kill "$pid" || true
      timeout 60s bash -c "while kill -0 \"$pid\" 2>/dev/null; do sleep 1; done" || kill -9 "$pid" || true
The reason I changed it to SIGTERM was to give MO a chance to clean up its mongo processes, but that doesn't seem to work. If we're going back to `kill -9` (SIGKILL) instead of SIGTERM, can we remove the timeout logic?
The use of `-9` in the while loop was a typo; `-0` was meant to be used instead. Please re-review.
Wouldn't it be simpler to just `kill -9`? What benefit do we get from trying to shut down gracefully?
Quoting @nbbeeken from Slack discussions regarding #506:
> The old kill commands passed a `-9` flag (SIGKILL) before the PID. Default signal for `kill` without args is SIGTERM (15). Should the new kill commands be passing SIGKILL as well?
>
> The switch to sigterm was intentional because mongo-orch will try to clean up the servers gracefully, but maybe there's a bug in that logic and we should send -9 anyway :blob-sad: you try to be a 'good citizen' and look what it gets you

If we want to revert to the simpler "just SIGKILL" routines we had before, that would also address the flaky "address already in use" problem this PR is trying to solve.
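The difference between the default signal and `-9` mentioned in the quote can be checked directly in Bash. This is a small sketch with illustrative variable names (`term_name`, `kill_name`), not code from the PR; it relies on Bash's `kill -l` accepting the exit status of a signal-terminated process (128 plus the signal number) and printing the signal name.

```shell
#!/usr/bin/env bash
# Illustrative only: compare plain `kill` with `kill -9` via the target's exit status.

# Plain `kill` sends the default signal.
sleep 30 &
pid=$!
kill "$pid"
status=0
wait "$pid" || status=$?
term_name="$(kill -l "$status")"   # map the exit status back to a signal name

# `kill -9` sends SIGKILL explicitly.
sleep 30 &
pid=$!
kill -9 "$pid"
status=0
wait "$pid" || status=$?
kill_name="$(kill -l "$status")"

echo "default signal: $term_name, with -9: $kill_name"
```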
In Jupyter we use the pattern of SIGTERM, wait on pid, SIGKILL as well. I think it is a good pattern in general.
Yeah I'd prefer SIGKILL at this point because it should save time (don't need to wait 60 seconds) and will make this code easier to read.
Updated PR so changes are to revert to using
LGTM!
Important
After external discussion, agreed to revert to using SIGKILL instead of dealing with SIGTERM + wait complexity. PR updated accordingly; prior PR description preserved below.
This PR adds routines to wait up to 1 minute for existing processes to gracefully terminate before sending a SIGKILL.
#506 downgraded signals from SIGKILL to SIGTERM to allow existing processes (usually a MongoDB server instance) to gracefully terminate. However, this leads to flaky `OSError: No socket could be created -- (('127.0.0.1', 8889): [Errno 98] Address already in use)` errors (particularly on Ubuntu 18.04 for some reason) if the server shutdown takes too long. This is because `kill` is not a blocking command.
Therefore, `timeout` is used to wait up to 60 seconds for existing processes to terminate before issuing a SIGKILL. Because `wait` is a Bash builtin, it cannot be used with `timeout`; therefore, `while` + `sleep` is used instead. For routines using `kill`, `kill -0` is used to query the existence of the terminating process. For routines using `fuser`, `fuser -s` is used instead.
Given `start-orchestration.sh` is a Bash script and `kill` is a Bash builtin command, `kill` should be available even on Windows distros. Therefore, this PR also replaces `taskkill` with `kill` for consistency.
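The SIGTERM, wait, then SIGKILL routine described in this original description can be sketched as a standalone function. This is a hedged illustration rather than the PR's code: the function name `terminate_gracefully` and its parameters are hypothetical, and it counts seconds in a loop instead of wrapping the poll in the `timeout` utility as the diff does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and parameters are illustrative, not from the PR).
# Usage: terminate_gracefully PID [TIMEOUT_SECONDS]
terminate_gracefully() {
  local pid="$1"
  local timeout_secs="${2:-60}"
  local waited=0

  # Plain `kill` sends SIGTERM and returns immediately; it does not block.
  # If the process is already gone, there is nothing left to do.
  kill "$pid" 2>/dev/null || return 0

  # `kill -0` sends no signal; it only reports whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_secs" ]; then
      # Grace period expired: escalate to SIGKILL.
      kill -9 "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A call such as `terminate_gracefully "$pid" 60` would mirror the 60-second grace period from the description. The polling loop is the part the final PR drops in favor of sending SIGKILL directly.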