Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-2991 killport: revert to using SIGKILL #512

Merged
merged 3 commits into from
Oct 1, 2024

Conversation

eramongodb
Copy link
Contributor

@eramongodb eramongodb commented Oct 1, 2024

Important

After external discussion, agreed to revert to using SIGKILL instead of dealing with SIGTERM + wait complexity. PR updated accordingly; prior PR description preserved below.


This PR adds routines to wait up to 1 minute for existing processes to gracefully terminate before sending a SIGKILL.

#506 downgraded signals from SIGKILL to SIGTERM to allow existing processes (usually a MongoDB server instance) to gracefully terminate. However, this leads to flaky OSError: No socket could be created -- (('127.0.0.1', 8889): [Errno 98] Address already in use) errors (particularly on Ubuntu 18.04 for some reason) if the server shutdown takes too long. This is because kill is not a blocking command.

Therefore, timeout is used to wait up to 60 seconds for existing processes to terminate before issuing a SIGKILL. Because wait is a Bash builtin, it cannot be used with timeout; therefore, while + sleep is used instead. For routines using kill, kill -0 is used to query the existence of the terminating process. For routines using fuser, fuser -s is used instead.

Given start-orchestration.sh is a Bash script and kill is a Bash builtin command, kill should be available even on Windows distros. Therefore, this PR also replaces taskkill with kill for consistency.

@eramongodb eramongodb self-assigned this Oct 1, 2024
@eramongodb eramongodb force-pushed the det-killport-wait branch 4 times, most recently from 28ca793 to 8f7968d Compare October 1, 2024 17:32
@@ -67,17 +67,21 @@ killport() {

if [[ "${OSTYPE:?}" == cygwin || "${OSTYPE:?}" == msys ]]; then
for pid in $(netstat -ano | grep ":$port .* LISTENING" | awk '{print $5}' | tr -d '[:space:]'); do
taskkill /F /T /PID "$pid" || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/F already force kills so we shouldn't need to wait here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR proposes using kill instead of taskkill to extend graceful termination + timeout->kill to Windows distros as well. Please re-review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revert this now as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of keeping this change as a simplification + consistency improvement given this script is known to be Bash-executed. The less distro-specific patterns we need to depend on, the better.

Copy link
Contributor

@ShaneHarvey ShaneHarvey Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about this one. taskkill /F /T kills the process and all child processes which seems safer than just kill -SIGKILL. I don't want to make windows less stable after this change so I'd prefer we keep it using taskkill

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

done
elif [ -x "$(command -v lsof)" ]; then
for pid in $(lsof -t "-i:$port" || true); do
kill "$pid" || true
timeout 60s bash -c "while kill -0 \"$pid\" 2>/dev/null; do sleep 1; done" || kill -9 "$pid" || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I changed it to SIGTERM was to give MO a chance to cleanup its mongo processes but that doesn't seem to work. If we're going back to kill -9 (SIGKILL) instead of SIGTERM then can we remove the timeout logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of -9 in the while loop was a typo. -0 was meant to be used instead. Please re-review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to just kill -9? What benefit do we get from trying to shutdown gracefully?

Copy link
Contributor Author

@eramongodb eramongodb Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting @nbbeeken from Slack discussions regarding #506:

The old kill commands passed a -9 flag (SIGKILL) before the PID. Default signal for kill without args is SIGTERM (15). Should the new kill commands be passing SIGKILL as well?

The switch to sigterm was intentional because mongo-orch will try to clean up the servers gracefully, but maybe there's a bug in that logic and we should send -9 anyway :blob-sad: you try to be a 'good citizen' and look what it gets you

If we want to revert to the simpler "just SIGKILL" routines we had before, that would also address the flaky "address already in use" problem this PR is trying to solve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Jupyter we use the pattern of SIGTERM, wait on pid, SIGKILL as well. I think it is a good pattern in general.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd prefer SIGKILL at this point because it should save time (don't need to wait 60 seconds) and will make this code easier to read.

@eramongodb eramongodb marked this pull request as ready for review October 1, 2024 17:45
@eramongodb eramongodb changed the title killport: wait up to 1 min for server shutdown before kill-and-continue killport: revert to using SIGKILL Oct 1, 2024
@eramongodb
Copy link
Contributor Author

eramongodb commented Oct 1, 2024

Updated PR so changes are to revert to using SIGKILL instead of SIGTERM + wait.

The taskkill -> kill change is preserved as a simplification and consistency improvement. Reverted per #512 (comment).

Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@eramongodb eramongodb requested review from blink1073 and removed request for nbbeeken October 1, 2024 20:40
@ShaneHarvey ShaneHarvey changed the title killport: revert to using SIGKILL DRIVERS-2991 killport: revert to using SIGKILL Oct 1, 2024
@eramongodb eramongodb merged commit a1bccca into mongodb-labs:master Oct 1, 2024
47 of 48 checks passed
@eramongodb eramongodb deleted the det-killport-wait branch October 1, 2024 21:11
adriandole pushed a commit to adriandole/drivers-evergreen-tools that referenced this pull request Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants