Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

Merged

Conversation

skillcoder
Copy link

@skillcoder skillcoder commented May 8, 2024

Changes

fix 2 problems after merge #1000

  • unexpected jenkins restarts because of race in find (read dir and stat of file) (avoid searching in subfolders)
  • incorrect backup num determine, because search also in subfolders (avoid searching in subfolders)
  • always test latest backup image in e2e

Also following changes in run.sh

  • refactor required ENV checks
  • change backup container logs
    • add warning on undefined BACKUP_COUNT on startup
    • show backup cleanup interval on startup
    • show logs only on removing backups
    • show removed backup files names
  • extract list of exceeding backups into function
  • change backup cleanup interval 10s -> 60s
  • speedup tests by introducing BACKUP_CLEANUP_INTERVAL

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS.

Release Notes

- Bug fixes
  - avoid unexpected jenkins restarts due to race in `find` in backup cleanup `run.sh` script
  - proper determine latest backup number (do not search in /backup subfolders)
- Any changes in behavior
  - now we do not search for backups in subfolders
  - backup container produce less logs, only valuable information
  - backup container logs now consist list of each deleted backups
  - backup cleanup interval changed from 10s to 60s
  - added warning in case of missing BACKUP_COUNT ENV

@skillcoder
Copy link
Author

@brokenpip3 Could you please take a look.
Yesterday I've noticed my changes in #1000 introduced 1 issue and one bug.
This is a fix for these issues and small refactoring + implemented TODO

Probably I also need to add a tests for these cases.

@brokenpip3
Copy link
Collaborator

I will take a look as soon it will be possible, but before that: in PR #1000 you only changed the make tmp dir location, nothing else, why it could be an issue with find and max depth?
Again I still need to check the code

@brokenpip3
Copy link
Collaborator

(In general: don't worry it's not the default version in the chart/code, I was planning to ship it with a 0.8.1 version before merging the 0.9.0 PR that is ready, just need to fix the tests)

@skillcoder
Copy link
Author

Let me explain the find race problem and why it is directly related to changes in #1000.

TL;DR; After #1000 temp directory and new backup archives creating in /backup subfolders which is directly affected find behavior in get-latest.sh and run.sh, also since in run.sh we use set -eo pipefail any non-zero exit code cause container restart and operator recreate the pod with jenkins master on every such restart.

find command by design have race in case of directory removed during find works, this is because find read directory and after try to stat files in the dir, and if this dir removed in the middle, find show error and at the end return non-zero exit code.
And actually temp directories could be removed in parallel with cleanup process.

Also I found another critical problem related to incomplete backup, in case of jenkins pods restarts current backup could be left in temp folder and get-latest.sh will use this incomplete backup filename to restore backup during jenkins startup.

This PR fix these 2 problems, but still need to implement automatic cleanup for incomplete backups.

Also seems I found critical security problem in operator release pipeline.

@brokenpip3
Copy link
Collaborator

After #1000 temp directory and new backup archives creating in /backup subfolders which is directly affected find behavior in get-latest.sh and run.sh, also since in run.sh we use set -eo pipefail any non-zero exit code cause container restart and operator recreate the pod with jenkins master on every such restart.

yes it make sense indeed, I remember we limited the find in the past with maxdepth, not sure why it's disappeared (will check the commits).
Thanks a lot for this and for fixing the tests, however this week and the next one it will be hard for me to digest the changes and review the PR. Will try my best

@skillcoder
Copy link
Author

Here the scripts to reproduce race condition in find which we had in our production.
Need to run them in parallel
del_loop.sh

#!/usr/bin/env bash

set -eo pipefail

BACKUP_DIR=./backup

while true;
do
  BACKUP_TMP_DIR=$(mktemp -d --tmpdir="${BACKUP_DIR}")
  touch "${BACKUP_TMP_DIR}/6.tar.zstd"
  rm -rf "${BACKUP_TMP_DIR}"
done

find.sh

#!/usr/bin/env bash

set -eo pipefail

BACKUP_DIR=./backup
BACKUP_COUNT=3

touch "${BACKUP_DIR}"/1.tar.zstd
touch "${BACKUP_DIR}"/2.tar.zstd
touch "${BACKUP_DIR}"/3.tar.zstd
touch "${BACKUP_DIR}"/4.tar.zstd
touch "${BACKUP_DIR}"/5.tar.zstd

while true;
do
  find ${BACKUP_DIR} -maxdepth 1 -name '*.tar.zstd' -exec basename {} \; | sort -gr | tail -n +$((BACKUP_COUNT +1))
  echo
done

Probably need to add such test, but I'm not sure how to run several process in parallel, probably it is possible only with &

@brokenpip3
Copy link
Collaborator

Probably need to add such test, but I'm not sure how to run several process in parallel, probably it is possible only with &

we can use gnu parallel, however we have several test around the backup in multiple forms, so I'm satisfied around the current coverage.

Copy link
Collaborator

@brokenpip3 brokenpip3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution 🙇

@brokenpip3
Copy link
Collaborator

There were some issues in the tests, will take a look

@brokenpip3
Copy link
Collaborator

brokenpip3 commented Jun 3, 2024

Sorry I lost this 2 comments:

This PR fix these 2 problems, but still need to implement automatic cleanup for incomplete backups.

what you mean? we have this trap for that

Also seems I found critical security problem in operator release pipeline.

what you mean? you mean the vulnerabilities that the latest version of the operator contains? those are related to golang and operator version, for that I'm doing the necessary update here #954, it's done, I just need to fix some golang ci-lint issues and some tests. I'm planning to work on that and have something working by the end of June.
If instead you found something else and you want to make a private disclosure you can write me asap at brokenpip3[at]gmail[dot]com

@brokenpip3 brokenpip3 changed the title fix(backup): find race and disable search in subdir fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x Jun 3, 2024
@brokenpip3
Copy link
Collaborator

I pushed another change: by default during the e2e tests that backup image used will be the latest one build from source.
This will allow us to always test the operator backup operations during a PR/merge.

Beside this, if you can reply to the previous questions the PR looks good to me, I can merge it after that.
(Planning to release a 0.8.1 with this and other changes I made asap)

@brokenpip3 brokenpip3 self-requested a review June 3, 2024 10:02
@brokenpip3
Copy link
Collaborator

Beside this, if you can reply to the previous questions the PR looks good to me, I can merge it after that.
(Planning to release a 0.8.1 with this and other changes I made asap)

@skillcoder gentle ping :)

@brokenpip3
Copy link
Collaborator

I'm going to merge it, thanks for your contribution!

@brokenpip3 brokenpip3 merged commit f98b0bc into jenkinsci:master Jun 22, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants