fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

skillcoder · 2024-05-08T08:59:12Z

Changes

fix 2 problems after merge #1000

unexpected jenkins restarts because of race in find (read dir and stat of file) (avoid searching in subfolders)
incorrect backup num determine, because search also in subfolders (avoid searching in subfolders)
always test latest backup image in e2e

Also following changes in run.sh

refactor required ENV checks
change backup container logs
- add warning on undefined BACKUP_COUNT on startup
- show backup cleanup interval on startup
- show logs only on removing backups
- show removed backup files names
extract list of exceeding backups into function
change backup cleanup interval 10s -> 60s
speedup tests by introducing BACKUP_CLEANUP_INTERVAL

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
Includes docs (if user facing)
Commit messages follow commit message best practices

See the contribution guide for more details.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS.

Release Notes

- Bug fixes
  - avoid unexpected jenkins restarts due to race in `find` in backup cleanup `run.sh` script
  - proper determine latest backup number (do not search in /backup subfolders)
- Any changes in behavior
  - now we do not search for backups in subfolders
  - backup container produce less logs, only valuable information
  - backup container logs now consist list of each deleted backups
  - backup cleanup interval changed from 10s to 60s
  - added warning in case of missing BACKUP_COUNT ENV

skillcoder · 2024-05-08T09:02:33Z

@brokenpip3 Could you please take a look.
Yesterday I've noticed my changes in #1000 introduced 1 issue and one bug.
This is a fix for these issues and small refactoring + implemented TODO

Probably I also need to add a tests for these cases.

brokenpip3 · 2024-05-08T10:42:18Z

I will take a look as soon it will be possible, but before that: in PR #1000 you only changed the make tmp dir location, nothing else, why it could be an issue with find and max depth?
Again I still need to check the code

brokenpip3 · 2024-05-08T10:43:26Z

(In general: don't worry it's not the default version in the chart/code, I was planning to ship it with a 0.8.1 version before merging the 0.9.0 PR that is ready, just need to fix the tests)

skillcoder · 2024-05-08T11:06:11Z

Let me explain the find race problem and why it is directly related to changes in #1000.

TL;DR; After #1000 temp directory and new backup archives creating in /backup subfolders which is directly affected find behavior in get-latest.sh and run.sh, also since in run.sh we use set -eo pipefail any non-zero exit code cause container restart and operator recreate the pod with jenkins master on every such restart.

find command by design have race in case of directory removed during find works, this is because find read directory and after try to stat files in the dir, and if this dir removed in the middle, find show error and at the end return non-zero exit code.
And actually temp directories could be removed in parallel with cleanup process.

Also I found another critical problem related to incomplete backup, in case of jenkins pods restarts current backup could be left in temp folder and get-latest.sh will use this incomplete backup filename to restore backup during jenkins startup.

This PR fix these 2 problems, but still need to implement automatic cleanup for incomplete backups.

Also seems I found critical security problem in operator release pipeline.

brokenpip3 · 2024-05-08T16:47:19Z

After #1000 temp directory and new backup archives creating in /backup subfolders which is directly affected find behavior in get-latest.sh and run.sh, also since in run.sh we use set -eo pipefail any non-zero exit code cause container restart and operator recreate the pod with jenkins master on every such restart.

yes it make sense indeed, I remember we limited the find in the past with maxdepth, not sure why it's disappeared (will check the commits).
Thanks a lot for this and for fixing the tests, however this week and the next one it will be hard for me to digest the changes and review the PR. Will try my best

skillcoder · 2024-05-08T21:13:24Z

Here the scripts to reproduce race condition in find which we had in our production.
Need to run them in parallel
del_loop.sh

#!/usr/bin/env bash

set -eo pipefail

BACKUP_DIR=./backup

while true;
do
  BACKUP_TMP_DIR=$(mktemp -d --tmpdir="${BACKUP_DIR}")
  touch "${BACKUP_TMP_DIR}/6.tar.zstd"
  rm -rf "${BACKUP_TMP_DIR}"
done

find.sh

#!/usr/bin/env bash

set -eo pipefail

BACKUP_DIR=./backup
BACKUP_COUNT=3

touch "${BACKUP_DIR}"/1.tar.zstd
touch "${BACKUP_DIR}"/2.tar.zstd
touch "${BACKUP_DIR}"/3.tar.zstd
touch "${BACKUP_DIR}"/4.tar.zstd
touch "${BACKUP_DIR}"/5.tar.zstd

while true;
do
  find ${BACKUP_DIR} -maxdepth 1 -name '*.tar.zstd' -exec basename {} \; | sort -gr | tail -n +$((BACKUP_COUNT +1))
  echo
done

Probably need to add such test, but I'm not sure how to run several process in parallel, probably it is possible only with &

brokenpip3 · 2024-05-12T09:19:48Z

Probably need to add such test, but I'm not sure how to run several process in parallel, probably it is possible only with &

we can use gnu parallel, however we have several test around the backup in multiple forms, so I'm satisfied around the current coverage.

brokenpip3

Thanks for your contribution 🙇

brokenpip3 · 2024-05-12T09:30:17Z

There were some issues in the tests, will take a look

brokenpip3 · 2024-06-03T08:30:33Z

Sorry I lost this 2 comments:

This PR fix these 2 problems, but still need to implement automatic cleanup for incomplete backups.

what you mean? we have this trap for that

Also seems I found critical security problem in operator release pipeline.

what you mean? you mean the vulnerabilities that the latest version of the operator contains? those are related to golang and operator version, for that I'm doing the necessary update here #954, it's done, I just need to fix some golang ci-lint issues and some tests. I'm planning to work on that and have something working by the end of June.
If instead you found something else and you want to make a private disclosure you can write me asap at brokenpip3[at]gmail[dot]com

brokenpip3 · 2024-06-03T10:00:34Z

I pushed another change: by default during the e2e tests that backup image used will be the latest one build from source.
This will allow us to always test the operator backup operations during a PR/merge.

Beside this, if you can reply to the previous questions the PR looks good to me, I can merge it after that.
(Planning to release a 0.8.1 with this and other changes I made asap)

brokenpip3 · 2024-06-11T21:59:36Z

Beside this, if you can reply to the previous questions the PR looks good to me, I can merge it after that.
(Planning to release a 0.8.1 with this and other changes I made asap)

@skillcoder gentle ping :)

brokenpip3 · 2024-06-22T07:46:59Z

I'm going to merge it, thanks for your contribution!

fix(backup): find race and disable search in subdir

fbfcfe9

skillcoder added 2 commits May 8, 2024 11:26

fix and improve backup related tests

1e03955

fix is_backup_not_exist

6e1099b

handle empty backup dir cleanup

66136e4

skillcoder added 3 commits May 8, 2024 16:28

show container logs in limit_backup_count_no_backups test

702a911

fix find_exceeding_backups

6349e1d

fix warning in find_exceeding_backups

60e297d

bump version, shellcheck, cosmetic changes

1ecd21e

brokenpip3 approved these changes May 12, 2024

View reviewed changes

Merge remote-tracking branch 'origin/master' into review-1011

149cb16

brokenpip3 added 2 commits June 3, 2024 11:06

bump minor version of backup img after this change

965d0fd

always test latest backup in e2e

86d4155

brokenpip3 changed the title ~~fix(backup): find race and disable search in subdir~~ fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x Jun 3, 2024

brokenpip3 self-requested a review June 3, 2024 10:02

brokenpip3 merged commit f98b0bc into jenkinsci:master Jun 22, 2024
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

skillcoder commented May 8, 2024 •

edited by brokenpip3

Loading

skillcoder commented May 8, 2024

brokenpip3 commented May 8, 2024

brokenpip3 commented May 8, 2024

skillcoder commented May 8, 2024

brokenpip3 commented May 8, 2024

skillcoder commented May 8, 2024

brokenpip3 commented May 12, 2024

brokenpip3 left a comment

brokenpip3 commented May 12, 2024

brokenpip3 commented Jun 3, 2024 •

edited

Loading

brokenpip3 commented Jun 3, 2024

brokenpip3 commented Jun 11, 2024

brokenpip3 commented Jun 22, 2024

fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

fix(backup): find race and disable search in subdir, always test latest backup image in e2e, bump backup to 0.3.x #1011

Conversation

skillcoder commented May 8, 2024 • edited by brokenpip3 Loading

Changes

Submitter Checklist

Reviewer Notes

Release Notes

skillcoder commented May 8, 2024

brokenpip3 commented May 8, 2024

brokenpip3 commented May 8, 2024

skillcoder commented May 8, 2024

brokenpip3 commented May 8, 2024

skillcoder commented May 8, 2024

brokenpip3 commented May 12, 2024

brokenpip3 left a comment

Choose a reason for hiding this comment

brokenpip3 commented May 12, 2024

brokenpip3 commented Jun 3, 2024 • edited Loading

brokenpip3 commented Jun 3, 2024

brokenpip3 commented Jun 11, 2024

brokenpip3 commented Jun 22, 2024

skillcoder commented May 8, 2024 •

edited by brokenpip3

Loading

brokenpip3 commented Jun 3, 2024 •

edited

Loading