Additional logging for CI #1156

ikreymer · 2023-09-08T16:51:39Z

Additional logging for debugging failed crawls on CI:

- Print the previous container logs for failed pods
- Also print pod status
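A minimal sketch of how a CI helper could collect these diagnostics, assuming `kubectl` is available on the runner. The pod dict shape (`name`, `phase`, `restarts`), the default namespace, and both function names are hypothetical; the PR does not show its actual test code. Commands run with `check=False` so that a missing previous log (the container may never have restarted) does not abort the test run.

```python
import subprocess
from typing import Dict, List


def failed_pod_log_commands(
    pods: List[Dict], namespace: str = "crawlers", tail: int = 100
) -> List[List[str]]:
    """Build kubectl commands that dump status and logs for failed/restarted pods.

    Assumption: each pod is a dict with "name", "phase" and "restarts" keys.
    """
    commands: List[List[str]] = []
    for pod in pods:
        restarts = pod.get("restarts", 0)
        # Skip healthy pods: not failed and never restarted.
        if pod.get("phase") != "Failed" and restarts == 0:
            continue
        name = pod["name"]
        # Full pod status: phase, container states, restart counts.
        commands.append(["kubectl", "get", "pod", name, "-n", namespace, "-o", "yaml"])
        # Last lines of the current container's log.
        commands.append(["kubectl", "logs", name, "-n", namespace, f"--tail={tail}"])
        # The previous (crashed) container's log only exists after a restart.
        if restarts > 0:
            commands.append(
                ["kubectl", "logs", name, "-n", namespace, "--previous", f"--tail={tail}"]
            )
    return commands


def dump_failed_pod_diagnostics(pods: List[Dict], namespace: str = "crawlers", tail: int = 100) -> None:
    """Run the commands above, printing output without failing on errors."""
    for cmd in failed_pod_log_commands(pods, namespace, tail):
        # check=False: a missing previous log should not abort the test run.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(" ".join(cmd))
        print(result.stdout or result.stderr)
```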

set 'max_crawl_scale' in values.yaml to indicate the maximum possible scale. It is used to create the crawl-instance-{0, N} priority classes, each with a lower priority than the previous one. This allows crawl instance 0 to preempt crawls with more instances (and lower priorities): e.g. the 2nd instance of a crawl can preempt the 3rd instance of another, and a new crawl (1st instance) can preempt the 2nd instance of another crawl.
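A sketch of the priority-class scheme as Kubernetes `PriorityClass` manifests, built as plain dicts. It assumes one class per instance index from 0 to `max_crawl_scale - 1` and a hypothetical base value of 1000; the chart's real values are not shown in this PR.

```python
from typing import Dict, List


def crawl_priority_classes(max_crawl_scale: int, base_value: int = 1000) -> List[Dict]:
    """Return one PriorityClass manifest per crawler instance index.

    Assumption: base_value is illustrative; only the decreasing order matters.
    """
    classes = []
    for i in range(max_crawl_scale):
        classes.append(
            {
                "apiVersion": "scheduling.k8s.io/v1",
                "kind": "PriorityClass",
                "metadata": {"name": f"crawl-instance-{i}"},
                # Higher instance index -> lower priority, so instance 0 of a new
                # crawl outranks (and may preempt) instance 1+ of other crawls.
                "value": base_value - i,
                "globalDefault": False,
                "description": f"Priority for crawler instance {i}",
            }
        )
    return classes
```

Because Kubernetes only preempts pods of strictly lower priority, instances with the same index in different crawls do not preempt each other.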

- ensure redis pod is deleted last
- start deletion in background as soon as crawl is done
- operator may call finalizer with old state: if not finished but in finalizer, attempt to cancel, and throw 400 if already canceled
- recreate redis in finalizer from yaml to avoid change event
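The deletion order used when finalizing (crawler pods first, then redis, then all PVCs) can be sketched as a staged finalize step. This assumes Metacontroller semantics, where children omitted from a hook response are deleted and a `finalized` flag reports completion; children are plain name strings here, and `finalize_step` is a hypothetical name, not the operator's code.

```python
from typing import Dict, List


def finalize_step(crawler_pods: List[str], redis_pods: List[str], pvcs: List[str]) -> Dict:
    """Return the children to keep for this finalize call and whether we are done.

    Assumption: any existing child left out of "children" gets deleted.
    """
    if crawler_pods:
        # Stage 1: drop crawler pods; keep redis and PVCs for now.
        return {"children": redis_pods + pvcs, "finalized": False}
    if redis_pods:
        # Stage 2: crawlers are gone, so redis (deleted last among pods) can go.
        return {"children": pvcs, "finalized": False}
    # Stage 3: drop PVCs; finalizing completes only once none remain.
    return {"children": [], "finalized": not pvcs}
```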

- support reconciling desired and actual scale
- if desired scale is lower, attempt to gracefully shut down each instance via new redis 'stopone' key
- once each instance above the desired scale exits successfully, adjust status.scale down to clean up pods
- also clean up redis per-instance state when scaling down
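The scale-down reconciliation can be sketched as a pure function. The returned indices are the instances that would be signaled through the 'stopone' key; the Redis call itself is left out because its key layout is not given here. The function name is hypothetical, and applying scale-ups immediately is an assumption, since the commit only describes scaling down.

```python
from typing import Iterable, List, Tuple


def reconcile_scale(desired: int, status_scale: int, exited_ok: Iterable[int]) -> Tuple[List[int], int]:
    """Return (instance indices to ask to stop, new status.scale)."""
    if desired >= status_scale:
        # Assumption: scaling up (or no change) takes effect immediately.
        return [], desired
    done = set(exited_ok)
    # Instances with index >= desired are surplus and should exit gracefully.
    pending = [i for i in range(desired, status_scale) if i not in done]
    if pending:
        # Keep the current scale until every surplus instance has exited cleanly.
        return pending, status_scale
    # All surplus instances finished: lower status.scale so their pods get removed.
    return [], desired
```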

- pods are explicitly deleted if spec.restartTime != status.restartTime, then status.restartTime is updated
- use force_restart to remove pods for one sync response to force deletion
- update to latest metacontroller v4.11.0
- add --restartOnError flag for crawler
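The restart mechanism can be sketched as follows, assuming Metacontroller deletes any child left out of a sync response. The `sync_children` name and the plain-dict spec/status are hypothetical; only the `restartTime` comparison and the one-sync omission of pods come from the commit.

```python
from typing import Any, Dict, List


def sync_children(
    spec: Dict[str, Any], status: Dict[str, Any], pods: List[Dict], others: List[Dict]
) -> List[Dict]:
    """Return desired children for one sync, omitting pods once to force a restart."""
    force_restart = spec.get("restartTime") != status.get("restartTime")
    if force_restart:
        # Record the handled restart so the next sync returns the pods again.
        status["restartTime"] = spec.get("restartTime")
        # Leaving pods out of this single response causes them to be deleted.
        return list(others)
    return list(pods) + list(others)
```

On the following sync the times match again, so the pods are returned and recreated.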

Co-authored-by: Tessa Walsh <[email protected]>

ikreymer · 2023-09-11T15:57:44Z

This branch and its changes have been merged into #1149.

ikreymer and others added 17 commits September 1, 2023 00:27

convert operator to control pods/pvcs directly

deed92d

add redis_storage param

update templates

cf5c194

finalizing: first remove crawler pods, then redis, then all pvcs

ad6ffce

templates: add required tolerations to templates to avoid recreate loop

278fb82

redis: pause redis container (set initRedis to false) if no crawlers have been running for >60 seconds, not immediately

d7202a6

profilebrowser: update operator to manage pod directly without statefulset

51aa164

resource usage: add resources to crawljob status

59052e2

- add placeholder for adding podmetrics as related resources
- fix canceled condition

cleanup:

b95365c

- async add_crawl_errors_to_db() call creates its own redis connection, as the other one is supposed to be closed by the caller
- remove unneeded 'sync_db_state_if_finished'
- delete job after crawl finished tasks
- log if crawl finished but not yet deleted on next update

Merge branch 'main' into op-pod

5302fad

chart: set redis storage default to 3Gi

d356b14

Merge branch 'main' into op-pod

0b49bbc

Update backend/btrixcloud/operator.py

0c6f24c

Co-authored-by: Tessa Walsh <[email protected]>

Update backend/btrixcloud/operator.py

fb95872

Co-authored-by: Tessa Walsh <[email protected]>

ikreymer requested a review from tw4l September 8, 2023 16:51

ikreymer added 3 commits September 8, 2023 10:12

improve logging for tests

263c5e2

cancel crawl test: just wait until page is found, not necessarily done
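Waiting until a page is found (rather than done) amounts to polling a condition with a timeout. A minimal sketch, where `wait_until` is a hypothetical helper and the predicate in the usage note stands in for whatever crawl-stats lookup the test actually uses:

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll predicate() until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, a cancel test might call `wait_until(lambda: get_stats(crawl_id)["found"] > 0)` before canceling, where `get_stats` and its `"found"` field are hypothetical.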

more failed pod logging:

31db656

- print previous log
- print pod statuses for failed crawls

debug: if logging last lines of failed container, don't restart

b2856c8

to ensure the original log is captured (previous container logs may not be available)

ikreymer force-pushed the more-failed-logging branch from 139ec67 to b2856c8 September 8, 2023 17:25

ikreymer added 8 commits September 8, 2023 14:10

ci: disable frontend, use 1 browser instance

947a776

enable node port on backend if no frontend

8c49125

reenable frontend

180764d

remove disabling frontend

4b2b629

undo

a277ee3

reset

799d0cb

need at least 1 done page

68ad7eb

tweak tests

44b690c

ikreymer added 13 commits September 8, 2023 20:48

exclude community page

3640509

set auto_add crawl limit to 1

f23cdf6

use dev image

1bf2a04

refactor: use cancel_crawl() for canceled/failed crawls

e8a1fc5

if logging lines/no restart mode, fail crawl when pod fails

tests: don't run in loop if test fails, catch log exceptions

184db85

cleanup, replace cancel_crawl() with fail_crawl(), cancel crawl simply handled via mark_finished

4406656

print redis errors also

c3b60ac

fix logging

73a1ce7

fix lint

bd4d654

keep OnFailure restart policy, interrupted pods incorrectly considered failed!

4943af5

disable disk_utilization_threshold to avoid unnecessary interrupts

test default settings?

845caec

test with latest released image

f4827c3

remove unused logging

610d0af

ikreymer closed this Sep 11, 2023

ikreymer deleted the more-failed-logging branch September 11, 2023 15:57