Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #678 - Fix single-shot mode always exiting in error #679

Merged
merged 10 commits into from
Feb 7, 2024

Conversation

nicolasochem
Copy link
Contributor

@nicolasochem nicolasochem commented Aug 22, 2023

We added code in #653 to have TRD send a non-zero exit status when anything wrong happens - including payout failures, or (relevant here) keyboard interrupts.

However, it turns out that when using TRD in runmode -2, -3, -4, the regular course of operations is to have the producer thread send a KeyboardInterrupt when done... which is then treated as error!

This is done with _thread.interrupt_main(). It's bad: humans send interrupts, programs should not. Instead, I'm suggesting that the producer use SIGUSR1 to trigger a normal interruption.

This triggers the same code path as today, just with an exit status of SUCCESS. So I expect that the main thread will still wait for the consumer, and it won't actually exit before payouts are done, if there are any.

This allows kubernetes bakers (for example) to use a k8s cronjob and get alerted automatically in case of payment failures.

For the record, I still believe this entire producer/consumer logic should be thrown away and we should simply do things in order (calculate, then pay). I've ranted about this before (#491).

Work effort: 3 hours

We added code in #653 to have TRD send a non-zero exit status when
anything wrong happens - including payout failures, or (relevant here)
keyboard interrupts.

However, it turns out that when using TRD in runmode -2, -3, -4, the
regular course of operations is to have the producer thread send a
KeyboardInterrupt when done... which is then treated as error!

This is done with `_thread.interrupt_main()`. It's bad: humans send
interrupts, programs should not. Instead, I'm suggesting that the
producer use `SIGUSR1` to trigger a normal interruption.

This triggers the same code path as today, just with an exit status of
SUCCESS. So I expect that the main thread will still wait for the
consumer, and it won't actually exit before payouts are done, if there
are any.

I'll deploy in a small baker and see if that's true. I have already
verified that when no payouts are due, exit code is now zero.

For the record, I still believe this entire producer/consumer logic
should be thrown away and we should simply do things in order
(calculate, then pay). I've ranted about this before (#491).
@jmo-staked
Copy link

Hmm there might still be an error in the logic, we tested this with a payout address that had insufficient funds (so we'd expect it to fail), and it did fail and logged that the exit code should be 6 (INSUFFICIENT_FUNDS = 6), but the overall exit code was SUCCESS = 0. Maybe that 6 just isn't bubbling up properly?

2023-08-23 20:11:49,438 - consumer0 - INFO - Total estimated amount to pay out is 2,730,732,539 mutez.
2023-08-23 20:11:49,439 - consumer0 - INFO - 2580 payments will be done in 11 batches
2023-08-23 20:11:49,439 - consumer0 - INFO - Current balance in payout address is 0 mutez.
2023-08-23 20:11:49,440 - consumer0 - ERROR - Payment attempt failed because of insufficient funds in the payout address. The current balance of 0 mutez is insufficient to pay for cycle rewards of 2,730,732,539 mutez.
2023-08-23 20:11:49,440 - consumer0 - INFO - [Plugins] Not sending notification; no plugins enabled
2023-08-23 20:11:49,442 - consumer0 - INFO - Processing completed for 2585 payment items, 2580 failed.
2023-08-23 20:11:49,444 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/done/643.csv'
2023-08-23 20:11:49,452 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/failed/643.csv'
2023-08-23 20:11:49,611 - consumer0 - INFO - Anonymous statistics disabled, (Dry run)
2023-08-23 20:11:49,611 - consumer0 - INFO - Unknown Error at payment consumer. Please consult the verbose logs! Exit code: 6
2023-08-23 20:11:53,046 - producer  - INFO - Reward creation is done for cycle 643, created 3328 rewards.
2023-08-23 20:11:53,047 - producer  - INFO - Run mode ONETIME satisfied. Terminating...
2023-08-23 20:11:53,047 - producer  - INFO - TRD Exit triggered by producer Exit code: 0
2023-08-23 20:11:53,047 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-23 20:11:53,048 - MainThread - INFO - TRD is shutting down...
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Sensitive operations are in progress!
2023-08-23 20:11:53,048 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Shutdown. Exit code: 0

@nicolasochem
Copy link
Contributor Author

@jmo-staked good catch! I just pushed a fix, can you please try again?

@jmo-staked
Copy link

jmo-staked commented Aug 24, 2023

That's looking better, the consumer fails with a 6, the producer exits with a 0, and the app itself exits with a 1. But all that is really critical is the Zero-or-Notzero nature of shell exit codes so this patch looks like it'll work fine for us, thanks again!

2023-08-24 00:48:25,910 - consumer0 - INFO - Anonymous statistics disabled, (Dry run)
2023-08-24 00:48:25,910 - consumer0 - INFO - Unknown Error at payment consumer. Please consult the verbose logs! Exit code: 6
2023-08-24 00:48:29,458 - producer  - INFO - Reward creation is done for cycle 643, created 3328 rewards.
2023-08-24 00:48:29,459 - producer  - INFO - Run mode ONETIME satisfied. Terminating...
2023-08-24 00:48:29,459 - MainThread - INFO - Application stop handler called: 12
2023-08-24 00:48:29,459 - producer  - INFO - TRD Exit triggered by producer Exit code: 0
2023-08-24 00:48:29,459 - MainThread - INFO - TRD is shutting down...
2023-08-24 00:48:29,460 - MainThread - INFO - --------------------------------------------------------
2023-08-24 00:48:29,460 - MainThread - INFO - Sensitive operations are in progress!
2023-08-24 00:48:29,460 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-24 00:48:29,460 - MainThread - INFO - --------------------------------------------------------
2023-08-24 00:48:29,460 - MainThread - INFO - Shutdown due to error! Exit code: 1
$ kubectl describe po ...
State:          Terminated
  Reason:       Error
  Exit Code:    1

@nicolasochem
Copy link
Contributor Author

yes I'm using two signals, SIGUSR1 indicates success and SIGUSR2 is failure... it's fine.

Does it return 0 when payout succeeds?

@jmo-staked
Copy link

We won't have a payout to test that against until cycle 644 completes

@jmo-staked
Copy link

Two more tests passed swimmingly, both with exit code 0:

  • Payout for cycle 644 completed fine
  • Rerun of cycle 644, with a funded wallet, exited 0 with no payout since it had already been completed

@jmo-staked
Copy link

Might have hit one minor other exit error, in this case the MainThread exit code was logged incorrectly:

2023-08-29 16:00:53,401 - MainThread - INFO - [Plugins] No plugins enabled
2023-08-29 16:00:53,402 - MainThread - INFO - Initial cycle set to -1
2023-08-29 16:00:53,411 - MainThread - INFO - Application is READY!
2023-08-29 16:00:53,411 - MainThread - INFO - --------------------------------------------
2023-08-29 16:00:54,489 - producer  - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2023-08-29 16:00:54,490 - consumer0 - WARNING - Exit signal received. Terminating...
2023-08-29 16:00:54,490 - producer  - INFO - TRD Exit triggered by producer Exit code: 8
2023-08-29 16:00:54,490 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-29 16:00:54,491 - MainThread - INFO - TRD is shutting down...
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Sensitive operations are in progress!
2023-08-29 16:00:54,492 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Lock file removed!
2023-08-29 16:00:54,493 - MainThread - INFO - Shutdown. Exit code: 0
$ kubectl describe po ...
    State:          Terminated
      Reason:       Error
      Exit Code:    1

MainThread reports it's exiting with code 0, even though the app is exiting due to error, but the actual exit signal k8s sees is 1 which is good, so the job correctly restarted.

@nicolasochem
Copy link
Contributor Author

Might have hit one minor other exit error, in this case the MainThread exit code was logged incorrectly:

2023-08-29 16:00:53,401 - MainThread - INFO - [Plugins] No plugins enabled
2023-08-29 16:00:53,402 - MainThread - INFO - Initial cycle set to -1
2023-08-29 16:00:53,411 - MainThread - INFO - Application is READY!
2023-08-29 16:00:53,411 - MainThread - INFO - --------------------------------------------
2023-08-29 16:00:54,489 - producer  - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2023-08-29 16:00:54,490 - consumer0 - WARNING - Exit signal received. Terminating...
2023-08-29 16:00:54,490 - producer  - INFO - TRD Exit triggered by producer Exit code: 8
2023-08-29 16:00:54,490 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-29 16:00:54,491 - MainThread - INFO - TRD is shutting down...
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Sensitive operations are in progress!
2023-08-29 16:00:54,492 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Lock file removed!
2023-08-29 16:00:54,493 - MainThread - INFO - Shutdown. Exit code: 0
$ kubectl describe po ...
    State:          Terminated
      Reason:       Error
      Exit Code:    1

MainThread reports it's exiting with code 0, even though the app is exiting due to error, but the actual exit signal k8s sees is 1 which is good, so the job correctly restarted.

I don't think this is true. I think that actually, in case of tzkt unresponsive, the app exited succesfully, which is not what should happen. I have now modified the code to propagate producer errors (unresponsive tzkt, disk full etc)... to the main thread. @jmo-staked @jdsika could you please check urgently?

@jdsika jdsika added the critical Critical label Feb 7, 2024
@jdsika jdsika added this to the v12.0 (Nairobi) milestone Feb 7, 2024
@jdsika jdsika merged commit 587237a into master Feb 7, 2024
10 checks passed
@jdsika jdsika deleted the graceful_exit_onetime branch February 7, 2024 08:37
@nicolasochem
Copy link
Contributor Author

@jdsika pointed out that these signals are not available on windows, more work is needed

https://stackoverflow.com/questions/16429480/what-is-equivalent-to-sigusr1-2-signals-in-windows-using-python

nicolasochem added a commit that referenced this pull request Oct 1, 2024
#679
introduced support for exit codes, so an alert can be sent in
single-shot mode when payouts fail for any reason.

However, it was crude, only supporting exit code 1.

The producer thread supports many exit codes.

In this case, there is a benign issue where tzkt returns "not synced"
and therefore payouts fail, but this is likely temporary and will pass
at next try, so there is no need to alert. But, currently it's not
possible to behave differently based on the exit code because it's
always 0 or 1.

An ugly solution is to save the exit code of the child thread in a file,
then read it in the main thread. That's what I am doing here. I remain
convinced that the entire thread architecture needs to go away, and we
need to make TRD single threaded again, but that's for another day.

Also:
* change the exit code of misconfigured provider to GENERAL_ERROR
  because it's not really a provider error,
* change the help to remove old providers that we don't support anymore
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
critical Critical
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants