fix #678 - Fix single-shot mode always exiting in error #679
We added code in #653 to have TRD send a non-zero exit status when anything goes wrong, including payout failures or (relevant here) keyboard interrupts.
However, it turns out that when using TRD in runmode -2, -3, -4, the regular course of operations is to have the producer thread send a KeyboardInterrupt when done... which is then treated as an error!
This is done with `_thread.interrupt_main()`. It's bad: humans send interrupts, programs should not. Instead, I'm suggesting that the producer use `SIGUSR1` to trigger a normal interruption.
This triggers the same code path as today, just with an exit status of SUCCESS. So I expect that the main thread will still wait for the consumer, and it won't actually exit before payouts are done, if there are any. I'll deploy on a small baker and see if that's true. I have already verified that when no payouts are due, the exit code is now zero.
For the record, I still believe this entire producer/consumer logic should be thrown away and we should simply do things in order (calculate, then pay). I've ranted about this before (#491).
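To illustrate the proposal above, here is a minimal sketch of a producer thread ending a single-shot run by signalling its own process instead of calling `_thread.interrupt_main()`. It is not TRD's actual code: `ExitState`, `install_handlers`, `producer`, `run_once` and the two exit code values are hypothetical names chosen for this example. It assumes a Unix-like system, because `SIGUSR1` and `SIGUSR2` are not available on Windows.

```python
import os
import signal
import threading
import time

# Hypothetical exit codes for this sketch; TRD defines its own values.
EXIT_SUCCESS = 0
EXIT_FAILURE = 1


class ExitState:
    """Holds the exit code chosen by the signal handlers."""

    def __init__(self):
        self.code = None


def install_handlers(state):
    # signal.signal() must be called from the main thread.
    def on_success(signum, frame):
        state.code = EXIT_SUCCESS

    def on_failure(signum, frame):
        state.code = EXIT_FAILURE

    signal.signal(signal.SIGUSR1, on_success)
    signal.signal(signal.SIGUSR2, on_failure)


def producer(succeeded):
    # Stand-in for the producer thread: when its work is done, it signals
    # the process rather than raising KeyboardInterrupt in the main thread.
    os.kill(os.getpid(), signal.SIGUSR1 if succeeded else signal.SIGUSR2)


def run_once(succeeded):
    state = ExitState()
    install_handlers(state)
    worker = threading.Thread(target=producer, args=(succeeded,))
    worker.start()
    # Python runs signal handlers in the main thread, so poll there
    # until one of the handlers has recorded an exit code.
    while state.code is None:
        time.sleep(0.01)
    worker.join()
    return state.code
```

A real entry point would pass the returned value to `sys.exit()`. In this sketch the handlers only record a code, so any waiting for the consumer to finish payouts would still happen in the main thread before exiting.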
Hmm, there might still be an error in the logic. We tested this with a payout address that had insufficient funds (so we'd expect it to fail), and it did fail and logged that the exit code should be 6, but the shutdown line below reports exit code 0:
2023-08-23 20:11:49,438 - consumer0 - INFO - Total estimated amount to pay out is 2,730,732,539 mutez.
2023-08-23 20:11:49,439 - consumer0 - INFO - 2580 payments will be done in 11 batches
2023-08-23 20:11:49,439 - consumer0 - INFO - Current balance in payout address is 0 mutez.
2023-08-23 20:11:49,440 - consumer0 - ERROR - Payment attempt failed because of insufficient funds in the payout address. The current balance of 0 mutez is insufficient to pay for cycle rewards of 2,730,732,539 mutez.
2023-08-23 20:11:49,440 - consumer0 - INFO - [Plugins] Not sending notification; no plugins enabled
2023-08-23 20:11:49,442 - consumer0 - INFO - Processing completed for 2585 payment items, 2580 failed.
2023-08-23 20:11:49,444 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/done/643.csv'
2023-08-23 20:11:49,452 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/failed/643.csv'
2023-08-23 20:11:49,611 - consumer0 - INFO - Anonymous statistics disabled, (Dry run)
2023-08-23 20:11:49,611 - consumer0 - INFO - Unknown Error at payment consumer. Please consult the verbose logs! Exit code: 6
2023-08-23 20:11:53,046 - producer - INFO - Reward creation is done for cycle 643, created 3328 rewards.
2023-08-23 20:11:53,047 - producer - INFO - Run mode ONETIME satisfied. Terminating...
2023-08-23 20:11:53,047 - producer - INFO - TRD Exit triggered by producer Exit code: 0
2023-08-23 20:11:53,047 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-23 20:11:53,048 - MainThread - INFO - TRD is shutting down...
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Sensitive operations are in progress!
2023-08-23 20:11:53,048 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Shutdown. Exit code: 0
@jmo-staked good catch! I just pushed a fix, can you please try again?
That's looking better, the
Yes, I'm using two signals: SIGUSR1 indicates success and SIGUSR2 indicates failure. It's fine. Does it return 0 when the payout succeeds?
We won't have a payout to test that against until cycle 644 completes.
Two more tests passed swimmingly, both with exit code
Might have hit one other minor exit error, in this case the
2023-08-29 16:00:53,401 - MainThread - INFO - [Plugins] No plugins enabled
2023-08-29 16:00:53,402 - MainThread - INFO - Initial cycle set to -1
2023-08-29 16:00:53,411 - MainThread - INFO - Application is READY!
2023-08-29 16:00:53,411 - MainThread - INFO - --------------------------------------------
2023-08-29 16:00:54,489 - producer - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2023-08-29 16:00:54,490 - consumer0 - WARNING - Exit signal received. Terminating...
2023-08-29 16:00:54,490 - producer - INFO - TRD Exit triggered by producer Exit code: 8
2023-08-29 16:00:54,490 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-29 16:00:54,491 - MainThread - INFO - TRD is shutting down...
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Sensitive operations are in progress!
2023-08-29 16:00:54,492 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Lock file removed!
2023-08-29 16:00:54,493 - MainThread - INFO - Shutdown. Exit code: 0
I don't think this is true. I think that actually, when tzkt was unresponsive, the app exited successfully, which is not what should happen. I have now modified the code to propagate producer errors (unresponsive tzkt, disk full, etc.) to the main thread. @jmo-staked @jdsika could you please check urgently?
@jdsika pointed out that these signals are not available on Windows, so more work is needed.
#679 introduced support for exit codes, so an alert can be sent in single-shot mode when payouts fail for any reason. However, it was crude, only supporting exit code 1.
The producer thread supports many exit codes. In this case, there is a benign issue where tzkt returns "not synced" and therefore payouts fail, but this is likely temporary and will pass on the next try, so there is no need to alert. But currently it's not possible to behave differently based on the exit code, because it's always 0 or 1.
An ugly solution is to save the exit code of the child thread in a file, then read it in the main thread. That's what I am doing here.
I remain convinced that the entire thread architecture needs to go away, and we need to make TRD single-threaded again, but that's for another day.
Also:
* change the exit code of a misconfigured provider to GENERAL_ERROR because it's not really a provider error
* change the help to remove old providers that we don't support anymore
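The file-based hand-off described above could look roughly like the sketch below. It only illustrates the idea and is not the code in this PR: `write_exit_code`, `read_exit_code`, `producer`, `run_once`, the file name `exit_code` and the numeric values of the constants are all assumptions made for the example.

```python
import os
import tempfile
import threading

# Hypothetical values for illustration; not TRD's actual exit code table.
SUCCESS = 0
GENERAL_ERROR = 1
PROVIDER_ERROR = 8


def write_exit_code(path, code):
    # Write to a temporary file, then rename, so a reader never sees a
    # partially written value.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(code))
    os.replace(tmp_path, path)


def read_exit_code(path, default=GENERAL_ERROR):
    # A missing or unreadable file is treated as a failure, not a success.
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def producer(path, provider_synced):
    # Stand-in for the producer thread: record why it stopped.
    code = SUCCESS if provider_synced else PROVIDER_ERROR
    write_exit_code(path, code)


def run_once(provider_synced):
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "exit_code")
        worker = threading.Thread(target=producer, args=(path, provider_synced))
        worker.start()
        worker.join()
        # The main thread reads back the code chosen by the producer.
        return read_exit_code(path)
```

With a distinct code per failure reason, a wrapper such as a cron job could choose to ignore a temporary "not synced" failure and alert only on the other non-zero codes.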
We added code in #653 to have TRD send a non-zero exit status when anything wrong happens - including payout failures, or (relevant here) keyboard interrupts.
However, it turns out that when using TRD in runmode -2, -3, -4, the regular course of operations is to have the producer thread send a KeyboardInterrupt when done... which is then treated as error!
This is done with `_thread.interrupt_main()`. It's bad: humans send interrupts, programs should not. Instead, I'm suggesting that the producer use `SIGUSR1` to trigger a normal interruption.
This triggers the same code path as today, just with an exit status of SUCCESS. So I expect that the main thread will still wait for the consumer, and it won't actually exit before payouts are done, if there are any.
This allows kubernetes bakers (for example) to use a k8s cronjob and get alerted automatically in case of payment failures.
For the record, I still believe this entire producer/consumer logic should be thrown away and we should simply do things in order (calculate, then pay). I've ranted about this before (#491).
Work effort: 3 hours