Improve Ra server resilience when log infrastructure encounters faults #428
Merged
kjnilsson force-pushed the log-improvements branch from 55d6bba to 5beb361 on April 11, 2024 13:57
@kjnilsson you can rebase now that #431 is merged
kjnilsson force-pushed the log-improvements branch from 75f40d4 to 41d86a8 on April 23, 2024 18:52
… encounter faults.

In particular there are many improvements and fixes relating to the server -> wal resend protocol, including:
- Bug fix to ra_log_cache that would cause most triggered resends to result in a ra process crash.
- Dropping fewer messages by using the gen_statem postpone feature.
- Ra leaders would previously just exit with wal_down; now they enter the same await_condition state, although with a shorter timeout, after which they begin a leader transfer process.
- Improved detection and availability when a command is lost on the way to the wal and no further commands are sent.

Also there is a new feature to configure, on a per-system basis, what kind of server recovery should take place when a ra system starts/restarts. There are 3 options:
- undefined: do not restart any ra server
- registered: restart all locally registered servers for the system
- mfa: call a custom function that performs the restart.

This feature will allow dynamically started ra servers to be restarted should the ra system crash and restart.

Also improvements to code coverage and refactoring. Improvements to data safety when log infra crashes.
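For readers unfamiliar with the postpone idea referenced in the commit message: Erlang's gen_statem lets a state callback postpone an event so that it is retried after the next state change instead of being dropped. The sketch below models that idea in Python, purely as an illustration. The class name `SketchServer`, the event names `wal_up` and `timeout`, and the state name `transferring_leadership` are assumptions made for this example and are not taken from Ra's code.

```python
from collections import deque

# Hypothetical control events; only "wal_down" appears in the PR text.
CONTROL_EVENTS = {"wal_down", "wal_up", "timeout"}


class SketchServer:
    """Toy model of postponing commands while the WAL is unavailable."""

    def __init__(self):
        self.state = "leader"
        self.postponed = deque()  # commands deferred, in arrival order
        self.handled = []         # commands actually processed

    def handle(self, event):
        if event in CONTROL_EVENTS:
            self._transition(event)
        elif self.state == "leader":
            self.handled.append(event)
        else:
            # Not able to write right now: keep the command instead of dropping it.
            self.postponed.append(event)

    def _transition(self, event):
        if event == "wal_down" and self.state == "leader":
            # Previously the leader would exit; here it waits instead.
            self.state = "await_condition"
        elif event == "wal_up" and self.state == "await_condition":
            self.state = "leader"
            # Replay postponed commands in their original order.
            while self.postponed:
                self.handled.append(self.postponed.popleft())
        elif event == "timeout" and self.state == "await_condition":
            # The wait was too long: hand leadership to another member.
            self.state = "transferring_leadership"
        # Any other combination is ignored in this sketch.
```

In this model a command that arrives while the server is in `await_condition` is queued and replayed once the log is writable again, which is the "dropping fewer messages" effect the commit message describes.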
kjnilsson force-pushed the log-improvements branch from 2880733 to 80d041c on April 24, 2024 14:45
michaelklishin approved these changes on Apr 24, 2024
I have pushed some mostly cosmetic changes:
- Less logging
- Clarified a few comments
michaelklishin changed the title from "ra_log fault resilience and other fixes" to "Improves log write failure resilience and other fixes" on Apr 24, 2024
kjnilsson changed the title from "Improves log write failure resilience and other fixes" to "Improve ra server resilience when log infrastructure experiences faults" on Apr 25, 2024
kjnilsson changed the title from "Improve ra server resilience when log infrastructure experiences faults" to "Improve Ra server resilience when log infrastructure experiences faults" on Apr 25, 2024
kjnilsson changed the title from "Improve Ra server resilience when log infrastructure experiences faults" to "Improve Ra server resilience when log infrastructure encounters faults" on Apr 25, 2024
Various improvements to data safety when log infrastructure processes encounter faults.
In particular there are many improvements and fixes relating to the server -> wal resend protocol, including:
- Bug fix to ra_log_cache that would cause most triggered resends to result in a ra process crash.
- Dropping fewer messages by using the gen_statem postpone feature.
- Ra leaders would previously just exit with wal_down; now they enter the same await_condition state, although with a shorter timeout, after which they begin a leader transfer process.
- Improved detection and availability when a command is lost on the way to the wal and no further commands are sent.

Also there is a new feature to configure, on a per-system basis, what kind of server recovery should take place when a ra system starts/restarts. There are 3 options:
- undefined: do not restart any ra server
- registered: restart all locally registered servers for the system
- mfa: call a custom function that performs the restart.

This feature will allow dynamically started ra servers to be restarted should the ra system crash and restart.
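The three recovery options can be pictured as a small dispatch that runs when the system starts. The following Python sketch is only a model of that decision under stated assumptions. It is not Ra's implementation or configuration syntax, and `recover_servers`, `start_server` and the tuple form used for the mfa case are hypothetical names chosen for the example.

```python
def recover_servers(strategy, registered, start_server):
    """Return the names of servers restarted under the given strategy.

    strategy: None (models 'undefined'), the string "registered", or a
              (function, args) tuple that models the 'mfa' option.
    registered: iterable of locally registered server names.
    start_server: callable that restarts a single server by name.
    """
    if strategy is None:
        # 'undefined': do not restart any server.
        return []
    if strategy == "registered":
        # 'registered': restart every locally registered server.
        restarted = []
        for name in registered:
            start_server(name)
            restarted.append(name)
        return restarted
    if isinstance(strategy, tuple) and len(strategy) == 2 and callable(strategy[0]):
        # 'mfa': delegate entirely to a user-supplied function, which is
        # assumed here to return the names it restarted.
        func, args = strategy
        return list(func(*args))
    raise ValueError(f"unknown recovery strategy: {strategy!r}")
```

The custom-function case is what would let servers that were started dynamically, and so may not be covered by the registered option, come back after a system restart, since the callback can decide for itself which servers to start.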
Also improvements to code coverage and refactoring.
Fixes: #416