Extend gray failure recentHealthTriggeredRecoveryTime to reflect any recovery trigger #11877

Draft
wants to merge 2 commits into
base: main
5 changes: 3 additions & 2 deletions fdbclient/include/fdbclient/ServerKnobs.h
@@ -776,8 +776,9 @@ class SWIFT_CXX_IMMORTAL_SINGLETON_TYPE ServerKnobs : public KnobsImpl<ServerKno
 double CC_TRACKING_HEALTH_RECOVERY_INTERVAL; // The number of recovery count should not exceed
 // CC_MAX_HEALTH_RECOVERY_COUNT within
 // CC_TRACKING_HEALTH_RECOVERY_INTERVAL.
-int CC_MAX_HEALTH_RECOVERY_COUNT; // The max number of recoveries can be triggered due to worker health within
-// CC_TRACKING_HEALTH_RECOVERY_INTERVAL
+int CC_MAX_HEALTH_RECOVERY_COUNT; // The max number recoveries that can be triggered due to worker
+// health within CC_TRACKING_HEALTH_RECOVERY_INTERVAL. This count accounts for any
+// recovery trigger including non-gray failure ones.
 bool CC_HEALTH_TRIGGER_FAILOVER; // Whether to enable health triggered failover in CC.
 int CC_FAILOVER_DUE_TO_HEALTH_MIN_DEGRADATION; // The minimum number of degraded servers that can trigger a
 // failover.
2 changes: 1 addition & 1 deletion fdbserver/ClusterController.actor.cpp
@@ -297,6 +297,7 @@ ACTOR Future<Void> clusterWatchDatabase(ClusterControllerData* cluster,

 collection = actorCollection(db->recoveryData->addActor.getFuture());
 recoveryCore = clusterRecoveryCore(db->recoveryData);
+cluster->recentHealthTriggeredRecoveryTime.push(now());

 // Master failure detection is pretty sensitive, but if we are in the middle of a very long recovery we
 // really don't want to have to start over
@@ -3061,7 +3062,6 @@ ACTOR Future<Void> workerHealthMonitor(ClusterControllerData* self) {
 if (self->shouldTriggerRecoveryDueToDegradedServers()) {
 if (SERVER_KNOBS->CC_HEALTH_TRIGGER_RECOVERY) {
 if (self->recentRecoveryCountDueToHealth() < SERVER_KNOBS->CC_MAX_HEALTH_RECOVERY_COUNT) {
-self->recentHealthTriggeredRecoveryTime.push(now());
 self->excludedDegradedServers.clear();
 for (const auto& degradedServer : self->degradationInfo.degradedServers) {
 self->excludedDegradedServers[degradedServer] = now();
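
The effect of moving the push() call can be illustrated with a small standalone sketch of sliding-window accounting. This is not FoundationDB code: the class RecoveryWindow and its methods are hypothetical, and the real body of recentRecoveryCountDueToHealth() is not shown in this diff, so the pruning logic below is an assumption. The sketch records a timestamp for every recovery, whatever its trigger, and allows a health-triggered recovery only while the count inside the interval is below the maximum.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical stand-in for the cluster controller's recovery tracking.
// `interval` plays the role of CC_TRACKING_HEALTH_RECOVERY_INTERVAL and
// `maxCount` the role of CC_MAX_HEALTH_RECOVERY_COUNT.
class RecoveryWindow {
public:
    RecoveryWindow(double interval, std::size_t maxCount)
      : interval_(interval), maxCount_(maxCount) {
        assert(interval > 0.0);
    }

    // Record a recovery at time `now`, regardless of what triggered it.
    void recordRecovery(double now) { times_.push(now); }

    // Drop timestamps older than the interval, then count what remains.
    // (Assumed pruning rule; the real implementation may differ.)
    std::size_t recentCount(double now) {
        while (!times_.empty() && now - times_.front() > interval_) {
            times_.pop();
        }
        return times_.size();
    }

    // A health-triggered recovery is allowed only under the limit.
    bool mayTriggerHealthRecovery(double now) {
        return recentCount(now) < maxCount_;
    }

private:
    double interval_;
    std::size_t maxCount_;
    std::queue<double> times_; // timestamps in arrival order
};
```

Under this model, a recovery started for an unrelated reason also consumes part of the budget, so a cluster that has recently recovered several times is less likely to start yet another recovery because of degraded workers.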