Quorum Reader - Include Primary to Meet Quorum when One Secondary Replica is Non-Responsive #4440

Open
kundadebdatta opened this issue Apr 18, 2024 · 1 comment · May be fixed by #4970

@kundadebdatta
Member

kundadebdatta commented Apr 18, 2024

Problem:

With our Bounded Staleness consistency settings, the quorum reader normally reads data from two secondary replicas and chooses the most recent version. It then verifies consistency by sending "head requests" to all secondaries (and to the primary if the replica set has fewer than 4 replicas).
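As a rough illustration of the normal read path described above, here is a minimal Python sketch. It is not the SDK's code: `ReplicaResponse`, `lsn`, and `pick_most_recent` are hypothetical names, and using a single integer as the version marker is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicaResponse:
    replica: str  # hypothetical replica name
    lsn: int      # assumed version marker; a higher value means more recent
    value: str


def pick_most_recent(
    responses: List[Optional[ReplicaResponse]], read_quorum: int = 2
) -> Optional[ReplicaResponse]:
    """Return the freshest response, or None if fewer than read_quorum replicas answered."""
    # A None entry models a replica that did not respond.
    answered = [r for r in responses if r is not None]
    if len(answered) < read_quorum:
        return None
    return max(answered, key=lambda r: r.lsn)
```

With two responding secondaries the function returns the one with the higher `lsn`; if one of them does not respond, it returns `None`, which corresponds to the quorum not being met.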

Recently, during a deployment, one secondary became unavailable due to the update, and another crashed. This left us with a replica set of only three (two secondaries and the primary).

The system attempted to read from the two remaining secondaries, but one was unreachable due to the crash. This triggered a validation check, which normally would involve reading from the primary if no data was retrieved from the secondaries. However, in this case, the validation logic prevented reading from the primary because the replica set size was 3 (it expected at least 2 responses for a quorum).

This validation failure raised an exception and triggered retries, but the retries did not resolve the issue.

Proposed Solution:

To avoid this issue, we propose modifying the system's behavior when the replica set size is reduced and one secondary is unavailable. Instead of requiring a quorum from the remaining secondaries, we would include the primary in the selection process. This would allow the system to read from all available replicas and establish consistency. This change would ensure the system remains operational even during similar failures.
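The proposed behavior could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `select_replicas` and `include_primary_fallback` are hypothetical names, the primary is assumed to be reachable, and the real change would live in the SDK's quorum reader rather than in a helper like this.

```python
from typing import Dict, List, Optional


def select_replicas(
    secondaries: Dict[str, bool],
    primary: str,
    read_quorum: int = 2,
    include_primary_fallback: bool = True,
) -> Optional[List[str]]:
    """Pick replicas to read from; `secondaries` maps replica name -> reachable."""
    reachable = [name for name, is_up in secondaries.items() if is_up]

    # Normal case: enough secondaries are reachable to meet the quorum.
    if len(reachable) >= read_quorum:
        return reachable[:read_quorum]

    # Proposed fallback: add the primary (assumed reachable) to the candidates.
    if include_primary_fallback:
        candidates = reachable + [primary]
        if len(candidates) >= read_quorum:
            return candidates[:read_quorum]

    # Quorum cannot be met; the caller would raise or retry.
    return None
```

For the scenario in this issue (two secondaries, one unreachable), `select_replicas({"s1": True, "s2": False}, "p")` selects `["s1", "p"]`, whereas passing `include_primary_fallback=False` returns `None`, which models the failure described above.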

@kundadebdatta kundadebdatta added the bug Something isn't working label Apr 18, 2024
@kundadebdatta kundadebdatta self-assigned this Apr 18, 2024
@kundadebdatta kundadebdatta moved this to Approved in Azure Cosmos SDKs Apr 18, 2024
@FabianMeiswinkel
Member

When fixing this, please also make the error messages of the GoneExceptions more useful (for example, PartitionId is currently missing).

https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB?path=/Product/Microsoft.Azure.Documents/SharedFiles/QuorumReader.cs&version=GBmaster&line=139&lineEnd=144&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents
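To make the request above concrete, here is a small Python sketch of an error that carries identifying context in its message. `GoneError` is a hypothetical stand-in, not the SDK's `GoneException`, and the `ReplicaSetSize` and `ResponsesReceived` fields are assumptions about what might be useful alongside the partition ID.

```python
from typing import Optional


class GoneError(Exception):
    """Hypothetical stand-in for the SDK's GoneException; not the real type."""

    def __init__(
        self,
        reason: str,
        partition_id: Optional[str] = None,
        replica_set_size: Optional[int] = None,
        responses_received: Optional[int] = None,
    ) -> None:
        self.partition_id = partition_id
        # Always emit the partition ID field so that its absence is visible.
        details = [f"PartitionId={partition_id or 'unknown'}"]
        if replica_set_size is not None:
            details.append(f"ReplicaSetSize={replica_set_size}")
        if responses_received is not None:
            details.append(f"ResponsesReceived={responses_received}")
        super().__init__(f"{reason} ({', '.join(details)})")
```

An error built as `GoneError("Read quorum not met", partition_id="p-42", replica_set_size=3, responses_received=1)` then renders as `Read quorum not met (PartitionId=p-42, ReplicaSetSize=3, ResponsesReceived=1)`.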

@NaluTripician NaluTripician self-assigned this Jun 18, 2024
@kundadebdatta kundadebdatta moved this from Approved to In Progress in Azure Cosmos SDKs Jun 25, 2024