[WIP]KAFKA-18034: CommitRequestManager should fail pending requests on fatal coordinator errors #18548

Open
wants to merge 10 commits into
base: trunk
from

m1a2st
Contributor

@m1a2st m1a2st commented Jan 15, 2025

Jira: https://issues.apache.org/jira/browse/KAFKA-18034

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions github-actions bot added triage PRs from the community core Kafka Broker consumer clients labels Jan 15, 2025
@m1a2st m1a2st changed the title KAFKA-18034: CommitRequestManager should fail pending requests on fatal coordinator errors [WIP]KAFKA-18034: CommitRequestManager should fail pending requests on fatal coordinator errors Jan 15, 2025
@m1a2st
Contributor Author

m1a2st commented Jan 15, 2025

Hello @lianetm, I will test in this PR

@kirktrue
Collaborator

@m1a2st—is there a summary of the issue that caused the previous fix to be reverted? Thanks!

@m1a2st
Contributor Author

m1a2st commented Jan 18, 2025

Sorry, I still can't reproduce the failing test on my local machine :(

private void maybePropagateCoordinatorFatalErrorEvent() {
    coordinatorRequestManager.fatalError()
        .ifPresent(fatalError -> backgroundEventHandler.add(new ErrorEvent(fatalError)));
    coordinatorRequestManager.clearFatalError();
Contributor Author

Hello @kirktrue and @lianetm, I think the main cause is that we need to clear the fatal error after we pass the exception to the background event handler. I am still tracing which code path leads to the test timeout.
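To make the propagate-then-clear idea concrete, here is a minimal standalone sketch. It is not the Kafka implementation: `CoordinatorState`, `BackgroundEvents`, and `maybePropagateFatalError` are hypothetical stand-ins for the coordinator request manager and the background event handler, and the assumption is that the propagation method runs on every poll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class FatalErrorPropagationSketch {

    /** Hypothetical stand-in for the component that records a fatal coordinator error. */
    static class CoordinatorState {
        private Optional<Throwable> fatalError = Optional.empty();

        void recordFatalError(Throwable t) { fatalError = Optional.of(t); }
        Optional<Throwable> fatalError() { return fatalError; }
        void clearFatalError() { fatalError = Optional.empty(); }
    }

    /** Hypothetical stand-in for the queue of events handed to the application thread. */
    static class BackgroundEvents {
        final List<Throwable> errors = new ArrayList<>();

        void add(Throwable t) { errors.add(t); }
    }

    /** Propagates a pending fatal error once, then clears it so later polls do not resend it. */
    static void maybePropagateFatalError(CoordinatorState state, BackgroundEvents events) {
        state.fatalError().ifPresent(events::add);
        state.clearFatalError();
    }

    public static void main(String[] args) {
        CoordinatorState state = new CoordinatorState();
        BackgroundEvents events = new BackgroundEvents();

        state.recordFatalError(new IllegalStateException("fatal coordinator error"));

        // Simulate three consecutive polls after a single fatal error.
        for (int i = 0; i < 3; i++) {
            maybePropagateFatalError(state, events);
        }

        System.out.println("errors delivered: " + events.errors.size());
    }
}
```

In this sketch the error reaches the event list on the first poll only. Without the `clearFatalError()` call, every later poll would enqueue the same error again.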

Member

@lianetm lianetm Jan 22, 2025

ack, I'll take a closer look too asap. In the meantime I re-triggered the build so we can keep validating it.

@github-actions github-actions bot removed the triage PRs from the community label Jan 22, 2025
@lianetm
Member

lianetm commented Jan 23, 2025

@m1a2st could you solve the conflicts? Also, just to double check, the missing bit was clearing the fatal error after propagating it to the app thread, no other changes? Thanks!

# Conflicts:
#	core/src/test/scala/integration/kafka/server/QuorumTestHarness.scala
4 participants