
DBZ-6939 Retry on NOT_FOUND status code for down/nonexistent tablet #157

Merged: 1 commit into debezium:main on Sep 19, 2023

Conversation

twthorn (Contributor) commented on Sep 18, 2023

Add a case to retry on the NOT_FOUND status code for a down/nonexistent tablet. Also add a unit test to make the Vitess error handler testable (and backfill other test cases).

We want to retry the connection in case the tablet is temporarily down or nonexistent, i.e., retry on this exception:

2023-09-15 20:01:22,367 ERROR  ||  WorkerSourceTask{id=byuser-connector-23} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted   [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
        at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:72)
        at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:78)
        at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:29)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:205)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:172)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:118)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: tablet: cell:"us_east_1e" uid:300240074 is either down or nonexistent
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
        at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:616)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:69)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:802)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:781)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more
2023-09-15 20:01:22,367 INFO   ||  Stopping down connector   [io.debezium.connector.common.BaseSourceTask] 
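For context, here is a minimal sketch of the kind of check this change implies, assuming a Debezium-style error handler that exposes an `isRetriable(Throwable)` hook. The class name and structure are illustrative, not the exact code merged in this PR:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Illustrative sketch (not the merged implementation): walk the cause
// chain and treat a gRPC NOT_FOUND status (tablet down or nonexistent)
// as retriable.
public class VitessRetrySketch {

    public static boolean isRetriable(Throwable throwable) {
        while (throwable != null) {
            if (throwable instanceof StatusRuntimeException) {
                Status.Code code = ((StatusRuntimeException) throwable).getStatus().getCode();
                if (code == Status.Code.NOT_FOUND) {
                    // Matches errors such as:
                    // NOT_FOUND: tablet: cell:"..." uid:... is either down or nonexistent
                    return true;
                }
            }
            throwable = throwable.getCause();
        }
        return false;
    }
}
```

A unit test for this style of handler can then assert directly on the boolean, e.g. that an exception built via `Status.NOT_FOUND.asRuntimeException()` is reported as retriable while an unrelated exception is not.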

twthorn (Contributor, Author) commented on Sep 18, 2023

One question: would we ever want to flip to an exclude list (i.e., return true by default, and only skip retrying on certain exceptions)? It seems likely that, as users discover more scenarios/edge cases on the Vitess side, we will keep expanding this list, as previous PRs have done.
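For illustration only, the inverted exclude-list variant might look like the sketch below; the specific non-retriable status codes here are assumptions chosen for the example, not anything proposed in this PR:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.EnumSet;
import java.util.Set;

// Illustrative exclude-list variant: retry by default, except for a
// small set of status codes assumed (for this sketch) to be fatal.
public class ExcludeListRetrySketch {

    // Hypothetical non-retriable codes, chosen only for illustration.
    private static final Set<Status.Code> NON_RETRIABLE =
            EnumSet.of(Status.Code.INVALID_ARGUMENT, Status.Code.PERMISSION_DENIED);

    public static boolean isRetriable(Throwable throwable) {
        // Status.fromThrowable walks the cause chain for a gRPC status,
        // returning Status.UNKNOWN if none is found (which would retry).
        Status status = Status.fromThrowable(throwable);
        return !NON_RETRIABLE.contains(status.getCode());
    }
}
```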

jpechane merged commit a3a6290 into debezium:main on Sep 19, 2023. 4 checks passed.
jpechane (Contributor)

@twthorn Applied, thanks. WRT the change of semantics, I am fine with that; we retry by default for most of the connectors. So if the default behaviour should be retry, then it makes sense to switch the logic.
