
DBZ-6939 Retry on NOT_FOUND status code for down/nonexistent tablet #157

Merged: 1 commit into debezium:main on Sep 19, 2023

Conversation

twthorn (Contributor) commented on Sep 18, 2023

Add a case to retry on the NOT_FOUND status code for a down/nonexistent tablet. Also add a unit test to make the Vitess error handler testable (and backfill other test cases).

We want to retry the connection in case the tablet is temporarily down or nonexistent, i.e., retry on this exception:

2023-09-15 20:01:22,367 ERROR  ||  WorkerSourceTask{id=byuser-connector-23} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted   [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
        at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:72)
        at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:78)
        at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:29)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:205)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:172)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:118)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: tablet: cell:"us_east_1e" uid:300240074 is either down or nonexistent
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
        at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:616)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:69)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:802)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:781)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more
2023-09-15 20:01:22,367 INFO   ||  Stopping down connector   [io.debezium.connector.common.BaseSourceTask] 
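For context, here is a minimal sketch of the kind of check this change implies, assuming a Debezium-style error handler that exposes an `isRetriable(Throwable)` hook. The class name and structure are illustrative, not the exact code merged in this PR:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Illustrative sketch (not the merged implementation): walk the cause
// chain and treat a gRPC NOT_FOUND status (tablet down or nonexistent)
// as retriable.
public class VitessRetrySketch {

    public static boolean isRetriable(Throwable throwable) {
        while (throwable != null) {
            if (throwable instanceof StatusRuntimeException) {
                Status.Code code = ((StatusRuntimeException) throwable).getStatus().getCode();
                if (code == Status.Code.NOT_FOUND) {
                    // Matches errors such as:
                    // NOT_FOUND: tablet: cell:"..." uid:... is either down or nonexistent
                    return true;
                }
            }
            throwable = throwable.getCause();
        }
        return false;
    }
}
```

A unit test for this style of handler can then assert directly on the boolean, e.g. that an exception built via `Status.NOT_FOUND.asRuntimeException()` is reported as retriable while an unrelated exception is not.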

twthorn (Contributor, Author) commented on Sep 18, 2023

One question: would we ever want to flip to an exclude list (i.e., return true by default, and only skip retrying on certain exceptions)? It seems likely that, as users discover more scenarios/edge cases on the Vitess side, we will keep expanding this list, as previous PRs have done.
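For illustration only, the inverted exclude-list variant might look like the sketch below; the specific non-retriable status codes here are assumptions chosen for the example, not anything proposed in this PR:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.EnumSet;
import java.util.Set;

// Illustrative exclude-list variant: retry by default, except for a
// small set of status codes assumed (for this sketch) to be fatal.
public class ExcludeListRetrySketch {

    // Hypothetical non-retriable codes, chosen only for illustration.
    private static final Set<Status.Code> NON_RETRIABLE =
            EnumSet.of(Status.Code.INVALID_ARGUMENT, Status.Code.PERMISSION_DENIED);

    public static boolean isRetriable(Throwable throwable) {
        // Status.fromThrowable walks the cause chain for a gRPC status,
        // returning Status.UNKNOWN if none is found (which would retry).
        Status status = Status.fromThrowable(throwable);
        return !NON_RETRIABLE.contains(status.getCode());
    }
}
```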

jpechane merged commit a3a6290 into debezium:main on Sep 19, 2023. 4 checks passed.
jpechane (Contributor)

@twthorn Applied, thanks. WRT the change of semantics, I am fine with that; we retry by default for most of the connectors. So if the default behaviour should be retry, then it makes sense to switch the logic.
