Unable to connect to AWS ElastiCache Redis after few days of successful working #3086

srikarrampa · 2024-12-17T17:09:31Z

Bug Report

The Lettuce client stops processing commands successfully after 3-4 days of a deployment running without problems.
The current client options are:

// Fail any command that takes longer than 1 second.
TimeoutOptions timeoutOptions =
    TimeoutOptions.builder().fixedTimeout(Duration.ofMillis(1000)).build();

ClientOptions clientOptions =
    ClientOptions.builder()
        .socketOptions(
            SocketOptions.builder()
                .connectTimeout(Duration.ofMillis(1000))
                .build())
        .timeoutOptions(timeoutOptions)
        .autoReconnect(true)
        .suspendReconnectOnProtocolFailure(true)
        // Reject commands while disconnected instead of queueing them.
        .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
        .build();

ClientResources clientResources =
    DefaultClientResources.builder()
        .dnsResolver(new DirContextDnsResolver())
        // Reconnect delay with full jitter, bounded between 300 ms and 700 ms.
        .reconnectDelay(
            Delay.fullJitter(
                Duration.ofMillis(300),
                Duration.ofMillis(700),
                redisConfig.getBase(),
                TimeUnit.MILLISECONDS))
        .build();

LettuceClientConfiguration.LettuceClientConfigurationBuilder builder =
    LettuceClientConfiguration.builder()
        .readFrom(ReadFrom.REPLICA_PREFERRED)
        .clientOptions(clientOptions)
        .clientResources(clientResources)
        .redisCredentialsProviderFactory(redisCredentialsProviderFactory());

Current Behavior

Stack trace

java.util.concurrent.CompletionException: io.lettuce.core.RedisCommandExecutionException: WRONGPASS invalid username-password pair or user is disabled.
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
	at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1141)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
	at io.lettuce.core.RedisHandshake.lambda$tryHandshakeResp3$2(RedisHandshake.java:134)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
	at io.lettuce.core.protocol.AsyncCommand.doCompleteExceptionally(AsyncCommand.java:143)
	at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:124)
	at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:115)
	at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:67)
	at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:762)
	at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:697)
	at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:614)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1503)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1366)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1415)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.lettuce.core.RedisCommandExecutionException: WRONGPASS invalid username-password pair or user is disabled.
	at io.lettuce.core.internal.ExceptionFactory.createExecutionException(ExceptionFactory.java:151)
	at io.lettuce.core.internal.ExceptionFactory.createExecutionException(ExceptionFactory.java:120)
	... 29 more

Error message: Disabling autoReconnect due to initialization failure

We see the above error, and after that all requests are rejected. Is there an issue with the configuration settings? The process we follow to create a connection to AWS ElastiCache is as follows:

Assume a role that provides a SessionToken.
Pass it to a class that implements RedisCredentialsProvider, which resolves the credentials (public Mono<RedisCredentials> resolveCredentials()).
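
One thing worth ruling out with the flow above is a token that is obtained once and then reused after it has expired. The following is only a sketch under stated assumptions, not Lettuce or AWS API: `ExpiringTokenCache`, `fetchToken`, the TTL, and the refresh margin are all hypothetical names and values. It re-fetches the token shortly before expiry, so that whatever is handed out at reconnect time is current.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical helper: caches a short-lived token and refreshes it shortly before it expires. */
final class ExpiringTokenCache {
    private final Supplier<String> fetchToken; // assumption: wraps the assume-role / token call
    private final Duration ttl;                // assumption: how long a fetched token stays valid
    private final Duration refreshMargin;      // refresh this long before the token expires
    private final Clock clock;

    private String token;
    private Instant expiresAt = Instant.EPOCH;

    ExpiringTokenCache(Supplier<String> fetchToken, Duration ttl, Duration refreshMargin, Clock clock) {
        this.fetchToken = fetchToken;
        this.ttl = ttl;
        this.refreshMargin = refreshMargin;
        this.clock = clock;
    }

    /** Returns the cached token, fetching a new one if none exists or expiry is within refreshMargin. */
    synchronized String get() {
        Instant now = clock.instant();
        if (token == null || !now.isBefore(expiresAt.minus(refreshMargin))) {
            token = fetchToken.get();
            expiresAt = now.plus(ttl);
        }
        return token;
    }
}
```

In a RedisCredentialsProvider implementation, resolveCredentials() could call `get()` on every invocation instead of returning a value captured at startup. The real token lifetime depends on how the role and token are issued, so the TTL passed in here has to come from that configuration.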

Input Code

// your code here;

Expected behavior/code

The client should reconnect.

Environment

Lettuce version(s): 6.5.1.RELEASE
Redis version: 7.1.0

Questions

Which ClientOptions are correct for production to allow reconnecting? Is it advisable to use all three together?
.autoReconnect(true)
.suspendReconnectOnProtocolFailure(true)
.disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
Based on the above settings, reconnect is suspended. How should we recover from this?
Restarting the deployment fixed the issue. However, we cannot restart the deployment every time. Please point us to any documentation that explains best practices for connecting to AWS ElastiCache.

tishun · 2024-12-20T10:35:11Z

Hello @srikarrampa ,

I do not see anything obviously wrong with your setup in the code samples you have provided.

When reconnecting, the driver attempts to fetch new credentials from the redisCredentialsProviderFactory.
It seems the server could not verify these new credentials and is returning

WRONGPASS invalid username-password pair or user is disabled.

Have you tried logging the credentials used and testing them manually with another client, such as redis-cli?

tishun · 2024-12-20T10:40:39Z

Questions

Which ClientOptions are correct for production to allow reconnecting? Is it advisable to use all three together?
.autoReconnect(true)
.suspendReconnectOnProtocolFailure(true)
.disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)

AWS maintains Amazon ElastiCache, so this question is probably best directed to them.
Here is their documentation on the subject

Based on the above settings, reconnect is suspended. How should we recover from this?

Reconnect is suspended because the credentials used are wrong.
The driver assumes this is not recoverable, so it stops any further attempts.
I think the proper solution is to never pass wrong credentials in the first place.

Otherwise, you can retry manually by listening for such exceptions and resetting the connection, but I am not sure this is a valid use case.
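
A manual retry of that kind could look roughly like the sketch below. It is generic, stdlib-only Java rather than Lettuce API: `RetryWithReset`, `operation`, and `resetConnection` are hypothetical names, and what resetting the connection actually involves (for example, closing and recreating it so that fresh credentials are requested) is an assumption left to the caller.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs an operation and, on failure, resets the connection before retrying. */
final class RetryWithReset {
    private RetryWithReset() {}

    static <T> T call(Callable<T> operation, Runnable resetConnection, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Assumption: real code would catch only the failure it expects
                // (such as the authentication error above), not every Exception.
                last = e;
                if (attempt < maxAttempts) {
                    resetConnection.run(); // caller-supplied; not a Lettuce call
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

Keeping `maxAttempts` small matters here: if the credentials are still wrong after a reset, further retries only repeat the same authentication failure.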

Restarting the deployment fixed the issue. However, we cannot restart the deployment every time. Please point us to any documentation that explains best practices for connecting to AWS ElastiCache.

Restarting is definitely not a solution, but it is a clue to what is wrong: it seems the credentials used after a restart differ from the ones provided during reconnect. You need to investigate why.

tishun added the label "status: waiting-for-feedback" (We need additional information before we can continue) on Dec 20, 2024
