After few days connection stops working on Android #405

IgorCordasProGlove · 2023-04-19T10:45:46Z

Describe the bug

We have an app which runs as a foreground service on Android and should keep persistent connection to IotCore for multiple days/weeks.
Connection is provisioned using fleet provisioning.
We are not sending much data over this connection but it is expected that a standard use case would send 100-1000 messages each day per device.
Above is mostly working but on some devices after a few days (usually 3) we start receiving strange errors and device does not connect properly. This is part of the log that we have to detect this issue:
[2023.03.30 19:20:24.082] I: Publishing to topic <custom_topic>
[2023.03.30 19:21:54.519] I: Cloud connection interrupted: AWS_ERROR_MQTT_UNEXPECTED_HANGUP - The connection was closed unexpectedly.
[2023.03.30 19:21:54.520] I: New cloud connection state: REGISTERED_NOT_CONNECTED
[2023.03.30 19:21:55.887] I: Cloud connection reestablished. Resumed session: true
[2023.03.30 19:21:55.887] I: New cloud connection state: REGISTERED_CONNECTED

The strange thing here is not that connection broke but that from this point in time we start receiving : AWS_ERROR_MQTT_UNEXPECTED_HANGUP in a loop repeating each 90 seconds until we restart the app process.
When connection is in this state we can publish some of the messages but our subscriptions stop working.

We investigated Cloudwatch logs for IOT Core and there is a single error before this connection error loop starts happening:
DUPLICATE_CLIENTID
as per documentation : "The client is using a client ID that is already in use. In this case, the client that is already connected will be disconnected with this disconnect reason."

But the issue here is that we are not using duplicate id anywhere, all the communication for that client id is from a single IP as the logs state and since the process is running constantly it is using a single connection.
After that error we end up in that 90 second re-connection loop on the client.

Above does not happen every time, some devices run for 5+ days without this issue but is reproducible quite often.

Doze mode or battery optimizations seem to have no effect on this, some of the users exclude our app from battery optimization and the issue still starts happening.

Expected Behavior

Connection should be able to run continually (with reconnects) and no 90 second disconnect/re-connect loops should be observed.

Current Behavior

Connection gets into half-broken state and disconnects/re-connects loop indefinitely

Reproduction Steps

We are not able to provide this at current point due to a fact that issue reproduces only on some of the devices

Possible Solution

No response

Additional Information/Context

No response

SDK version used

software.amazon.awssdk.iotdevicesdk:aws-iot-device-sdk-android:1.11.3, software.amazon.awssdk.crt:aws-crt-android:0.16.6

Environment details (OS name and version, etc.)

Android (multiple versions experience the issue)

jmklix · 2023-04-19T16:13:55Z

Thanks for taking the time to open this issue. We don't currently support Android with this sdk, but it is something that we have planned. That being said it would still be useful if you could list what Android devices that you see this on and which versions of Android. Also a minimal code sample that reproduces this would be helpful.

bretambrose · 2023-04-20T01:18:57Z

It's speculation, but the behavior you describe is possible under certain scenarios. I'm copy-pasting a response I gave to an internal query regarding similar behavior. In their case they were publishing a packet that exceeded IoT Core payload size limits, which results in an immediate disconnection. You may not be breaking that limit, but on the chance that you are breaking any packet-based limit that results in a disconnect, this may apply:

The MQTT spec requires a client to retry QoS1+ publishes that were sent but not acknowledged upon resuming a session. (MQTT311) Messages that exceed IoT Core limits cause an immediate connection termination. When you combine these two facts together, you get a "death loop." An "illegal" publish gets tried over and over and the connection gets terminated over and over.

There are number of ways to avoid this, some better than others.

Check the packet against limits before publishing and don't publish it if it will exceed a limit.
Don't use MQTT sessions. May not be an option but it would make the problem disappear if it is.
In the past, we've talked about potentially adding some kind of logic to break "death loops" Maybe a maximum retry count on individual operations. I dislike the way this would mask the underlying problem, but can't deny that it's far better to behave suboptimally then go into a total failure state.

IgorCordasProGlove · 2023-04-21T16:19:11Z

Thanks for the suggestions, this is good to keep in mind since I think parts of our code were not designed for this, I would have expected that (MqttClientConnection.publish(msg)) would throw an exception immediately but it does not so we have to be careful about the payload limit.

I did test this edge case that you mentioned and it can definitely lead to re-connection loop but without the 90 second delays that we are seeing.
In the code when I just do this:
val message ="{test}"
val msg =
if (!once) {
once = true
val random = Random(1)
val bytes = ByteArray(131_072 + 1) { // This is one byte over the Message payload limit of 128KB documented
random.nextInt().toByte()
}
MqttMessage(topic, bytes, QualityOfService.AT_LEAST_ONCE, false)
} else {
MqttMessage(topic, message.toByteArray(), QualityOfService.AT_LEAST_ONCE, false)
}
Timber.d("MESSAGE SENT SIZE : ${msg.payload.size / 1024F}KB")
client.publish(msg) // client is an instance of MqttClientConnection here

connection loop would start happening.

Here Cloudwatch logs give us a clear reason why the error is happening so this is definitely not what we encountered with the multiple days test:
{
"timestamp": "2023-04-21 13:51:18.357",
"logLevel": "ERROR",
"traceId": "xxxxx",
"accountId": "xxxxxx",
"status": "Failure",
"eventType": "Publish-In",
"protocol": "MQTT",
"topicName": "prod/gateway//topic",
"clientId": "xxxxxxxx",
"principalId": "xxxxxxxx",
"sourceIp": "xxx.xxx.xxx.xxx",
"sourcePort": 52348,
"reason": "PAYLOAD_LIMIT_EXCEEDED",
"details": "Message payload exceeds size limit for message type"
}

For the multiple days test we have a single:
"disconnectReason":"DUPLICATE_CLIENTID"
and then after that there are multiple:
"disconnectReason":"MQTT_KEEP_ALIVE_TIMEOUT" errors.

pblitz-pg · 2023-06-26T20:26:44Z

Just to follow up (stepping in for @IgorCordasProGlove), we are still seeing this error. We have been able to build a "minimal" version to reproduce this behavior that is attached here, as well as the logs for one of the app runs.

android_minimal_integration_awsiotcore-main.zip

Conclusion with the test application from above was that there might be some client-side issue within the SDK which leads to this re-connection loop being triggered, we assume this because logs with this minimal app from Cloudwatch don't mention any errors.
The logs for the "minimal" example app can be seen here: logs [05.27.2023] - minimal app.zip

If you want to ran the sample app we provided you will need to go into the AwsService.kt and add some certificate data instead of those TODOs in AwsCredentials object. Whan started app will do the fleet provisioning with your environment and then just repeatedly ping the :
"prod/gateway/$clientId/echo/time" topic, make sure to also change "repeat(1)" to a larger number if you intend to run the test on your side.

ngbahiep · 2023-07-26T07:42:11Z

hi @IgorCordasProGlove ,

I am Hiep. I am using this aws lib to connect to aws mqtt.
my android app works fine on debug mode, however, release mode got problem, it causes my app crash when connect to mqtt broker.

I believe it is because I turn on shrinkResources and minifyEnabled when building release.

Does your app got issue when you build release with shrinkResources and minifyEnabled ?

Thank you,
Hiep

IgorCordasProGlove · 2023-11-29T10:27:43Z

@ngbahiep You can use the debug version of the app, that is how we got it reproduced.

IgorCordasProGlove added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 19, 2023

jmklix added feature-request A feature should be added or improved. p2 This is a standard priority issue and removed bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

After few days connection stops working on Android #405

After few days connection stops working on Android #405

IgorCordasProGlove commented Apr 19, 2023

jmklix commented Apr 19, 2023

bretambrose commented Apr 20, 2023

IgorCordasProGlove commented Apr 21, 2023

pblitz-pg commented Jun 26, 2023

ngbahiep commented Jul 26, 2023

IgorCordasProGlove commented Nov 29, 2023

After few days connection stops working on Android #405

After few days connection stops working on Android #405

Comments

IgorCordasProGlove commented Apr 19, 2023

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

SDK version used

Environment details (OS name and version, etc.)

jmklix commented Apr 19, 2023

bretambrose commented Apr 20, 2023

IgorCordasProGlove commented Apr 21, 2023

pblitz-pg commented Jun 26, 2023

ngbahiep commented Jul 26, 2023

IgorCordasProGlove commented Nov 29, 2023