fix(storage): Fix SocketTimeoutException when executing a long multi-part upload #2973
**Issue #, if available:** N/A
**Problem:** We found that multi-part uploads that took a long time (e.g. a large file or a slow network connection) would eventually cause `SocketTimeoutException`s. After a long investigation, we found that the issue was creating a lot of `CoroutineWorker`s and enqueuing them all at the same time. Because the S3 `uploadPart` call is a suspending function, `WorkManager` would actually start ALL of the pending part uploads at once. So a part upload might "start" at T:0, get suspended because 20 other parts are trying to upload at the same time, and then not actually continue until 5 minutes later. Since its socket connection was already open, it would time out because nothing was being sent.
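A minimal sketch of the problematic pattern, with illustrative names (the `uploadPart` stand-in below is a placeholder, not the actual Amplify/S3 API):

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters
import kotlinx.coroutines.delay

// Hypothetical stand-in for the suspending S3 uploadPart call.
private suspend fun uploadPart(partNumber: Int) {
    delay(1_000) // placeholder for suspending network I/O
}

class SuspendingPartUploadWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // Because doWork() suspends instead of blocking a thread,
        // WorkManager can effectively run every enqueued part at once:
        // each part opens its connection, suspends, and may not resume
        // for minutes, by which time the idle socket has timed out.
        uploadPart(partNumber)
        return Result.success()
    }
}
```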
**Description of change:** The fix was to change the underlying `Worker` for the PartUploader from a `CoroutineWorker` (suspending) to a plain `Worker` (blocking). This fixes the issue, and I was able to upload a 400MB file over a simulated 4G connection (so like a half hour lol). Looking at the logs and network, only 3-4 Workers were active at a time.
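A sketch of the fix under the same assumptions as above (simplified; the real class names in the PR differ):

```kotlin
import android.content.Context
import androidx.work.Worker
import androidx.work.WorkerParameters
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for the suspending S3 uploadPart call, as above.
private suspend fun uploadPart(partNumber: Int) {
    delay(1_000)
}

class BlockingPartUploadWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params) {
    override fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // runBlocking keeps this executor thread occupied for the whole
        // upload, so WorkManager's thread pool naturally caps how many
        // parts transfer concurrently (the 3-4 active Workers seen in
        // the logs), instead of suspending and letting every part start.
        runBlocking { uploadPart(partNumber) }
        return Result.success()
    }
}
```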
Had to get creative with the abstraction because of the `RouterWorker` that's being used to route a work item to the appropriate subclassed `*Worker`. `BaseTransferWorker` got converted from an `abstract class` to an `interface`, and its logic moved into the abstract classes `SuspendingTransferWorker` and `BlockingTransferWorker` as appropriate; a rough sketch of the split follows.
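This sketch shows how such a hierarchy might be split; the interface member is hypothetical and not the actual Amplify API:

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.Worker
import androidx.work.WorkerParameters

// Shared contract implemented by every transfer worker, so RouterWorker
// can keep routing on one type while subclasses pick a threading model.
interface BaseTransferWorker {
    val maxRetryCount: Int // hypothetical shared member for illustration
}

// Base for transfer workers whose work is suspending.
abstract class SuspendingTransferWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params), BaseTransferWorker

// Base for transfer workers (like the part uploader) that must block
// their executor thread for the duration of the work.
abstract class BlockingTransferWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params), BaseTransferWorker
```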
**How did you test these changes?** Besides testing multi-part uploads, I regression tested all of the child Workers that got touched. Test cases:

- `WorkManager` could continue pending Work.

**Documentation update required?**
**General Checklist**
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.