
fix(storage): Fix SocketTimeoutException when executing a long multi-part upload #2973

Merged
3 commits merged into main on Jan 15, 2025

Conversation

vincetran
Member

  • PR title and description conform to Pull Request guidelines.

Issue #, if available: N/A

Problem: We found that multi-part uploads that took a long time (e.g. a large file being uploaded, or a slow network connection) would eventually cause SocketTimeoutExceptions. After a long investigation, we found that the issue was with creating a lot of CoroutineWorkers and queuing them up at the same time. Because the S3 uploadPart call is a suspending function, WorkManager would actually start ALL of the pending part uploads at the same time. So a part upload might "start" at T:0, get suspended because 20 other parts were trying to upload at the same time, and then not actually continue until 5 minutes later. And since its socket connection had already been opened, it would time out because nothing had been sent.

Description of change:
The fix was to change the underlying Worker for the PartUploader from a CoroutineWorker (suspending) to a plain Worker (blocking).

This fixes the issue and I was able to upload a 400MB file over a simulated 4G connection (so like a half hour lol). Looking at the logs and network, only 3-4 Workers were active at a time.

Had to get creative with the abstraction because of the RouterWorker that routes a work item to the appropriate subclassed *Worker. BaseTransferWorker was converted from an abstract class to an interface, and its logic was moved into the abstract classes SuspendingTransferWorker and BlockingTransferWorker as appropriate.
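The mechanism behind the fix can be sketched outside of Android: when tasks block their thread, a fixed-size thread pool naturally caps how many run at once, which is exactly what switching from a suspending CoroutineWorker to a blocking Worker relies on. This is an illustrative simulation (hypothetical class and method names), not the actual PartUploader code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerConcurrency {
    // Submits `tasks` blocking jobs to a fixed pool and reports the
    // highest number observed running at the same time.
    public static int maxConcurrentBlocking(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // stand-in for a blocking part upload
                } catch (InterruptedException ignored) {
                }
                active.decrementAndGet();
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException ignored) {
        }
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) {
        // With blocking tasks, concurrency never exceeds the pool size.
        System.out.println(maxConcurrentBlocking(3, 20));
    }
}
```

A suspending task, by contrast, returns its thread to the pool at each suspension point, so all queued parts can "start" (open their sockets) long before the pool has capacity to actually push bytes for them.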

How did you test these changes?
Besides testing multi-part, I regression tested all of the child Workers that got touched. Test cases:

  1. Aborting an upload
  2. Uploading a single part upload (i.e. less than 5MB file)
  3. Downloading a file
  4. Closing and reopening the app for all of the above cases to make sure WorkManager could continue pending Work.

Documentation update required?

  • No
  • Yes (Please include a PR link for the documentation update)

General Checklist

  • Added Unit Tests
  • Added Integration Tests
  • Security oriented best practices and standards are followed (e.g. using input sanitization, principle of least privilege, etc)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@vincetran vincetran requested a review from a team as a code owner January 14, 2025 22:48
@tylerjroach
Member

> This fixes the issue and I was able to upload a 400MB file over a simulated 4G connection (so like a half hour lol). Looking at the logs and network, only 3-4 Workers were active at a time.

Could you verify that if, as an app developer, you increase number of allowed workers, it will increase number of parts uploaded at a time? Making sure no speed decreases happen as a result, but otherwise this change seems good!

@vincetran
Member Author

vincetran commented Jan 15, 2025

> This fixes the issue and I was able to upload a 400MB file over a simulated 4G connection (so like a half hour lol). Looking at the logs and network, only 3-4 Workers were active at a time.
>
> Could you verify that if, as an app developer, you increase number of allowed workers, it will increase number of parts uploaded at a time? Making sure no speed decreases happen as a result, but otherwise this change seems good!

Yeah, so to increase the number of concurrent parts being uploaded, the app developer needs to increase the number of threads like so:

WorkManager.initialize(
    context,
    Configuration.Builder()
        // Uses a fixed thread pool of 8 threads.
        .setExecutor(Executors.newFixedThreadPool(8))
        .build()
)

The default number of threads is the device's core count minus one, clamped between 2 and 4. From the WorkManager Configuration class:

private @NonNull Executor createDefaultExecutor(boolean isTaskExecutor) {
    return Executors.newFixedThreadPool(
        // This value is the same as the core pool size for AsyncTask#THREAD_POOL_EXECUTOR.
        Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() - 1, 4)),
        createDefaultThreadFactory(isTaskExecutor));
}

So when I bumped the configuration to use 6 threads, I got 6 concurrent uploads with no degradation, even under the same network conditions.
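As a worked example of that default clamp (a hypothetical helper that just extracts the Math.max/Math.min expression from the snippet above):

```java
public class DefaultPoolSize {
    // Mirrors WorkManager's default executor sizing:
    // available processors minus one, clamped to the range [2, 4].
    public static int defaultPoolSize(int availableProcessors) {
        return Math.max(2, Math.min(availableProcessors - 1, 4));
    }

    public static void main(String[] args) {
        System.out.println(defaultPoolSize(1)); // prints 2
        System.out.println(defaultPoolSize(4)); // prints 3
        System.out.println(defaultPoolSize(8)); // prints 4
    }
}
```

So on a typical 8-core device the default is 4 worker threads, which matches the "only 3-4 Workers were active at a time" observation above.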

@vincetran vincetran merged commit 7581848 into main Jan 15, 2025
3 checks passed
@vincetran vincetran deleted the vincetran/multi-part-upload branch January 15, 2025 22:18