fix(storage): Fix SocketTimeoutException when executing a long multi-part upload #2973
**Issue #, if available:** N/A
**Problem:** We found that multi-part uploads that took a long time (e.g. a large file or a slow network connection) would eventually cause `SocketTimeoutException`s. After a long investigation, we found that the issue was creating a lot of `CoroutineWorker`s and enqueuing them all at the same time. Because the S3 `uploadPart` call is a suspending function, `WorkManager` would actually start ALL of the pending part uploads at once. So a part upload might "start" at T:0, get suspended because 20 other parts are trying to upload at the same time, and then not actually continue until 5 minutes later. Since its socket connection was already open, it would time out because nothing was being sent.
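A minimal sketch of the problematic pattern, with illustrative names (the `uploadPart` stand-in below is a placeholder, not the actual Amplify/S3 API):

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters
import kotlinx.coroutines.delay

// Hypothetical stand-in for the suspending S3 uploadPart call.
private suspend fun uploadPart(partNumber: Int) {
    delay(1_000) // placeholder for suspending network I/O
}

class SuspendingPartUploadWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // Because doWork() suspends instead of blocking a thread,
        // WorkManager can effectively run every enqueued part at once:
        // each part opens its connection, suspends, and may not resume
        // for minutes, by which time the idle socket has timed out.
        uploadPart(partNumber)
        return Result.success()
    }
}
```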
**Description of change:** The fix was to change the underlying `Worker` for the PartUploader from a `CoroutineWorker` (suspending) to a plain `Worker` (blocking). This fixes the issue, and I was able to upload a 400MB file over a simulated 4G connection (so like a half hour lol). Looking at the logs and network, only 3-4 Workers were active at a time.
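A sketch of the fix under the same assumptions as above (simplified; the real class names in the PR differ):

```kotlin
import android.content.Context
import androidx.work.Worker
import androidx.work.WorkerParameters
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for the suspending S3 uploadPart call, as above.
private suspend fun uploadPart(partNumber: Int) {
    delay(1_000)
}

class BlockingPartUploadWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params) {
    override fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // runBlocking keeps this executor thread occupied for the whole
        // upload, so WorkManager's thread pool naturally caps how many
        // parts transfer concurrently (the 3-4 active Workers seen in
        // the logs), instead of suspending and letting every part start.
        runBlocking { uploadPart(partNumber) }
        return Result.success()
    }
}
```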
Had to get creative with the abstraction because of the `RouterWorker` that's being used to route a work item to the appropriate subclassed `*Worker`. `BaseTransferWorker` got converted from an `abstract class` to an `interface`, and its logic moved into the abstract classes `SuspendingTransferWorker` and `BlockingTransferWorker` as appropriate; a rough sketch of the split follows.
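This sketch shows how such a hierarchy might be split; the interface member is hypothetical and not the actual Amplify API:

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.Worker
import androidx.work.WorkerParameters

// Shared contract implemented by every transfer worker, so RouterWorker
// can keep routing on one type while subclasses pick a threading model.
interface BaseTransferWorker {
    val maxRetryCount: Int // hypothetical shared member for illustration
}

// Base for transfer workers whose work is suspending.
abstract class SuspendingTransferWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params), BaseTransferWorker

// Base for transfer workers (like the part uploader) that must block
// their executor thread for the duration of the work.
abstract class BlockingTransferWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params), BaseTransferWorker
```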
**How did you test these changes?** Besides testing multi-part uploads, I regression tested all of the child Workers that got touched. Test cases:

- `WorkManager` could continue pending Work.

**Documentation update required?**
**General Checklist**
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.