S3 Slow Transfer Speeds #839

Open
3 tasks done
zatkinson08 opened this issue Sep 27, 2024 · 3 comments

Comments

@zatkinson08

Problem description

  • What are you trying to achieve?
    Transfer an AMI binary between S3 buckets in us-east-2 and us-gov-east-1.

  • What is the expected result?
    The transfer completes with performance on par with, or close to, the AWS CLI.

  • What are you seeing instead?
    The transfer takes ~2-3 minutes per gigabyte, which is much slower than the CLI.

Steps/code to reproduce the problem

To be clear, smart_open IS working. However, I will not be able to use it for my project because the speed is too slow. My largest file is currently ~27GB; at ~2 minutes per gigabyte, that is ~54 minutes to transfer a single file. Am I using this project correctly? If there are suggestions to increase performance, I would very much appreciate more info.

import logging

from smart_open import open  # smart_open's open() handles s3:// URIs

logger = logging.getLogger(__name__)


def copy_between_s3(src_bucket, src_ami_id, dest_bucket, s3_client_src, s3_client_dest):
    logger.info("Copying between S3 buckets..")
    read_path = f"{src_bucket}/{src_ami_id}.bin"
    write_path = f"{dest_bucket}/{src_ami_id}.bin"

    # optional transport_params
    min_file_chunks_in_bytes = 1 * 1024**3  # 1 GiB minimum multipart part size
    buffer_size = 1 * 1024**3               # 1 GiB read buffer; slower than the default

    try:
        with open(f"s3://{read_path}", mode='rb',
                  transport_params={'client': s3_client_src,
                                    'buffer_size': buffer_size,
                                    'min_part_size': min_file_chunks_in_bytes}) as fr:
            with open(f"s3://{write_path}", mode='wb',
                      transport_params={'client': s3_client_dest,
                                        'buffer_size': buffer_size,
                                        'min_part_size': min_file_chunks_in_bytes}) as fw:
                for line in fr:
                    fw.write(line)
    except Exception as e:
        logger.error(f"\t copy_between_s3: {e}")

Output from both the Python (smart_open) run and the AWS CLI

## Python - Only reporting 4GB of 27GB here
2024-09-27 09:23:17,255 - smart_open.s3 - INFO - smart_open.s3.MultipartWriter('BUCKET_NAME', 'ami-002145465c5357f2f.bin'): uploading part_num: 1, 1073741824 bytes (total 1.000GB)
2024-09-27 09:24:25,093 - smart_open.s3 - INFO - smart_open.s3.MultipartWriter('BUCKET_NAME', 'ami-002145465c5357f2f.bin'): uploading part_num: 2, 1073741824 bytes (total 2.000GB)
2024-09-27 09:25:33,845 - smart_open.s3 - INFO - smart_open.s3.MultipartWriter('BUCKET_NAME', 'ami-002145465c5357f2f.bin'): uploading part_num: 3, 1073741824 bytes (total 3.000GB)
2024-09-27 09:26:37,991 - smart_open.s3 - INFO - smart_open.s3.MultipartWriter('BUCKET_NAME', 'ami-002145465c5357f2f.bin'): uploading part_num: 4, 1073741824 bytes (total 4.000GB)


## AWS CLI - Time reflects full 27GB download & upload
time aws s3 cp s3://SRC_BUCKET_NAME/ami-002145465c5357f2f.bin . # Download
# ~100MiB/s avg - Output -  121.27s user 106.51s system 84% cpu 4:28.12 total

time aws s3 cp ./ami-002145465c5357f2f.bin s3://DEST_BUCKET_NAME/ami-002145465c5357f2f.bin  # Upload
# ~117MiB/s avg - Output - 186.19s user 215.25s system 164% cpu 4:04.40 total

Versions

>>> print("Python", sys.version)
Python 3.9.2 (default, Sep 23 2024, 11:08:05) 
[GCC 11.4.0]
>>> print("smart_open", smart_open.__version__)
smart_open 7.0.4

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@ddelange
Contributor

ddelange commented Sep 27, 2024

did you try reading and writing buffer_size-byte chunks instead of reading and writing line by line? for multipart upload you can go up to smart_open.s3.MAX_PART_SIZE (5 GiB).

while (chunk := fr.read(buffer_size)):
    fw.write(chunk)

the line iterator scans every byte for newlines: big chance your code is CPU bound and not IO bound.
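
For reference, a minimal self-contained sketch of that chunked copy, assuming the same bucket/key layout and boto3 clients as the snippet in the issue; the 256 MiB chunk size and the function name are illustrative choices, not smart_open defaults:

from smart_open import open

# Illustrative chunk size; for multipart uploads it must stay within S3's
# part-size limits (at most smart_open.s3.MAX_PART_SIZE, i.e. 5 GiB).
CHUNK_SIZE = 256 * 1024**2  # 256 MiB

def copy_between_s3_chunked(src_bucket, src_ami_id, dest_bucket, s3_client_src, s3_client_dest):
    src_uri = f"s3://{src_bucket}/{src_ami_id}.bin"
    dest_uri = f"s3://{dest_bucket}/{src_ami_id}.bin"
    with open(src_uri, mode='rb',
              transport_params={'client': s3_client_src, 'buffer_size': CHUNK_SIZE}) as fr:
        with open(dest_uri, mode='wb',
                  transport_params={'client': s3_client_dest, 'min_part_size': CHUNK_SIZE}) as fw:
            while (chunk := fr.read(CHUNK_SIZE)):
                fw.write(chunk)

With min_part_size set to the chunk size, each full chunk written should correspond to roughly one multipart part upload rather than many small buffered writes.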

@ddelange
Contributor

ddelange commented Sep 28, 2024

if you have enough RAM/swap, you can save yourself some API charges by doing only a single GET (a single fr.read() with no size argument) and then a single PUT (a single fw.write() with the multipart_upload=False transport_param).

multiple chunk reads (GETs) and multiple part writes (PUTs, plus the multipart init and commit requests) are all billed by AWS
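
A minimal sketch of that single-GET/single-PUT approach, assuming the whole object fits in memory; the function name and URI arguments are illustrative:

from smart_open import open

def copy_between_s3_single_shot(src_uri, dest_uri, s3_client_src, s3_client_dest):
    # Single GET: read the entire object into memory.
    with open(src_uri, mode='rb', transport_params={'client': s3_client_src}) as fr:
        payload = fr.read()  # no size argument -> whole object

    # Single PUT: multipart_upload=False tells smart_open to upload in one request.
    with open(dest_uri, mode='wb',
              transport_params={'client': s3_client_dest, 'multipart_upload': False}) as fw:
        fw.write(payload)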

@ddelange
Contributor

ddelange commented Oct 9, 2024

hi @zatkinson08 👋

were you able to try out my suggestions?
