Performance Issues writing WAL files to S3 #1018
@sebasmannem:
```python
import json
import os
import time

import boto3
from botocore.config import Config

WAL_SIZE = 16000000  # 16 MB, the size of a PostgreSQL WAL segment
ITERATIONS = 100

# The original snippet referenced an undefined CONFIG; a default
# botocore Config is assumed here.
CONFIG = Config()


def read_creds(cred_file):
    """
    Read the credentials configuration for the S3 connection.
    Must contain the following fields in JSON format:
    - bucket_name
    - endpoint_url
    - access_key
    - secret_key
    :param cred_file: the location of the credentials file
    :return: credentials dictionary
    """
    with open(cred_file, "r") as f:
        cred_data = json.load(f)
    return cred_data


def write_wal_file():
    # bytes(WAL_SIZE) produces 16 MB of zero bytes, simulating a WAL segment
    with open("test.wal", "wb") as f:
        f.write(bytes(WAL_SIZE))


def create_session(creds):
    """
    Create an S3 connection based on a Session and resource.
    :return: S3 service resource
    """
    start = time.perf_counter_ns()
    s3 = boto3.Session(
        aws_access_key_id=creds['access_key'],
        aws_secret_access_key=creds['secret_key']
    )
    s3_client = s3.resource("s3", endpoint_url=creds['endpoint_url'],
                            config=CONFIG)
    end = time.perf_counter_ns() - start
    print(f"Connection made in {end / 1000000}ms")
    return s3_client


def run_session_file_in_memory(creds):
    print("Session with put_object")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as f:
            data = f.read()
        session.meta.client.put_object(Body=data, Bucket=creds['bucket_name'], Key='00_test_wal')
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s
    # Calculate latency
    avg_latency = sum(ns / 1000000 for ns in latency) / ITERATIONS
    print(f"Average Latency: {avg_latency}ms")
    # Calculate throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")


def run_session_file_reading(creds):
    print("Session with upload_fileobj")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as wal_file:
            session.meta.client.upload_fileobj(
                Fileobj=wal_file, Bucket=creds['bucket_name'], Key='00_test_wal'
            )
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s
    # Calculate latency
    avg_latency = sum(ns / 1000000 for ns in latency) / ITERATIONS
    print(f"Average Latency: {avg_latency}ms")
    # Calculate throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")


def run_tests():
    creds = read_creds("creds.json")
    os.environ['REQUESTS_CA_BUNDLE'] = creds['REQUESTS_CA_BUNDLE']
    run_session_file_reading(creds)
    run_session_file_in_memory(creds)


if __name__ == '__main__':
    run_tests()

"""
creds.json looks like this:
{
    "bucket_name": "",
    "endpoint_url": "",
    "access_key": "",
    "secret_key": "",
    "REQUESTS_CA_BUNDLE": ""
}
"""
```
Hi @RickVenema, thanks for reporting this. I've been looking into it recently and could indeed see some improvement when switching methods, although probably not enough to solve the WAL-flooding scenario you mentioned, for example. I'm curious to know what results you got from your benchmarking tests; would you mind sharing?
@RickVenema we have a PoC (almost ready to add to the next release) which increases WAL file upload speed by 15-25%. These speeds were tested against a Postgres server, not by creating random 16MB files. Does that align with what you saw when you were testing the two upload methods?
Hi, yes, that aligns with my results. I got a bit more (around 40-50%), but I think that comes from me using 16MB files of only zero bytes in a simulated environment instead of actual WAL files. 471.68MBit/s vs 319.74MBit/s are the results I got on our infrastructure. Great to hear that it speeds up actual workloads by 15-25%; can't wait to test the new version in our environment!
Thanks @RickVenema, this will, if nothing changes, be part of the next release sometime in mid/late November.
As discussed with @mnencia last Friday:
We use CloudNativePG and write backups and WAL archiving to a local S3 appliance.
We noticed that writing WAL files did not perform well, causing the WAL location to flood and PostgreSQL to run into issues. While investigating options to optimize the connection to our local S3 appliance, we tried to replicate the situation with a Python script which generates a 16MB file and uploads it to our local S3 appliance using the Boto3 library. We noticed that our script was surprisingly faster than the implementation in barman-cloud-wal-archive used for archiving WAL to S3. Full backups, by the way, do reach the same performance we achieved with our Python script.
We then changed our script to use the same implementation as built into barman-cloud-wal-archive, and reproduced the same low performance we experienced with barman-cloud.
We suspect the difference lies in the way barman-cloud uses the Boto3 API. When using the low-level API (put_object), performance is high; when using the high-level API (upload_fileobj), we experience low performance.
Note that the multipart upload used by full backups also delivers better performance. The two call styles are contrasted in the sketch below.
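For reference, a minimal sketch of the two call styles (the bucket name and a default boto3 client are placeholders, not the names from our environment):

```python
import boto3

# Hypothetical client; substitute your own endpoint and credentials.
client = boto3.client("s3")

# Low-level API: the whole body goes up in a single PUT request.
with open("test.wal", "rb") as f:
    client.put_object(Body=f.read(), Bucket="my-bucket", Key="00_test_wal")

# High-level API: boto3's transfer manager streams from the file object
# and, above its multipart threshold (8MB by default), splits the upload
# into several multipart requests.
with open("test.wal", "rb") as f:
    client.upload_fileobj(Fileobj=f, Bucket="my-bucket", Key="00_test_wal")
```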
We don't know the exact reason; it might be that upload_fileobj streams from a file object while put_object uses data already loaded into memory. Also note that upload_fileobj is the preferred method for uploading files >=5GB, but WAL segment files are only 16MB.
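One way to test that hypothesis without switching APIs is to raise the transfer manager's multipart threshold so a 16MB segment goes up in a single request. A sketch using boto3's TransferConfig; the threshold value and names are our assumptions:

```python
import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client("s3")  # hypothetical client; substitute your own

# The default multipart_threshold is 8MB, so a 16MB WAL segment is split
# into multipart chunks; raising the threshold above the segment size
# makes upload_fileobj send it as a single request instead.
single_part = TransferConfig(multipart_threshold=64 * 1024 * 1024)

with open("test.wal", "rb") as wal_file:
    client.upload_fileobj(
        Fileobj=wal_file,
        Bucket="my-bucket",  # placeholder bucket name
        Key="00_test_wal",
        Config=single_part,
    )
```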
We will also submit our test script for your information. Could you please consider changing the code from upload_fileobj to put_object? If you want, we could provide a pull request.
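For illustration, the change we are suggesting would look roughly like this; a sketch only, not Barman's actual code, and archive_wal_segment is a hypothetical helper:

```python
def archive_wal_segment(client, bucket, key, wal_path):
    """Upload one WAL segment with the low-level single-request API."""
    # A 16MB segment fits comfortably in memory, so reading it fully and
    # calling put_object avoids the transfer manager's multipart overhead.
    with open(wal_path, "rb") as wal_file:
        client.put_object(Body=wal_file.read(), Bucket=bucket, Key=key)
```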