Performance Issues writing WAL files to S3 #1018

Open
RickVenema opened this issue Sep 19, 2024 · 6 comments

Comments

@RickVenema

As discussed with @mnencia last Friday:

We use CloudNativePG and write backups and WAL archives to a local S3 appliance.

We noticed that writing WAL files did not perform well, causing the WAL location to flood and PostgreSQL to run into issues. While investigating options to optimize the connection to our local S3 appliance, we tried to replicate the situation with a Python script that generates 16MB of random data and uploads it to our local S3 appliance using the Boto3 library. We noticed that our script was surprisingly faster than the implementation in barman-cloud-wal-archive used for archiving WAL to S3. Note that full backups do reach the same performance we achieved with our Python script.

We then changed our script to use the approach built into barman-cloud-wal-archive and reproduced the same low performance we experienced with barman-cloud.

We suspect the difference lies in the way barman-cloud uses the Boto3 API. When using the low-level API (put_object), performance is high. When using the high-level API (upload_fileobj), we experience low performance.

Note that the multipart upload used by full backups also delivers better performance.

We don't know the exact reason; it might be that upload_fileobj streams from a file object while put_object uses data that is already loaded into memory. Also note that upload_fileobj is the preferred method for uploading files >= 5GB, but WAL segment files are only 16MB.

We will also submit our test script for your information. Could you please consider changing the code from upload_fileobj to put_object? If you want, we could provide a pull request.
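
For reference, a minimal sketch of the change we have in mind. The client setup, bucket name, key prefix, and segment name below are illustrative placeholders (not the actual barman-cloud code), assuming an existing boto3 client:

import boto3

# Hypothetical client; endpoint_url and credentials omitted for brevity
s3_client = boto3.client("s3")

# Current behaviour (high-level transfer API): upload_fileobj streams the
# file object through the managed transfer path.
with open("000000010000000000000001", "rb") as wal_file:
    s3_client.upload_fileobj(
        Fileobj=wal_file,
        Bucket="my-bucket",
        Key="wals/000000010000000000000001",
    )

# Proposed change (low-level API): read the 16MB segment into memory and
# send it as a single PUT, which was noticeably faster in our tests.
with open("000000010000000000000001", "rb") as wal_file:
    s3_client.put_object(
        Body=wal_file.read(),
        Bucket="my-bucket",
        Key="wals/000000010000000000000001",
    )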

@RickVenema
Author

@sebasmannem
Please follow this issue

@RickVenema
Author

import json
import os
import time

import boto3
from botocore.config import Config


WAL_SIZE = 16000000
ITERATIONS = 100

# Client configuration used for the S3 resource below; the original
# definition was not included here, so botocore defaults are assumed.
CONFIG = Config()


def read_creds(cred_file):
    """
    Function to read credentials configuration for the S3 connection.
    Must contain the following fields in JSON format:
    - bucket_name
    - endpoint_url
    - access_key
    - secret_key

    :param cred_file: The file location of the credentials file
    :return: credentials dictionary
    """
    with open(cred_file, "r") as f:
        cred_data = json.load(f)
    return cred_data


def write_wal_file():
    # Simulate a WAL segment: a 16MB file of zero bytes
    with open("test.wal", "wb") as f:
        f.write(bytes(WAL_SIZE))


def create_session(creds):
    """
    Create S3 connection based on Session and resource
    :return: client, latency in ns
    """
    start = time.perf_counter_ns()
    s3 = boto3.Session(
        aws_access_key_id=creds['access_key'],
        aws_secret_access_key=creds['secret_key']
    )
    s3_client = s3.resource("s3", endpoint_url=creds['endpoint_url'],
                            config=CONFIG)
    end = time.perf_counter_ns() - start
    print(f"Connection made in {end / 1000000}ms ")
    return s3_client


def run_session_file_in_memory(creds):
    print("Session with put_object")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(0, ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as f:
            data = f.read()
        session.meta.client.put_object(Body=data, Bucket=creds['bucket_name'], Key='00_test_wal')
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s

    # Calculate Latency
    latency = sum([_ / 1000000 for _ in latency]) / ITERATIONS
    print(f"Average Latency: {latency}ms")

    # Calculate Throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")


def run_session_file_reading(creds):
    print("Session with upload_fileobj")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(0, ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as wal_file:
            session.meta.client.upload_fileobj(
                Fileobj=wal_file, Bucket=creds['bucket_name'], Key='00_test_wal'
            )
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s

    # Calculate Latency
    latency = sum([_ / 1000000 for _ in latency]) / ITERATIONS
    print(f"Average Latency: {latency}ms")

    # Calculate Throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")


def run_tests():
    creds = read_creds("creds.json")
    os.environ['REQUESTS_CA_BUNDLE'] = creds['REQUESTS_CA_BUNDLE']

    run_session_file_reading(creds)
    run_session_file_in_memory(creds)


if __name__ == '__main__':
    run_tests()

"""
creds.json looks like this:
{
  "bucket_name": "",
  "endpoint_url": "",
  "access_key": "",
  "secret_key": "",
  "REQUESTS_CA_BUNDLE": ""
}
"""

@gustabowill
Contributor

Hi @RickVenema, thanks for reporting this. I've been looking into this recently, and I could indeed see some improvement when switching methods, although probably not enough to solve the WAL-flooding scenario you mentioned, for example. I'm curious what results you got from your benchmarking tests; would you mind sharing them?

@martinmarques
Contributor

@RickVenema we have a PoC (almost ready to be added to the next release) which increases the upload speed of WAL files by 15-25%. These speeds were measured against a Postgres server, not by creating random 16MB files.

Does that align with what you saw when you were testing the two upload methods?

@RickVenema
Author

Hi,

Yes, that aligns with my results. I got a little more (around 40-50%), but I think that comes from my using 16MB files of only zero bytes in a simulated environment instead of actual WAL files.

471.68MBit/s vs 319.74MBit/s (roughly a 47% improvement) are the results I got on our infrastructure.

Great to hear that it speeds up actual workloads by 15-25%; I can't wait to test the new version in our environment!

@martinmarques
Contributor

Thanks @RickVenema. If nothing changes, this will be part of the next release, sometime in mid/late November.
