Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__smallest_size_t might be a huge pessimization on most arhitectures #53

Open
cristi1990an opened this issue Dec 23, 2024 · 4 comments
Open

Comments

@cristi1990an
Copy link

While a sound idea in theory, I fear that this optimization's impact on performance far outweights any benefits it brings in size. In the restricted testing I've done, for repeated calls to push_back, using an uint8_t/uint16_t as the size resulted in more than 10 times worse performance than when using std::size_t. I feel like this is something that should be investigated a bit if the implementation is to be used in a performance-critical context.

@wusatosi
Copy link
Member

Hi this sounds like a good area to research. I would expect this as well, considering the improvement in memory usage could be quickly voided by alignment. Though I am not expecting performance penalty this would result in.
Can you share your benchmark here?

@cristi1990an
Copy link
Author

#include <chrono>
#include <print>
#include "inplace_vector.hpp"

struct trivial
{
	trivial() = default;
	trivial(const trivial&) = default;
	trivial& operator=(const trivial&) = default;
};

struct non_trivial
{
	non_trivial() {}
	non_trivial(const non_trivial&) { }
	non_trivial& operator=(const non_trivial&) { return *this; };
};

static_assert(sizeof(beman::inplace_vector<trivial, 4>) == 5);
static_assert(sizeof(beman::inplace_vector<trivial, 4, false>) == 16);
static_assert(sizeof(beman::inplace_vector<non_trivial, 4>) == 5);
static_assert(sizeof(beman::inplace_vector<non_trivial, 4, false>) == 16);

template <typename T, std::size_t Size, bool SizeOptimization>
auto benchmark_loop()
{
	volatile std::size_t sum = 0; // so the below code isn't optimized away
	const auto start = std::chrono::high_resolution_clock::now();

	beman::inplace_vector<T, Size, SizeOptimization> vec;

	for (auto _ : std::views::iota(std::size_t{}, Size))
	{
		sum += vec.size();
		vec.emplace_back();
	}

	const auto stop = std::chrono::high_resolution_clock::now();
	return std::make_pair((stop - start).count(), sum);
}

template <typename T, std::size_t Size, bool SizeOptimization, std::size_t Runs>
auto benchmark()
{
	volatile std::size_t sum = 0;
	for (auto _ : std::views::iota(std::size_t{}, Runs))
	{
		sum += benchmark_loop<T, Size, SizeOptimization>().first;
	}
	return sum / Runs;
}

template <typename T, std::size_t Size, std::size_t Runs>
void display_benchmark(std::string_view type_name)
{
	std::println("Benchmark result with T = {}, Size = {} ({} runs):\n"
		"With size opt: {} Without size opt: {}\n"
		"Without size opt: {} With size opt: {}\n",
		type_name, Size, Runs,
		benchmark<T, Size, true, Runs>(),
		benchmark<T, Size, false, Runs>(),
		benchmark<T, Size, false, Runs>(),
		benchmark<T, Size, true, Runs>());
}

int main()
{
	display_benchmark<int, 64 * 32, 50000>("int");
	display_benchmark<char, 126 * 32, 50000>("char");
	display_benchmark<std::string, 64 * 32, 50000>("std::string");
	display_benchmark<std::vector<int>, 64 * 32, 50000>("std::vector<int>");

	display_benchmark<std::vector<int>, 64 * 32, 50000>("std::vector<int>");
	display_benchmark<std::string, 64 * 32, 50000>("std::string");
	display_benchmark<char, 126 * 32, 50000>("char");
	display_benchmark<int, 64 * 32, 50000>("int");

    static constexpr auto ints = std::views::iota(0, (int)std::numeric_limits<std::uint8_t>::max());

    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint8_t>::max()> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 with size optimization (uint8): Time = {} Size = {}", stop - start, sum);
    }
    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint16_t>::max()> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 with size optimization (uint16): Time = {} Size = {}", stop - start, sum);
    }
    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint8_t>::max(), false> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 without size optimization (std::size_t): Time = {} Size = {}", stop - start, sum);
    }
}
/*
    Output on my PC: 

    Benchmark result with T = int, Size = 2048 (50000 runs) :
    With size opt : 2743 Without size opt : 1174
    Without size opt : 1170 With size opt : 2693

    Benchmark result with T = char, Size = 4032 (50000 runs) :
    With size opt : 5901 Without size opt : 5999
    Without size opt : 6022 With size opt : 6021

    Benchmark result with T = std::string, Size = 2048 (50000 runs) :
    With size opt : 2795 Without size opt : 2274
    Without size opt : 2293 With size opt : 2798

    Benchmark result with T = std::vector<int>, Size = 2048 (50000 runs) :
    With size opt : 3101 Without size opt : 1406
    Without size opt : 1407 With size opt : 3090

    Benchmark result with T = std::vector<int>, Size = 2048 (50000 runs) :
    With size opt : 3087 Without size opt : 1409
    Without size opt : 1409 With size opt : 3086

    Benchmark result with T = std::string, Size = 2048 (50000 runs) :
    With size opt : 2801 Without size opt : 2215
    Without size opt : 2238 With size opt : 2795

    Benchmark result with T = char, Size = 4032 (50000 runs) :
    With size opt : 5893 Without size opt : 5962
    Without size opt : 5938 With size opt : 5883

    Benchmark result with T = int, Size = 2048 (50000 runs) :
    With size opt : 2704 Without size opt : 1166
    Without size opt : 1182 With size opt : 2680

    Benchmark 2 with size optimization(uint8) : Time = 3361000ns Size = 326400000
    Benchmark 2 with size optimization(uint16) : Time = 983900ns Size = 326400000
    Benchmark 2 without size optimization(std::size_t) : Time = 1030500ns Size = 326400000
*/

@jbab
Copy link
Collaborator

jbab commented Jan 14, 2025

I have played around with this a bit, too, on a MacBook M1 with -O3.
With gcc 14 I generally find that std::size_t performs better, especially with std::string and std::vector as element types.
With clang 19 I generally see no difference at all or even a slightly better performing size optimization.
But, most importantly, I see a lot of fluctuation in the measurements, also between runs, so I am not sure what we're actually measuring here.
I might make an attempt with google benchmark and see if I get more consistent results.

@jbab
Copy link
Collaborator

jbab commented Jan 17, 2025

If have written a benchmark using the google benchmark utility.
benchmark.zip

I can no longer see any statistically relevant difference. Please see my attachment for details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants