__smallest_size_t might be a huge pessimization on most arhitectures #53

cristi1990an · 2024-12-23T21:17:34Z

While a sound idea in theory, I fear that this optimization's impact on performance far outweights any benefits it brings in size. In the restricted testing I've done, for repeated calls to push_back, using an uint8_t/uint16_t as the size resulted in more than 10 times worse performance than when using std::size_t. I feel like this is something that should be investigated a bit if the implementation is to be used in a performance-critical context.

wusatosi · 2025-01-11T19:33:09Z

Hi this sounds like a good area to research. I would expect this as well, considering the improvement in memory usage could be quickly voided by alignment. Though I am not expecting performance penalty this would result in.
Can you share your benchmark here?

cristi1990an · 2025-01-11T22:15:18Z

#include <chrono>
#include <print>
#include "inplace_vector.hpp"

struct trivial
{
	trivial() = default;
	trivial(const trivial&) = default;
	trivial& operator=(const trivial&) = default;
};

struct non_trivial
{
	non_trivial() {}
	non_trivial(const non_trivial&) { }
	non_trivial& operator=(const non_trivial&) { return *this; };
};

static_assert(sizeof(beman::inplace_vector<trivial, 4>) == 5);
static_assert(sizeof(beman::inplace_vector<trivial, 4, false>) == 16);
static_assert(sizeof(beman::inplace_vector<non_trivial, 4>) == 5);
static_assert(sizeof(beman::inplace_vector<non_trivial, 4, false>) == 16);

template <typename T, std::size_t Size, bool SizeOptimization>
auto benchmark_loop()
{
	volatile std::size_t sum = 0; // so the below code isn't optimized away
	const auto start = std::chrono::high_resolution_clock::now();

	beman::inplace_vector<T, Size, SizeOptimization> vec;

	for (auto _ : std::views::iota(std::size_t{}, Size))
	{
		sum += vec.size();
		vec.emplace_back();
	}

	const auto stop = std::chrono::high_resolution_clock::now();
	return std::make_pair((stop - start).count(), sum);
}

template <typename T, std::size_t Size, bool SizeOptimization, std::size_t Runs>
auto benchmark()
{
	volatile std::size_t sum = 0;
	for (auto _ : std::views::iota(std::size_t{}, Runs))
	{
		sum += benchmark_loop<T, Size, SizeOptimization>().first;
	}
	return sum / Runs;
}

template <typename T, std::size_t Size, std::size_t Runs>
void display_benchmark(std::string_view type_name)
{
	std::println("Benchmark result with T = {}, Size = {} ({} runs):\n"
		"With size opt: {} Without size opt: {}\n"
		"Without size opt: {} With size opt: {}\n",
		type_name, Size, Runs,
		benchmark<T, Size, true, Runs>(),
		benchmark<T, Size, false, Runs>(),
		benchmark<T, Size, false, Runs>(),
		benchmark<T, Size, true, Runs>());
}

int main()
{
	display_benchmark<int, 64 * 32, 50000>("int");
	display_benchmark<char, 126 * 32, 50000>("char");
	display_benchmark<std::string, 64 * 32, 50000>("std::string");
	display_benchmark<std::vector<int>, 64 * 32, 50000>("std::vector<int>");

	display_benchmark<std::vector<int>, 64 * 32, 50000>("std::vector<int>");
	display_benchmark<std::string, 64 * 32, 50000>("std::string");
	display_benchmark<char, 126 * 32, 50000>("char");
	display_benchmark<int, 64 * 32, 50000>("int");

    static constexpr auto ints = std::views::iota(0, (int)std::numeric_limits<std::uint8_t>::max());

    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint8_t>::max()> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 with size optimization (uint8): Time = {} Size = {}", stop - start, sum);
    }
    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint16_t>::max()> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 with size optimization (uint16): Time = {} Size = {}", stop - start, sum);
    }
    {
        const auto start = std::chrono::high_resolution_clock::now();
        std::size_t sum = 0;
        for (auto _ : std::views::iota(0, 10000))
        {
            beman::inplace_vector<int, std::numeric_limits<std::uint8_t>::max(), false> vec;

            for (int val : ints)
            {
                vec.try_push_back(val);
                sum += vec.size();
            }
        }
        const auto stop = std::chrono::high_resolution_clock::now();
        std::println("Benchmark 2 without size optimization (std::size_t): Time = {} Size = {}", stop - start, sum);
    }
}
/*
    Output on my PC: 

    Benchmark result with T = int, Size = 2048 (50000 runs) :
    With size opt : 2743 Without size opt : 1174
    Without size opt : 1170 With size opt : 2693

    Benchmark result with T = char, Size = 4032 (50000 runs) :
    With size opt : 5901 Without size opt : 5999
    Without size opt : 6022 With size opt : 6021

    Benchmark result with T = std::string, Size = 2048 (50000 runs) :
    With size opt : 2795 Without size opt : 2274
    Without size opt : 2293 With size opt : 2798

    Benchmark result with T = std::vector<int>, Size = 2048 (50000 runs) :
    With size opt : 3101 Without size opt : 1406
    Without size opt : 1407 With size opt : 3090

    Benchmark result with T = std::vector<int>, Size = 2048 (50000 runs) :
    With size opt : 3087 Without size opt : 1409
    Without size opt : 1409 With size opt : 3086

    Benchmark result with T = std::string, Size = 2048 (50000 runs) :
    With size opt : 2801 Without size opt : 2215
    Without size opt : 2238 With size opt : 2795

    Benchmark result with T = char, Size = 4032 (50000 runs) :
    With size opt : 5893 Without size opt : 5962
    Without size opt : 5938 With size opt : 5883

    Benchmark result with T = int, Size = 2048 (50000 runs) :
    With size opt : 2704 Without size opt : 1166
    Without size opt : 1182 With size opt : 2680

    Benchmark 2 with size optimization(uint8) : Time = 3361000ns Size = 326400000
    Benchmark 2 with size optimization(uint16) : Time = 983900ns Size = 326400000
    Benchmark 2 without size optimization(std::size_t) : Time = 1030500ns Size = 326400000
*/

jbab · 2025-01-14T21:46:42Z

I have played around with this a bit, too, on a MacBook M1 with -O3.
With gcc 14 I generally find that std::size_t performs better, especially with std::string and std::vector as element types.
With clang 19 I generally see no difference at all or even a slightly better performing size optimization.
But, most importantly, I see a lot of fluctuation in the measurements, also between runs, so I am not sure what we're actually measuring here.
I might make an attempt with google benchmark and see if I get more consistent results.

jbab · 2025-01-17T13:26:33Z

If have written a benchmark using the google benchmark utility.
benchmark.zip

I can no longer see any statistically relevant difference. Please see my attachment for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

__smallest_size_t might be a huge pessimization on most arhitectures #53

__smallest_size_t might be a huge pessimization on most arhitectures #53

cristi1990an commented Dec 23, 2024

wusatosi commented Jan 11, 2025

cristi1990an commented Jan 11, 2025

jbab commented Jan 14, 2025

jbab commented Jan 17, 2025

__smallest_size_t might be a huge pessimization on most arhitectures #53

__smallest_size_t might be a huge pessimization on most arhitectures #53

Comments

cristi1990an commented Dec 23, 2024

wusatosi commented Jan 11, 2025

cristi1990an commented Jan 11, 2025

jbab commented Jan 14, 2025

jbab commented Jan 17, 2025