Reduce retained memory use #280

Closed
wants to merge 3 commits into from

Conversation

aaronbee
Collaborator

Change deserialization of HBase responses to not retain any memory from the read buffer. This allows reusing read buffers.
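The pattern being described, as a minimal sketch (names are illustrative and do not correspond to gohbase's actual code): copy each cell field out of the shared read buffer so that no slice of that buffer escapes, which is what makes it safe to reuse the buffer afterwards.

```go
package main

import "fmt"

// copyField copies one cell field out of a shared read buffer instead of
// retaining a sub-slice of it, so the buffer can be reused afterwards.
// Illustrative sketch only; names do not correspond to gohbase's code.
func copyField(buf []byte, off, length int) []byte {
	field := make([]byte, length)    // one small allocation per field
	copy(field, buf[off:off+length]) // no reference to buf escapes
	return field
}

func main() {
	readBuf := []byte("row1:cf:qual:value") // stands in for a pooled read buffer
	v := copyField(readBuf, 13, 5)          // copy "value" out of the buffer
	fmt.Printf("%s\n", v)                   // readBuf can now be reused safely
}
```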

@dethi
Collaborator

dethi commented Dec 27, 2024

The trade-off here is a lot more small allocations (one per cell) in order to be able to reuse the received buffers.

I think this needs more than a benchmark on the isolated function: the benchmark will show a positive result, but actual gohbase usage that processes the results as a whole may be negatively impacted by the increased allocation rate.
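For reference, a self-contained sketch of the trade-off being raised here (sizes are made up; this is not gohbase code): sub-slicing does one large allocation that stays retained, while copying does one small allocation per cell but lets the big buffer be reused.

```go
package cells

import "testing"

// Illustrative comparison of the two strategies; sizes are invented.
const numCells, cellSize = 1000, 64

// BenchmarkSubslice retains sub-slices of one big buffer: few
// allocations, but the whole buffer stays live as long as any cell does.
func BenchmarkSubslice(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, numCells*cellSize)
		out := make([][]byte, numCells)
		for j := range out {
			out[j] = buf[j*cellSize : (j+1)*cellSize]
		}
		_ = out
	}
}

// BenchmarkCopy copies each cell out: many small allocations, but the
// read buffer could be pooled and reused.
func BenchmarkCopy(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, numCells*cellSize)
		out := make([][]byte, numCells)
		for j := range out {
			out[j] = append([]byte(nil), buf[j*cellSize:(j+1)*cellSize]...)
		}
		_ = out
	}
}
```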

@aaronbee
Collaborator Author

> The trade-off here is a lot more small allocations (one per cell) in order to be able to reuse the received buffers.
>
> I think this needs more than a benchmark on the isolated function: the benchmark will show a positive result, but actual gohbase usage that processes the results as a whole may be negatively impacted by the increased allocation rate.

Yeah, we are doing some testing now on a real cluster and I would like to write some more tests/benchmarks. Some thoughts:

  1. We are already doing 3-4 allocations per cell thanks to the protobuf Cell value that is returned; this adds just 1 more. And we are seeing on real clusters that the Cell struct and its other fields allocate more (in size as well as count) than the []byte being created here! I would love to remove this Cell struct from our return, but that would be a change to the API.

  2. I think this is the right thing to do anyway. The existing code leaks the entire response buffer we read from HBase (technically, whether it's the whole response buffer or just the cells depends on whether we are decompressing); see the sketch below. This is just wrong, IMO.
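A short, self-contained illustration of the retention problem in point 2 (sizes are made up; this is not gohbase code): a Go slice header points into its backing array, so keeping even one small cell value alive keeps the whole response buffer alive.

```go
package main

func main() {
	// Pretend this is a 2 MB response read from HBase.
	resp := make([]byte, 2<<20)

	// Sub-slicing keeps a pointer into resp's backing array: this
	// 16-byte "cell" pins the entire 2 MB buffer in memory.
	cell := resp[100:116]

	// Copying breaks the link at the cost of one small allocation;
	// only 16 bytes stay retained and resp can be reused or freed.
	cellCopy := append([]byte(nil), resp[100:116]...)

	_, _ = cell, cellCopy
}
```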

Ensure that no buffer memory is retained when deserializing cell blocks.

This does add one allocation to cellFromCellBlock, which makes it
slower, but we are already allocating 3 times in this function due to
the protobuf structs. More importantly, this enables an optimization
in the region client: reusing read buffers.

goos: darwin
goarch: arm64
pkg: github.com/tsuna/gohbase/hrpc
cpu: Apple M4 Pro
                             │ before.txt  │              after.txt              │
                             │   sec/op    │   sec/op     vs base                │
DeserializeCellBlocks/1-14     51.97n ± 1%   74.60n ± 1%  +43.53% (p=0.000 n=10)
DeserializeCellBlocks/100-14   4.160µ ± 1%   6.249µ ± 1%  +50.22% (p=0.000 n=10)
geomean                        465.0n        682.8n       +46.84%

                             │  before.txt  │              after.txt               │
                             │     B/op     │     B/op      vs base                │
DeserializeCellBlocks/1-14       200.0 ± 0%     232.0 ± 0%  +16.00% (p=0.000 n=10)
DeserializeCellBlocks/100-14   19.62Ki ± 0%   22.75Ki ± 0%  +15.92% (p=0.000 n=10)
geomean                        1.958Ki        2.270Ki       +15.96%

                             │ before.txt │             after.txt              │
                             │ allocs/op  │ allocs/op   vs base                │
DeserializeCellBlocks/1-14     4.000 ± 0%   5.000 ± 0%  +25.00% (p=0.000 n=10)
DeserializeCellBlocks/100-14   301.0 ± 0%   401.0 ± 0%  +33.22% (p=0.000 n=10)
geomean                        34.70        44.78       +29.05%

Reuse buffers for reading responses from HBase, both for the
compressed and the decompressed response. This should reduce
memory pressure when reading lots of data.
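A common way to implement this kind of reuse in Go is a sync.Pool of byte slices; a minimal sketch, assuming a length-delimited response whose size is known up front (names and sizing policy are illustrative, not the PR's actual code):

```go
package bufferpool

import (
	"io"
	"sync"
)

// bufPool holds reusable read buffers. The 64 KiB seed capacity is an
// illustrative choice, not the PR's actual policy.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 64<<10) },
}

// readResponse reads an n-byte response into a pooled buffer. Because
// deserialization no longer retains slices of this buffer, the caller
// can hand it back with putBuf once the response has been decoded.
func readResponse(r io.Reader, n int) ([]byte, error) {
	buf := bufPool.Get().([]byte)
	if cap(buf) < n {
		buf = make([]byte, n) // grow; the larger buffer is pooled on return
	}
	buf = buf[:n]
	if _, err := io.ReadFull(r, buf); err != nil {
		putBuf(buf)
		return nil, err
	}
	return buf, nil
}

// putBuf returns a buffer to the pool, resetting its length but
// keeping its capacity for the next read.
func putBuf(buf []byte) { bufPool.Put(buf[:0]) }
```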
@aaronbee
Collaborator Author

aaronbee commented Jan 1, 2025

I added a benchmark of the changes and the results don't look that promising. It's possible that my inputs for the test are not fully representative.

goos: darwin
goarch: arm64
pkg: github.com/tsuna/gohbase/region
cpu: Apple M4 Pro
                                        │ before.txt  │            poolbytes.txt            │
                                        │   sec/op    │   sec/op     vs base                │
Receive/mutate-14                         153.0n ± 0%   186.0n ± 1%  +21.57% (p=0.000 n=10)
Receive/multiMutate100-14                 11.98µ ± 2%   11.87µ ± 1%        ~ (p=0.271 n=10)
Receive/scanResult2MBwithCompression-14   522.2µ ± 0%   703.0µ ± 1%  +34.61% (p=0.000 n=10)
geomean                                   9.854µ        11.58µ       +17.48%

                                        │  before.txt  │            poolbytes.txt             │
                                        │     B/op     │     B/op      vs base                │
Receive/mutate-14                           140.0 ± 0%     160.0 ± 0%  +14.29% (p=0.000 n=10)
Receive/multiMutate100-14                 26.85Ki ± 0%   26.26Ki ± 0%   -2.17% (p=0.000 n=10)
Receive/scanResult2MBwithCompression-14   3.458Mi ± 0%   3.317Mi ± 0%   -4.08% (p=0.000 n=10)
geomean                                   23.51Ki        24.07Ki        +2.36%

                                        │ before.txt  │             poolbytes.txt             │
                                        │  allocs/op  │  allocs/op   vs base                  │
Receive/mutate-14                          5.000 ± 0%    5.000 ± 0%        ~ (p=1.000 n=10) ¹
Receive/multiMutate100-14                  415.0 ± 0%    415.0 ± 0%        ~ (p=1.000 n=10) ¹
Receive/scanResult2MBwithCompression-14   21.34k ± 0%   28.45k ± 0%  +33.32% (p=0.000 n=10)
geomean                                    353.8         389.4       +10.06%
¹ all samples are equal

Receive/scanResult2MBwithCompression is the case I was really trying to optimize. It represents a large (2 MB uncompressed) scan response, and the goal of these changes was to reduce the memory allocated when handling it. There is, however, only a 4% reduction in bytes allocated, at a 35% cost in CPU time and a 33% increase in the number of allocations.

Given the small improvement in bytes allocated, we probably don't want to keep pursuing this change. It does still have the advantage that each response gets its own heap allocation and doesn't retain pointers to other responses.

@aaronbee
Collaborator Author

aaronbee commented Jan 3, 2025

Dropping this PR because of the lackluster benchmarking results. I'll start a new PR to add the receive benchmark.

@aaronbee aaronbee closed this Jan 3, 2025
@aaronbee aaronbee deleted the buffers branch January 3, 2025 21:21