Reduce retained memory use #280

Closed
wants to merge 3 commits into from

Conversation

aaronbee
Collaborator

Change deserialization of HBase responses to not retain any memory from the read buffer. This allows reusing read buffers.
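The pattern being described, as a minimal sketch (names are illustrative and do not correspond to gohbase's actual code): copy each cell field out of the shared read buffer so that no slice of that buffer escapes, which is what makes it safe to reuse the buffer afterwards.

```go
package main

import "fmt"

// copyField copies one cell field out of a shared read buffer instead of
// retaining a sub-slice of it, so the buffer can be reused afterwards.
// Illustrative sketch only; names do not correspond to gohbase's code.
func copyField(buf []byte, off, length int) []byte {
	field := make([]byte, length)    // one small allocation per field
	copy(field, buf[off:off+length]) // no reference to buf escapes
	return field
}

func main() {
	readBuf := []byte("row1:cf:qual:value") // stands in for a pooled read buffer
	v := copyField(readBuf, 13, 5)          // copy "value" out of the buffer
	fmt.Printf("%s\n", v)                   // readBuf can now be reused safely
}
```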

@dethi
Collaborator

dethi commented Dec 27, 2024

The trade-off here is a lot more small allocations (one per cell) in order to be able to reuse the received buffers.

I think this needs more than a benchmark on the isolated function: the benchmark will show a positive result, but actual gohbase usage that processes the results as a whole may be negatively impacted by the increased allocation rate.
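For reference, a self-contained sketch of the trade-off being raised here (sizes are made up; this is not gohbase code): sub-slicing does one large allocation that stays retained, while copying does one small allocation per cell but lets the big buffer be reused.

```go
package cells

import "testing"

// Illustrative comparison of the two strategies; sizes are invented.
const numCells, cellSize = 1000, 64

// BenchmarkSubslice retains sub-slices of one big buffer: few
// allocations, but the whole buffer stays live as long as any cell does.
func BenchmarkSubslice(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, numCells*cellSize)
		out := make([][]byte, numCells)
		for j := range out {
			out[j] = buf[j*cellSize : (j+1)*cellSize]
		}
		_ = out
	}
}

// BenchmarkCopy copies each cell out: many small allocations, but the
// read buffer could be pooled and reused.
func BenchmarkCopy(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, numCells*cellSize)
		out := make([][]byte, numCells)
		for j := range out {
			out[j] = append([]byte(nil), buf[j*cellSize:(j+1)*cellSize]...)
		}
		_ = out
	}
}
```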

@aaronbee
Collaborator Author

> The trade-off here is a lot more small allocations (one per cell) in order to be able to reuse the received buffers.
>
> I think this needs more than a benchmark on the isolated function: the benchmark will show a positive result, but actual gohbase usage that processes the results as a whole may be negatively impacted by the increased allocation rate.

Yeah, we are doing some testing now on a real cluster and I would like to write some more tests/benchmarks. Some thoughts:

  1. We are already doing 3-4 allocations per cell thanks to the protobuf Cell value that is returned; this adds just 1 more. And we are seeing on real clusters that the Cell struct and its other fields allocate more (in size as well as count) than the []byte being created here! I would love to remove this Cell struct from our return, but that would be a change to the API.

  2. I think this is the right thing to do anyway. The existing code leaks the entire response buffer we read from HBase (technically, whether it's the whole response buffer or just the cells depends on whether we are decompressing); see the sketch below. This is just wrong, IMO.
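A short, self-contained illustration of the retention problem in point 2 (sizes are made up; this is not gohbase code): a Go slice header points into its backing array, so keeping even one small cell value alive keeps the whole response buffer alive.

```go
package main

func main() {
	// Pretend this is a 2 MB response read from HBase.
	resp := make([]byte, 2<<20)

	// Sub-slicing keeps a pointer into resp's backing array: this
	// 16-byte "cell" pins the entire 2 MB buffer in memory.
	cell := resp[100:116]

	// Copying breaks the link at the cost of one small allocation;
	// only 16 bytes stay retained and resp can be reused or freed.
	cellCopy := append([]byte(nil), resp[100:116]...)

	_, _ = cell, cellCopy
}
```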

Ensure that no buffer memory is retained when deserializing cell blocks.

This does add one allocation to cellFromCellBlock, which makes it
slower, but we are already allocating 3 times in this function due to
the protobuf structs. More importantly, this enables an optimization
in the region client: reusing read buffers.

goos: darwin
goarch: arm64
pkg: github.com/tsuna/gohbase/hrpc
cpu: Apple M4 Pro
                             │ before.txt  │              after.txt              │
                             │   sec/op    │   sec/op     vs base                │
DeserializeCellBlocks/1-14     51.97n ± 1%   74.60n ± 1%  +43.53% (p=0.000 n=10)
DeserializeCellBlocks/100-14   4.160µ ± 1%   6.249µ ± 1%  +50.22% (p=0.000 n=10)
geomean                        465.0n        682.8n       +46.84%

                             │  before.txt  │              after.txt               │
                             │     B/op     │     B/op      vs base                │
DeserializeCellBlocks/1-14       200.0 ± 0%     232.0 ± 0%  +16.00% (p=0.000 n=10)
DeserializeCellBlocks/100-14   19.62Ki ± 0%   22.75Ki ± 0%  +15.92% (p=0.000 n=10)
geomean                        1.958Ki        2.270Ki       +15.96%

                             │ before.txt │             after.txt              │
                             │ allocs/op  │ allocs/op   vs base                │
DeserializeCellBlocks/1-14     4.000 ± 0%   5.000 ± 0%  +25.00% (p=0.000 n=10)
DeserializeCellBlocks/100-14   301.0 ± 0%   401.0 ± 0%  +33.22% (p=0.000 n=10)
geomean                        34.70        44.78       +29.05%

Reuse buffers for reading responses from HBase, both for the
compressed and the decompressed response. This should reduce
memory pressure when reading lots of data.
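A common way to implement this kind of reuse in Go is a sync.Pool of byte slices; a minimal sketch, assuming a length-delimited response whose size is known up front (names and sizing policy are illustrative, not the PR's actual code):

```go
package bufferpool

import (
	"io"
	"sync"
)

// bufPool holds reusable read buffers. The 64 KiB seed capacity is an
// illustrative choice, not the PR's actual policy.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 64<<10) },
}

// readResponse reads an n-byte response into a pooled buffer. Because
// deserialization no longer retains slices of this buffer, the caller
// can hand it back with putBuf once the response has been decoded.
func readResponse(r io.Reader, n int) ([]byte, error) {
	buf := bufPool.Get().([]byte)
	if cap(buf) < n {
		buf = make([]byte, n) // grow; the larger buffer is pooled on return
	}
	buf = buf[:n]
	if _, err := io.ReadFull(r, buf); err != nil {
		putBuf(buf)
		return nil, err
	}
	return buf, nil
}

// putBuf returns a buffer to the pool, resetting its length but
// keeping its capacity for the next read.
func putBuf(buf []byte) { bufPool.Put(buf[:0]) }
```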
@aaronbee
Collaborator Author

aaronbee commented Jan 1, 2025

I added a benchmark of the changes and the results don't look that promising. It's possible that my inputs for the test are not fully representative.

goos: darwin
goarch: arm64
pkg: github.com/tsuna/gohbase/region
cpu: Apple M4 Pro
                                        │ before.txt  │            poolbytes.txt            │
                                        │   sec/op    │   sec/op     vs base                │
Receive/mutate-14                         153.0n ± 0%   186.0n ± 1%  +21.57% (p=0.000 n=10)
Receive/multiMutate100-14                 11.98µ ± 2%   11.87µ ± 1%        ~ (p=0.271 n=10)
Receive/scanResult2MBwithCompression-14   522.2µ ± 0%   703.0µ ± 1%  +34.61% (p=0.000 n=10)
geomean                                   9.854µ        11.58µ       +17.48%

                                        │  before.txt  │            poolbytes.txt             │
                                        │     B/op     │     B/op      vs base                │
Receive/mutate-14                           140.0 ± 0%     160.0 ± 0%  +14.29% (p=0.000 n=10)
Receive/multiMutate100-14                 26.85Ki ± 0%   26.26Ki ± 0%   -2.17% (p=0.000 n=10)
Receive/scanResult2MBwithCompression-14   3.458Mi ± 0%   3.317Mi ± 0%   -4.08% (p=0.000 n=10)
geomean                                   23.51Ki        24.07Ki        +2.36%

                                        │ before.txt  │             poolbytes.txt             │
                                        │  allocs/op  │  allocs/op   vs base                  │
Receive/mutate-14                          5.000 ± 0%    5.000 ± 0%        ~ (p=1.000 n=10) ¹
Receive/multiMutate100-14                  415.0 ± 0%    415.0 ± 0%        ~ (p=1.000 n=10) ¹
Receive/scanResult2MBwithCompression-14   21.34k ± 0%   28.45k ± 0%  +33.32% (p=0.000 n=10)
geomean                                    353.8         389.4       +10.06%
¹ all samples are equal

Receive/scanResult2MBwithCompression is the case I was really trying to optimize. It represents a large (2 MB uncompressed) scan response, and the goal of these changes was to reduce the memory allocated when handling it. There is, however, only a 4% reduction in bytes allocated, at a 35% cost in CPU time and a 33% increase in the number of allocations.

Given the small improvement in bytes allocated, we probably don't want to keep pursuing this change. It does still have the advantage that each response gets its own heap allocation and doesn't retain pointers to other responses.

@aaronbee
Collaborator Author

aaronbee commented Jan 3, 2025

Dropping this PR because of the lackluster benchmarking results. I'll start a new PR to add the receive benchmark.

@aaronbee aaronbee closed this Jan 3, 2025
@aaronbee aaronbee deleted the buffers branch January 3, 2025 21:21