
Improve SELECTs performance and add RowCursor::{decoded_bytes,received_bytes} #169

Merged — 3 commits merged into main on Oct 20, 2024

Conversation

@loyd (Collaborator) commented Oct 18, 2024

Summary

This PR introduces the RowCursor::{decoded_bytes,received_bytes} methods, which expose statistics useful for client-side telemetry.
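The relationship between the two counters can be illustrated with a minimal sketch (a hypothetical TransferStats type, not the crate's actual implementation); received_bytes is assumed to count bytes read from the transport and decoded_bytes the bytes after decompression, so their ratio approximates the compression ratio:

```rust
// Illustrative sketch only: a pair of counters mirroring what
// RowCursor::received_bytes() / decoded_bytes() expose. The names are
// assumed to mean "bytes read from the network" vs. "bytes after
// decompression, fed to the deserializer".
#[derive(Default)]
struct TransferStats {
    received_bytes: u64, // bytes read from the HTTP body (possibly compressed)
    decoded_bytes: u64,  // bytes after decompression
}

impl TransferStats {
    fn on_chunk(&mut self, compressed_len: u64, decompressed_len: u64) {
        self.received_bytes += compressed_len;
        self.decoded_bytes += decompressed_len;
    }

    /// Approximate compression ratio — the kind of number a client
    /// would export as telemetry.
    fn compression_ratio(&self) -> f64 {
        if self.received_bytes == 0 {
            1.0
        } else {
            self.decoded_bytes as f64 / self.received_bytes as f64
        }
    }
}
```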

Besides this, the performance of SELECT queries is significantly improved:

  • BufList (which handled rows spanning several HTTP chunks) is replaced with a plain Bytes buffer, with cloning only on the slow path. As a result, RowBinaryDeserializer now accepts &[u8], which is much cheaper to work with than BufList. This change yields the biggest gain (~80%).
  • Complex closure-based logic in RawCursor is removed (~30%).
  • RowBinaryCursor is exposed directly as RowCursor to avoid an extra state machine in the RowCursor::next() future (~5%).
  • Extra cloning of buffers in the lz4 decoder is removed (~7% in the select/lz4 bench).
  • decompress_size_prepended is used to avoid an extra vec![0; size] when lz4_flex is not compiled with the safe-decode feature (~5% in the select/lz4 bench).
  • Bytes are no longer copied for borrowed usage (not covered by benchmarks).
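The fast-path/slow-path idea behind dropping BufList can be sketched as follows (a simplified illustration with a hypothetical ChunkBuffer type, not the crate's actual code): decoding runs directly on each HTTP chunk as a contiguous &[u8], and bytes are copied only when a row straddles a chunk boundary.

```rust
// Hypothetical sketch of "contiguous buffer, copy only on the slow path".
// `pending` holds leftover bytes only when a row straddles two HTTP chunks.
struct ChunkBuffer {
    pending: Vec<u8>,
}

impl ChunkBuffer {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Process one HTTP chunk by calling `decode` on a contiguous slice.
    /// `decode` returns how many bytes it consumed (complete rows only).
    fn feed(&mut self, chunk: &[u8], mut decode: impl FnMut(&[u8]) -> usize) {
        if self.pending.is_empty() {
            // Fast path: the chunk itself is contiguous — no copying.
            let consumed = decode(chunk);
            // Save the incomplete tail, if any, for the next chunk.
            self.pending.extend_from_slice(&chunk[consumed..]);
        } else {
            // Slow path: a row straddles chunks, so glue them with one copy.
            self.pending.extend_from_slice(chunk);
            let consumed = decode(&self.pending);
            self.pending.drain(..consumed);
        }
    }
}
```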
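On the decompress_size_prepended point: lz4_flex's size-prepended format stores the decompressed size as a little-endian u32 ahead of the compressed payload, so the decoder can allocate the output buffer once at its exact size instead of zero-filling a guessed one. A minimal sketch of parsing that prefix (hypothetical helper; the u32 little-endian layout is an assumption stated here, not taken from this PR):

```rust
// Splits a size-prepended frame into (decompressed_size, compressed_payload).
// Assumes the lz4_flex convention: 4-byte little-endian u32 length prefix.
fn split_size_prepended(input: &[u8]) -> Option<(usize, &[u8])> {
    if input.len() < 4 {
        return None; // too short to even hold the length prefix
    }
    let size = u32::from_le_bytes([input[0], input[1], input[2], input[3]]) as usize;
    Some((size, &input[4..]))
}
```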

Also, this is the first step toward exposing RawCursor (#152). RawCursor will be exposed, together with an added async collect(self) -> Result<Bytes> method and examples, in follow-up PRs.

"select" bench

Intel(R) Xeon(R) Gold 6226R @ 2.90 GHz

Before (on 7b599ef):

select/no compression   time:   [33.485 ns 33.565 ns 33.649 ns]
                        thrpt:  [680.19 MiB/s 681.90 MiB/s 683.53 MiB/s]
select/lz4              time:   [37.391 ns 37.448 ns 37.509 ns]
                        thrpt:  [610.20 MiB/s 611.19 MiB/s 612.13 MiB/s]

After:

select/no compression   time:   [14.051 ns 14.096 ns 14.146 ns]
                        thrpt:  [1.5801 GiB/s 1.5857 GiB/s 1.5908 GiB/s]
                 change:
                        time:   [-57.932% -57.576% -57.199%] (p = 0.00 < 0.05)
                        thrpt:  [+133.64% +135.71% +137.71%]
                        Performance has improved.
select/lz4              time:   [17.116 ns 17.182 ns 17.253 ns]
                        thrpt:  [1.2955 GiB/s 1.3009 GiB/s 1.3059 GiB/s]
                 change:
                        time:   [-54.684% -54.240% -53.801%] (p = 0.00 < 0.05)
                        thrpt:  [+116.46% +118.53% +120.67%]
                        Performance has improved.

AMD Ryzen 7 4800HS @ 2.90 GHz

Before (on 7b599ef):

select/no compression   time:   [37.901 ns 37.981 ns 38.076 ns]
                        thrpt:  [601.12 MiB/s 602.63 MiB/s 603.90 MiB/s]
select/lz4              time:   [41.244 ns 41.318 ns 41.401 ns]
                        thrpt:  [552.85 MiB/s 553.95 MiB/s 554.95 MiB/s]

After:

select/no compression   time:   [15.214 ns 15.288 ns 15.368 ns]
                        thrpt:  [1.4545 GiB/s 1.4621 GiB/s 1.4691 GiB/s]
                 change:
                        time:   [-60.124% -59.455% -58.770%] (p = 0.00 < 0.05)
                        thrpt:  [+142.54% +146.64% +150.78%]
                        Performance has improved.
select/lz4              time:   [17.214 ns 17.279 ns 17.353 ns]
                        thrpt:  [1.2880 GiB/s 1.2936 GiB/s 1.2984 GiB/s]
                 change:
                        time:   [-58.850% -58.060% -57.372%] (p = 0.00 < 0.05)
                        thrpt:  [+134.59% +138.43% +143.01%]
                        Performance has improved.

"select_numbers" bench

CPU: AMD Ryzen 7 4800HS @ 2.90 GHz
CH: 24.9.2 (localhost, docker)

Before (on 7b599ef):

compress  elapsed
    none   6.257s
     lz4   8.301s

(worst result of five runs)

After:

compress  elapsed  throughput  received
    none   2.868s  1330 MiB/s  3815 MiB
     lz4   6.301s   605 MiB/s  1908 MiB

(worst result of five runs)

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided so that we can include it in CHANGELOG later
  • For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@Arjentix commented
I've tested this branch on my query.

Row structure looks something like this:

// Derives omitted in the original comment; with the clickhouse crate this
// struct would presumably also derive clickhouse::Row and serde::Deserialize.
pub struct Row {
    pub id: u32,
    pub string_1: String,
    pub string_2: String,
    pub names: Vec<String>,
    pub kinds: Vec<String>,
    pub templates: Vec<String>,
    pub states: Vec<String>,
    pub some_stuff_names: Vec<Vec<String>>,
    pub some_stuff_contents: Vec<Vec<bytes::Bytes>>,
}

Most of the time it worked well even before this branch, but then the data grew considerably and queries became too slow. Here are the details for a query like SELECT * FROM table WHERE id IN [12345], which loads a lot of data.
I compare fix/mock-with-advanced-time (I think there is no difference from master, so I'll call it master) against feat/query_stats:

master: ~37s
feat/query_stats: ~8s
clickhouse client: ~6.5s

So this is much closer to the native client's time, but there is still room for optimization.

serprex previously approved these changes Oct 19, 2024

@loyd (Collaborator, Author) commented Oct 19, 2024

@serprex, the review was dismissed after changing CHANGELOG.md =( Can you approve the PR again?

@loyd loyd merged commit b4cbb39 into main Oct 20, 2024
6 checks passed
3 participants