
Improve SELECTs performance and add RowCursor::{decoded_bytes,received_bytes} #169

Merged — 3 commits merged into main on Oct 20, 2024

Conversation

@loyd (Collaborator) commented Oct 18, 2024

Summary

This PR introduces the RowCursor::{decoded_bytes,received_bytes} methods, which expose statistics useful for client-side telemetry.
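The relationship between the two counters can be illustrated with a minimal sketch (a hypothetical TransferStats type, not the crate's actual implementation); received_bytes is assumed to count bytes read from the transport and decoded_bytes the bytes after decompression, so their ratio approximates the compression ratio:

```rust
// Illustrative sketch only: a pair of counters mirroring what
// RowCursor::received_bytes() / decoded_bytes() expose. The names are
// assumed to mean "bytes read from the network" vs. "bytes after
// decompression, fed to the deserializer".
#[derive(Default)]
struct TransferStats {
    received_bytes: u64, // bytes read from the HTTP body (possibly compressed)
    decoded_bytes: u64,  // bytes after decompression
}

impl TransferStats {
    fn on_chunk(&mut self, compressed_len: u64, decompressed_len: u64) {
        self.received_bytes += compressed_len;
        self.decoded_bytes += decompressed_len;
    }

    /// Approximate compression ratio — the kind of number a client
    /// would export as telemetry.
    fn compression_ratio(&self) -> f64 {
        if self.received_bytes == 0 {
            1.0
        } else {
            self.decoded_bytes as f64 / self.received_bytes as f64
        }
    }
}
```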

Besides this, the performance of SELECT queries is significantly improved:

  • BufList (which handled rows spanning several HTTP chunks) is replaced with a plain Bytes buffer, with cloning only on the slow path. As a result, RowBinaryDeserializer now accepts &[u8], which is much cheaper to work with than BufList. This change yields the biggest gain (~80%).
  • Complex closure-based logic in RawCursor is removed (~30%).
  • RowBinaryCursor is exposed directly as RowCursor to avoid an extra state machine in the RowCursor::next() future (~5%).
  • Extra cloning of buffers in the lz4 decoder is removed (~7% in the select/lz4 bench).
  • decompress_size_prepended is used to avoid an extra vec![0; size] when lz4_flex is not compiled with the safe-decode feature (~5% in the select/lz4 bench).
  • Bytes are no longer copied for borrowed usage (not covered by benchmarks).
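The fast-path/slow-path idea behind dropping BufList can be sketched as follows (a simplified illustration with a hypothetical ChunkBuffer type, not the crate's actual code): decoding runs directly on each HTTP chunk as a contiguous &[u8], and bytes are copied only when a row straddles a chunk boundary.

```rust
// Hypothetical sketch of "contiguous buffer, copy only on the slow path".
// `pending` holds leftover bytes only when a row straddles two HTTP chunks.
struct ChunkBuffer {
    pending: Vec<u8>,
}

impl ChunkBuffer {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Process one HTTP chunk by calling `decode` on a contiguous slice.
    /// `decode` returns how many bytes it consumed (complete rows only).
    fn feed(&mut self, chunk: &[u8], mut decode: impl FnMut(&[u8]) -> usize) {
        if self.pending.is_empty() {
            // Fast path: the chunk itself is contiguous — no copying.
            let consumed = decode(chunk);
            // Save the incomplete tail, if any, for the next chunk.
            self.pending.extend_from_slice(&chunk[consumed..]);
        } else {
            // Slow path: a row straddles chunks, so glue them with one copy.
            self.pending.extend_from_slice(chunk);
            let consumed = decode(&self.pending);
            self.pending.drain(..consumed);
        }
    }
}
```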
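On the decompress_size_prepended point: lz4_flex's size-prepended format stores the decompressed size as a little-endian u32 ahead of the compressed payload, so the decoder can allocate the output buffer once at its exact size instead of zero-filling a guessed one. A minimal sketch of parsing that prefix (hypothetical helper; the u32 little-endian layout is an assumption stated here, not taken from this PR):

```rust
// Splits a size-prepended frame into (decompressed_size, compressed_payload).
// Assumes the lz4_flex convention: 4-byte little-endian u32 length prefix.
fn split_size_prepended(input: &[u8]) -> Option<(usize, &[u8])> {
    if input.len() < 4 {
        return None; // too short to even hold the length prefix
    }
    let size = u32::from_le_bytes([input[0], input[1], input[2], input[3]]) as usize;
    Some((size, &input[4..]))
}
```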

Also, this is the first step toward exposing RawCursor (#152). RawCursor will be exposed, together with an added async collect(self) -> Result<Bytes> method and examples, in follow-up PRs.

"select" bench

Intel(R) Xeon(R) Gold 6226R @ 2.90 GHz

Before (on 7b599ef):

select/no compression   time:   [33.485 ns 33.565 ns 33.649 ns]
                        thrpt:  [680.19 MiB/s 681.90 MiB/s 683.53 MiB/s]
select/lz4              time:   [37.391 ns 37.448 ns 37.509 ns]
                        thrpt:  [610.20 MiB/s 611.19 MiB/s 612.13 MiB/s]

After:

select/no compression   time:   [14.051 ns 14.096 ns 14.146 ns]
                        thrpt:  [1.5801 GiB/s 1.5857 GiB/s 1.5908 GiB/s]
                 change:
                        time:   [-57.932% -57.576% -57.199%] (p = 0.00 < 0.05)
                        thrpt:  [+133.64% +135.71% +137.71%]
                        Performance has improved.
select/lz4              time:   [17.116 ns 17.182 ns 17.253 ns]
                        thrpt:  [1.2955 GiB/s 1.3009 GiB/s 1.3059 GiB/s]
                 change:
                        time:   [-54.684% -54.240% -53.801%] (p = 0.00 < 0.05)
                        thrpt:  [+116.46% +118.53% +120.67%]
                        Performance has improved.

AMD Ryzen 7 4800HS @ 2.90 GHz

Before (on 7b599ef):

select/no compression   time:   [37.901 ns 37.981 ns 38.076 ns]
                        thrpt:  [601.12 MiB/s 602.63 MiB/s 603.90 MiB/s]
select/lz4              time:   [41.244 ns 41.318 ns 41.401 ns]
                        thrpt:  [552.85 MiB/s 553.95 MiB/s 554.95 MiB/s]

After:

select/no compression   time:   [15.214 ns 15.288 ns 15.368 ns]
                        thrpt:  [1.4545 GiB/s 1.4621 GiB/s 1.4691 GiB/s]
                 change:
                        time:   [-60.124% -59.455% -58.770%] (p = 0.00 < 0.05)
                        thrpt:  [+142.54% +146.64% +150.78%]
                        Performance has improved.
select/lz4              time:   [17.214 ns 17.279 ns 17.353 ns]
                        thrpt:  [1.2880 GiB/s 1.2936 GiB/s 1.2984 GiB/s]
                 change:
                        time:   [-58.850% -58.060% -57.372%] (p = 0.00 < 0.05)
                        thrpt:  [+134.59% +138.43% +143.01%]
                        Performance has improved.

"select_numbers" bench

CPU: AMD Ryzen 7 4800HS @ 2.90 GHz
CH: 24.9.2 (localhost, docker)

Before (on 7b599ef):

compress  elapsed
    none   6.257s
     lz4   8.301s

(worst result of five runs)

After:

compress  elapsed  throughput  received
    none   2.868s  1330 MiB/s  3815 MiB
     lz4   6.301s   605 MiB/s  1908 MiB

(worst result of five runs)

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided so that we can include it in CHANGELOG later
  • For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@Arjentix commented
I've tested this branch on my query.

Row structure looks something like this:

// Derives omitted in the original comment; with the clickhouse crate this
// struct would presumably also derive clickhouse::Row and serde::Deserialize.
pub struct Row {
    pub id: u32,
    pub string_1: String,
    pub string_2: String,
    pub names: Vec<String>,
    pub kinds: Vec<String>,
    pub templates: Vec<String>,
    pub states: Vec<String>,
    pub some_stuff_names: Vec<Vec<String>>,
    pub some_stuff_contents: Vec<Vec<bytes::Bytes>>,
}

Most of the time it worked well even before this branch, but then the data grew considerably and queries became too slow. Here are the details for a query like SELECT * FROM table WHERE id IN [12345], which loads a lot of data.
I compare fix/mock-with-advanced-time (I think there is no difference from master, so I'll call it master) against feat/query_stats:

master: ~37s
feat/query_stats: ~8s
clickhouse client: ~6.5s

So this is much closer to the native client's time, but there is still room for optimization.

serprex previously approved these changes Oct 19, 2024

@loyd (Collaborator, Author) commented Oct 19, 2024

@serprex, the review was dismissed after changing CHANGELOG.md =( Can you approve the PR again?

@loyd loyd merged commit b4cbb39 into main Oct 20, 2024
6 checks passed
3 participants