Strpos and contains performance improvements #14211

Omega359 · 2025-01-20T16:34:28Z

Which issue does this PR close?

Closes Potential performance improvement in contains and strpos functions #14210

Rationale for this change

Performance.

What changes are included in this PR?

Updates the contains and strpos functions to use stringzilla.

Are these changes tested?

Yes, via existing tests.

Are there any user-facing changes?

No.


Omega359 · 2025-01-20T17:43:39Z

I would really appreciate it if reviewers could verify the performance improvements on any hardware they have available (Mac, deployment servers, etc.) using the cargo benchmarks listed in #14210.

alamb

Thanks @Omega359 -- it is really cool to see this level of attention to performance.

I think @samuelcolvin (and maybe @adriangb ) did some pretty hard core optimization over in arrow-rs recently on

alamb · 2025-01-23T23:08:33Z

.github/workflows/rust.yml

@@ -286,11 +286,17 @@ jobs:
        uses: ./.github/actions/setup-builder
        with:
          rust-version: stable
+      - name: Install emscripten dependency


What is this needed for?

alamb · 2025-01-23T23:18:16Z

datafusion/functions/Cargo.toml

 rand = { workspace = true }
 regex = { workspace = true, optional = true }
 sha2 = { version = "^0.10.1", optional = true }
+stringzilla = { version = "3", optional = true }


My biggest concern is adding this new library as a dependency. I haven't looked at it in detail, but I think we need to evaluate how likely it is to be maintained, etc.

https://github.com/ashvardanian/stringzilla looks pretty solid, though, after a quick look.

Also, I think we should be planning to upstream as many of the low level operations (like string contains) as possible to arrow-rs (so we can have a larger user base to help improve them)

I also think @samuelcolvin's experience was that the different implementations were faster or slower depending on the architecture and cache size. So unless we have significant evidence that this is faster on most or all platforms, I would be hesitant to include it.
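One way to gather that kind of per-platform evidence is a small harness that first checks two implementations agree and then times both on the machine at hand. The sketch below is illustrative only: contains_naive is a hypothetical stand-in for an alternative implementation (it is not stringzilla and not code from this PR), strpos_std assumes SQL-style semantics (1-based character position, 0 when absent), and the generated input data is made up. Real measurements should come from the cargo benchmarks referenced in #14210.

```rust
use std::time::{Duration, Instant};

/// Baseline `contains`: delegates to the standard library's substring search.
fn contains_std(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}

/// Hypothetical stand-in for an alternative implementation: a naive
/// byte-window scan. Byte-level matching is safe for valid UTF-8 input.
fn contains_naive(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() {
        return true; // `windows(0)` would panic, so handle this case first
    }
    h.windows(n.len()).any(|w| w == n)
}

/// Assumed SQL-style `strpos`: 1-based character position of the first
/// match, or 0 when the needle does not occur.
fn strpos_std(haystack: &str, needle: &str) -> i32 {
    match haystack.find(needle) {
        // `find` returns a byte offset; convert it to a character count.
        Some(byte_idx) => haystack[..byte_idx].chars().count() as i32 + 1,
        None => 0,
    }
}

/// Run `f` over every haystack and return (match count, elapsed time).
fn time_it(f: fn(&str, &str) -> bool, haystacks: &[String], needle: &str) -> (usize, Duration) {
    let start = Instant::now();
    let hits = haystacks.iter().filter(|h| f(h.as_str(), needle)).count();
    (hits, start.elapsed())
}

fn main() {
    // Synthetic input; a real comparison should use representative data.
    let haystacks: Vec<String> = (0..100_000)
        .map(|i| format!("row-{i}-payload-{}", i % 7))
        .collect();
    let needle = "payload-3";

    let (hits_std, t_std) = time_it(contains_std, &haystacks, needle);
    let (hits_naive, t_naive) = time_it(contains_naive, &haystacks, needle);

    // Correctness first: both implementations must agree before timing matters.
    assert_eq!(hits_std, hits_naive);
    println!("std:   {hits_std} hits in {t_std:?}");
    println!("naive: {hits_naive} hits in {t_naive:?}");
    println!("strpos example: {}", strpos_std("hello", "l"));
}
```

A single timed pass like this is noisy and only hints at a difference; a proper benchmark harness with warm-up and repeated sampling, run in release mode on each target platform, is still needed before drawing conclusions.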

alamb · 2025-01-23T23:20:27Z

datafusion/functions/benches/contains.rs

@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one


Would it be possible to break this benchmark into its own PR?

This is mostly a selfish ask, as I have scripts that compare performance against main, and they only work if the benchmark is on main as well.

Omega359 added 6 commits January 20, 2025 02:52

Update contains and strpos functions to use stringzilla to improve performance.

c9d6f45

Install clang for stringzilla build under wasm.

dfbe693

Cargo lock update for cli.

765e440

Adding build-essential to get stdio.h header.

1d86702

Adding libc6-dev to get stdio.h header.

d6a8bb6

Trying with emcc

7827b08

github-actions bot added the development-process and functions labels Jan 20, 2025

Omega359 marked this pull request as ready for review January 21, 2025 13:54

alamb reviewed Jan 23, 2025


Omega359 closed this Jan 24, 2025
