Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strpos and contains performance improvements #14211

Closed
wants to merge 6 commits into from

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Jan 20, 2025

Which issue does this PR close?

Rationale for this change

Perf.

What changes are included in this PR?

Code.

Are these changes tested?

Yes, via existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added development-process Related to development process of DataFusion functions labels Jan 20, 2025
@Omega359
Copy link
Contributor Author

I would really appreciate it if reviewers could verify the performance improvements on any hardware they have available - mac, deployment servers, etc using the cargo benchmarks listed in #14210

@Omega359 Omega359 marked this pull request as ready for review January 21, 2025 13:54
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 -- this is really cool to see this level of attention to performance

I think @samuelcolvin (and maybe @adriangb ) did some pretty hard core optimization over in arrow-rs recently on

@@ -286,11 +286,17 @@ jobs:
uses: ./.github/actions/setup-builder
with:
rust-version: stable
- name: Install emscripten dependency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this needed for?

rand = { workspace = true }
regex = { workspace = true, optional = true }
sha2 = { version = "^0.10.1", optional = true }
stringzilla = { version = "3", optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest concern is these new library dependencies as our dependency. I haven't looked at it in detail but I think we need to evaluate how likely it is for it to be maintained, etc

https://github.com/ashvardanian/stringzilla looks pretty solild though after a quick look

Also, I think we should be planning to upstream as many of the low level operations (like string contains) as possible to arrow-rs (so we can have a larger user base to help improve them)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think @samuelcolvin 's experience was that the different implemnetations were faster/slower on some architectures / cahce sizes. So unless we have significant evidence this is faster for most/all platforms I would be hesitant to include it

@@ -0,0 +1,162 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to break this benchmark into its own PR?

This is mostly a selfish ask as I have comparison scripts for performance that compare performance against main, and they only work if the benchmark is on main as well

@Omega359 Omega359 closed this Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of DataFusion functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential performance improvement in contains and strpos functions
2 participants