Fix performance regression in split by avoiding allocating substring per char #237

Merged
stephenamar-db merged 3 commits into databricks:master from fix-split-perf-regression
Dec 12, 2024

Conversation

JoshRosen
Contributor

@JoshRosen JoshRosen commented Dec 12, 2024

This PR fixes a performance regression from #227 / 4c85bde which I overlooked in review:

When generalizing the optimized non-Pattern-based split code, that commit introduced a .substring() call at each character position, which allocates a short-lived string per character and produces a lot of garbage.

Instead, I think we can call .startsWith(splitPattern, i), which compares against the pattern in place at offset i: this should be much faster because it avoids creating throwaway strings (plus I'm pretty sure that startsWith is optimized in modern JDKs).
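
A minimal sketch of the idea, not the code from this PR: `SplitSketch` and `splitOn` are hypothetical names, and the loop shape is an assumption. It only illustrates checking for the separator with `String.startsWith(prefix, offset)` instead of building a substring at every index.

```scala
object SplitSketch {
  // Hypothetical helper for illustration; not the project's actual split code.
  def splitOn(str: String, sep: String): Vector[String] = {
    require(sep.nonEmpty, "separator must be non-empty")
    val out = Vector.newBuilder[String]
    var start = 0 // index where the current piece begins
    var i = 0
    while (i <= str.length - sep.length) {
      // Allocating variant would be: str.substring(i, i + sep.length) == sep
      // startsWith(prefix, offset) compares characters in place instead.
      if (str.startsWith(sep, i)) {
        out += str.substring(start, i) // allocate only for real pieces
        i += sep.length
        start = i
      } else {
        i += 1
      }
    }
    out += str.substring(start) // trailing piece (may be empty)
    out.result()
  }
}
```

With this shape, a string is allocated once per output piece rather than once per input character.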

I also removed the use of breakable and replaced it with an update to the while condition.
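
To illustrate the kind of change this describes, here is a hedged sketch that folds an early-exit check into the `while` condition instead of wrapping the loop in `breakable { ... break() }`. It assumes the early exit is a split-count limit; `SplitLimitSketch`, `splitLimit`, and `maxSplits` are hypothetical names, and a negative `maxSplits` is assumed to mean "no limit".

```scala
object SplitLimitSketch {
  // Hypothetical helper for illustration; not the project's actual code.
  def splitLimit(str: String, sep: String, maxSplits: Int): Vector[String] = {
    require(sep.nonEmpty, "separator must be non-empty")
    val out = Vector.newBuilder[String]
    var start = 0  // index where the current piece begins
    var i = 0
    var splits = 0 // number of separators consumed so far
    // The limit check is part of the loop condition, so no breakable/break is needed.
    while (i <= str.length - sep.length && (maxSplits < 0 || splits < maxSplits)) {
      if (str.startsWith(sep, i)) {
        out += str.substring(start, i)
        i += sep.length
        start = i
        splits += 1
      } else {
        i += 1
      }
    }
    out += str.substring(start) // remainder after the last split
    out.result()
  }
}
```

Keeping the exit test in the condition avoids the exception-based control flow that `scala.util.control.Breaks` uses.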

@JoshRosen JoshRosen changed the title from "Fix performance regression in splitLimit by avoiding allocating substring per char" to "Fix performance regression in split by avoiding allocating substring per char" Dec 12, 2024
@stephenamar-db stephenamar-db merged commit 680b1a8 into databricks:master Dec 12, 2024
6 checks passed
@JoshRosen JoshRosen deleted the fix-split-perf-regression branch December 31, 2024 22:40