Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latin camelcase wrong segmentation #317

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

PedroTurik
Copy link

@PedroTurik PedroTurik commented Oct 29, 2024

Pull Request

All interesting changes are in charabia/src/segmenter/latin/camel_case.rs
other changes are import reordering caused by cargo fmt.

This change definitely looks like a small speed regression, but I will be opening for review nonetheless.

First time contributor to any meilisearch repo.

Related issue

Fixes #289

What does this PR do?

This PR changes the Latin camelCase segmentation to address the following cases:

  • openSSL should be segmented as ['open', 'SSL']
  • openSSLError should be segmented as ['open', 'SSL', 'Error']

This is done by introducing a helper iteration for the main iterator to "peek" the next char, and this is needed in the currently solution with StrGroupBy::linear_group_by afaik since the segmentation sometimes depends on the existence of a lowercase letter after the one currently being analyzed ( like openSSLError )

Benchmarks

the benchmarks ran once WITH the changes, and once more after commenting the changes out. The output here is from after removing the changes

cargo bench

And what I believe is the relevant output is the following

segment/132/Latin/Eng   time:   [2.5651 µs 2.5714 µs 2.5786 µs]
                        change: [+0.6549% +0.8593% +1.0487%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

segment/132/Latin/Fra   time:   [2.3521 µs 2.3547 µs 2.3572 µs]
                        change: [-0.3826% -0.0443% +0.2662%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

segment/131/Latin/Deu   time:   [2.1346 µs 2.1372 µs 2.1403 µs]
                        change: [+0.4418% +0.5896% +0.7584%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

segment/363/Latin/Eng   time:   [6.6601 µs 6.6749 µs 6.6911 µs]
                        change: [+3.1214% +3.4133% +3.7267%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

segment/363/Latin/Fra   time:   [6.5405 µs 6.5481 µs 6.5566 µs]
                        change: [-0.4942% -0.2417% +0.0139%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

segment/365/Latin/Vie   time:   [5.6049 µs 5.6109 µs 5.6170 µs]
                        change: [-4.2406% -3.9811% -3.7043%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

tokenize/132/Latin/Eng  time:   [6.1291 µs 6.1349 µs 6.1406 µs]
                        change: [-1.4925% -1.3344% -1.1740%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
tokenize/132/Latin/Fra  time:   [6.9543 µs 6.9617 µs 6.9696 µs]
                        change: [-1.5821% -1.3067% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

tokenize/131/Latin/Deu  time:   [6.4793 µs 6.4916 µs 6.5050 µs]
                        change: [-3.1861% -2.9351% -2.6852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

tokenize/363/Latin/Eng  time:   [15.510 µs 15.523 µs 15.535 µs]
                        change: [-8.2776% -7.9652% -7.6477%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high mild
tokenize/363/Latin/Fra  time:   [18.026 µs 18.044 µs 18.064 µs]
                        change: [-2.4157% -1.5634% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

tokenize/365/Latin/Vie  time:   [27.604 µs 27.626 µs 27.651 µs]
                        change: [-2.5536% -1.9732% -1.4007%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

tokenize/354/Latin/Deu  time:   [17.874 µs 17.904 µs 17.936 µs]
                        change: [+2.2411% +2.4528% +2.6719%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

@PedroTurik PedroTurik marked this pull request as ready for review October 29, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

latin-camelcase feature make wrong segmentation
1 participant