Latin camelcase wrong segmentation #317

PedroTurik · 2024-10-29T04:15:48Z

Pull Request

All interesting changes are in charabia/src/segmenter/latin/camel_case.rs
other changes are import reordering caused by cargo fmt.

This change definitely looks like a small speed regression, but I will be opening for review nonetheless.

First time contributor to any meilisearch repo.

Related issue

Fixes #289

What does this PR do?

This PR changes the Latin camelCase segmentation to address the following cases:

openSSL should be segmented as ['open', 'SSL']
openSSLError should be segmented as ['open', 'SSL', 'Error']

This is done by introducing a helper iteration for the main iterator to "peek" the next char, and this is needed in the currently solution with StrGroupBy::linear_group_by afaik since the segmentation sometimes depends on the existence of a lowercase letter after the one currently being analyzed ( like openSSLError )

Benchmarks

the benchmarks ran once WITH the changes, and once more after commenting the changes out. The output here is from after removing the changes

cargo bench

And what I believe is the relevant output is the following

segment/132/Latin/Eng   time:   [2.5651 µs 2.5714 µs 2.5786 µs]
                        change: [+0.6549% +0.8593% +1.0487%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

segment/132/Latin/Fra   time:   [2.3521 µs 2.3547 µs 2.3572 µs]
                        change: [-0.3826% -0.0443% +0.2662%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

segment/131/Latin/Deu   time:   [2.1346 µs 2.1372 µs 2.1403 µs]
                        change: [+0.4418% +0.5896% +0.7584%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

segment/363/Latin/Eng   time:   [6.6601 µs 6.6749 µs 6.6911 µs]
                        change: [+3.1214% +3.4133% +3.7267%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

segment/363/Latin/Fra   time:   [6.5405 µs 6.5481 µs 6.5566 µs]
                        change: [-0.4942% -0.2417% +0.0139%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

segment/365/Latin/Vie   time:   [5.6049 µs 5.6109 µs 5.6170 µs]
                        change: [-4.2406% -3.9811% -3.7043%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

tokenize/132/Latin/Eng  time:   [6.1291 µs 6.1349 µs 6.1406 µs]
                        change: [-1.4925% -1.3344% -1.1740%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
tokenize/132/Latin/Fra  time:   [6.9543 µs 6.9617 µs 6.9696 µs]
                        change: [-1.5821% -1.3067% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

tokenize/131/Latin/Deu  time:   [6.4793 µs 6.4916 µs 6.5050 µs]
                        change: [-3.1861% -2.9351% -2.6852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

tokenize/363/Latin/Eng  time:   [15.510 µs 15.523 µs 15.535 µs]
                        change: [-8.2776% -7.9652% -7.6477%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high mild
tokenize/363/Latin/Fra  time:   [18.026 µs 18.044 µs 18.064 µs]
                        change: [-2.4157% -1.5634% -1.0380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

tokenize/365/Latin/Vie  time:   [27.604 µs 27.626 µs 27.651 µs]
                        change: [-2.5536% -1.9732% -1.4007%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

tokenize/354/Latin/Deu  time:   [17.874 µs 17.904 µs 17.936 µs]
                        change: [+2.2411% +2.4528% +2.6719%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

…oc + lindera due to environment problems

PedroTurik added 3 commits October 28, 2024 18:18

Changes segmentation strategy for latin camelCase, and disables jmall…

f015a44

…oc + lindera due to environment problems

reintroduces the code previously removed for testing

8c9f8ef

Fix grammar

6d8ae94

PedroTurik marked this pull request as ready for review October 29, 2024 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Latin camelcase wrong segmentation #317

Latin camelcase wrong segmentation #317

PedroTurik commented Oct 29, 2024 •

edited

Loading

Latin camelcase wrong segmentation #317

Are you sure you want to change the base?

Latin camelcase wrong segmentation #317

Conversation

PedroTurik commented Oct 29, 2024 • edited Loading

Pull Request

Related issue

What does this PR do?

This PR changes the Latin camelCase segmentation to address the following cases:

Benchmarks

PR checklist

PedroTurik commented Oct 29, 2024 •

edited

Loading