[Ready] Reduce reload word tokenizer engine in word_tokenize #1064

new5558 · 2025-01-11T08:08:12Z

What does this change

Fixes #973

What was wrong

How this fixes it

Fixes #...

Your checklist for this pull request

Passed code styles and structures
Passed code linting checks and unit tests
Tested Attacut and ICU manually
Benchmarked speed improvement

pep8speaks · 2025-01-11T08:08:19Z

Hello @new5558! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2025-01-11 15:09:30 UTC

coveralls · 2025-01-11T08:14:06Z

coverage: 52.795% (-0.04%) from 52.836%
when pulling d8d22f6 on new5558:dev
into 3aa57c6 on PyThaiNLP:dev.

new5558 · 2025-01-11T14:38:59Z

Manual test results for the Attacut and ICU tokenizers, along with the benchmark results, can be found in this Colab Notebook.

Performance Improvements

(Time per tokenizer function call)

| Tokenizer | Original (ms) | This PR (ms) | Original, 50 threads (ms) | This PR, 50 threads (ms) |
| --- | --- | --- | --- | --- |
| Attacut | 9.83 | 1.67 | 17.18 | 2.83 |
| ICU | 0.03 | 0.03 | 0.19 | 0.27 |
| Longest | 0.02 | 0.01 | 0.21 | 0.20 |
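For readers who want to reproduce this kind of per-call measurement, here is a minimal sketch of how mean time per call could be collected, single-threaded and with a thread pool. It is not the notebook's code: `fake_tokenize` and `mean_ms_per_call` are hypothetical names, the stand-in tokenizer just splits on whitespace, and dividing wall-clock time by the number of calls is an assumption about the methodology.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_tokenize(text):
    """Hypothetical stand-in for a tokenizer call; swap in the real function."""
    return text.split()


def mean_ms_per_call(func, text, calls=200, threads=1):
    """Return the mean wall-clock time per call in milliseconds."""
    start = time.perf_counter()
    if threads == 1:
        for _ in range(calls):
            func(text)
    else:
        # Spread the same number of calls over a pool of worker threads.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(func, [text] * calls))
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / calls


single = mean_ms_per_call(fake_tokenize, "สวัสดี PythaiNLP")
threaded = mean_ms_per_call(fake_tokenize, "สวัสดี PythaiNLP", threads=50)
print(f"single: {single:.4f} ms, 50 threads: {threaded:.4f} ms")
```

Replacing `fake_tokenize` with a real tokenizer call (for example a `lambda` wrapping `word_tokenize` with a chosen engine) would give numbers comparable in spirit to the table, though absolute values depend on hardware and warm-up.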

The Attacut engine is about 6x faster when tokenizing the text "สวัสดี PythaiNLP" (9.83 ms down to 1.67 ms per call).
There is not much improvement for the ICU and Longest engines.
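The general idea behind avoiding repeated engine reloads can be sketched as below. This is an illustration only, not the code in this PR: `SlowEngine`, `get_engine`, and `tokenize_cached` are hypothetical names, and using `functools.lru_cache` is an assumed way to keep one engine instance alive between calls.

```python
from functools import lru_cache
from typing import List


class SlowEngine:
    """Hypothetical stand-in for a tokenizer engine that is costly to build."""

    load_count = 0  # how many times an engine has been constructed

    def __init__(self, name: str) -> None:
        SlowEngine.load_count += 1
        self.name = name

    def tokenize(self, text: str) -> List[str]:
        # Placeholder logic: a real engine would run a model here.
        return text.split()


@lru_cache(maxsize=None)
def get_engine(name: str) -> SlowEngine:
    """Build the engine once per name and reuse it on later calls."""
    return SlowEngine(name)


def tokenize_cached(text: str, engine: str = "demo") -> List[str]:
    """Tokenize with a cached engine instead of rebuilding it per call."""
    return get_engine(engine).tokenize(text)


first = tokenize_cached("สวัสดี PythaiNLP")
second = tokenize_cached("สวัสดี PythaiNLP")
print(first, SlowEngine.load_count)
```

A cache like this pays off mainly when constructing the engine is expensive, which would be consistent with the large gain for Attacut and the small change for ICU and Longest. Under heavy concurrency, the first few calls may still race to build the engine before the cache is populated.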

sonarqubecloud · 2025-01-11T15:10:05Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

wannaphong

Thank you! 💯

new5558 added 2 commits January 11, 2025 08:02

perf: improve word tokenizer speed

9907cb3

Merge branch 'dev' of https://github.com/new5558/pythainlp into dev

9efd6e7

fix: CI error

7532488

new5558 changed the title ~~[WIP] Reduce reload word tokenizer engine in word_tokenize~~ Reduce reload word tokenizer engine in word_tokenize Jan 11, 2025

fix: CI error 2

ef44e60

new5558 changed the title ~~Reduce reload word tokenizer engine in word_tokenize~~ [Ready] Reduce reload word tokenizer engine in word_tokenize Jan 11, 2025

fix: CI error 3

d8d22f6

wannaphong approved these changes Jan 13, 2025

wannaphong merged commit ae4c5fa into PyThaiNLP:dev Jan 13, 2025
23 of 25 checks passed
