Commit

Merge pull request #172 from meilisearch/update-version-v0.7.0
Update version for the next release (v0.7.0) in Cargo.toml files
ManyTheFish authored Dec 14, 2022
2 parents 685c136 + b1c1e67 commit a19679c
Showing 2 changed files with 9 additions and 7 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "charabia"
-version = "0.6.0"
+version = "0.7.0"
license = "MIT"
authors = ["Many <[email protected]>"]
edition = "2021"
14 changes: 8 additions & 6 deletions README.md
@@ -14,13 +14,15 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor
**Charabia is multilingual**, featuring optimized support for:


-| Script - Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
+| Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
|---------------------|-------------------------------------------------------------------------------|---------------------------|-------------------|---|
-| **Latin** - **Any** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ lowercase + deunicode | 🟨 ~13MiB/sec | 🟧 ~5MiB/sec |
-| **Chinese** - **CMN** 🇨🇳 |[jieba](https://github.com/messense/jieba-rs) | ✅ traditional-to-simplified conversion | 🟨 ~9MiB/sec | 🟧 ~5MiB/sec |
-| **Hebrew** 🇮🇱 |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ diacritics removal | 🟩 ~21MiB/sec | 🟨 ~11MiB/sec |
-| **Japanese** 🇯🇵 |[lindera](https://github.com/lindera-morphology/lindera) | ✅ convert to Hiragana | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
-| **Thai** 🇹🇭 |[dictionary based](https://github.com/PyThaiNLP/nlpo3) || 🟩 ~23MiB/sec | 🟨 ~14MiB/sec |
+| **Latin** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟨 ~14MiB/sec | 🟨 ~8MiB/sec |
+| **Cyrillic** - **Greek** - **Georgian** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟨 ~14MiB/sec | 🟨 ~8MiB/sec |
+| **Chinese** **CMN** 🇨🇳 |[jieba](https://github.com/messense/jieba-rs) |[compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~11MiB/sec | 🟧 ~6MiB/sec |
+| **Hebrew** 🇮🇱 - **Arabic** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~10MiB/sec |
+| **Japanese** 🇯🇵 |[lindera](https://github.com/lindera-morphology/lindera) IPA-dict |[compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
+| **Korean** 🇰🇷 |[lindera](https://github.com/lindera-morphology/lindera) KO-dict |[compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
+| **Thai** 🇹🇭 |[dictionary based](https://github.com/PyThaiNLP/nlpo3) |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~26MiB/sec | 🟨 ~13MiB/sec |

We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).
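
For readers unfamiliar with the normalization steps named in the updated Latin row (compatibility decomposition + lowercase + nonspacing-marks removal), here is a minimal Python sketch of that pipeline. It is not Charabia's implementation, which is written in Rust. The function name `normalize_token` is hypothetical, and the sketch assumes that "compatibility decomposition" means Unicode NFKD and that "nonspacing marks" means code points in general category `Mn`, as the linked pages suggest.

```python
import unicodedata


def normalize_token(text: str) -> str:
    """Illustrative only: mimics the three steps listed in the Latin row.

    Hypothetical helper, not part of Charabia's API.
    """
    # Step 1: compatibility decomposition (assumed to be Unicode NFKD).
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: lowercase.
    lowered = decomposed.lower()
    # Step 3: drop nonspacing marks (Unicode general category Mn).
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")
```

Skipping step 2 gives the combination listed for the Hebrew, Arabic, and Thai rows, where only decomposition and mark removal apply.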
