Releases: meilisearch/charabia
Charabia v0.8.4
Changes
- Update Lindera to v0.27.1, which changes the UniDic download URL (#237) @mosuka
- Implement the CharNormalizer trait on the LowercaseNormalizer struct (#241) @Bradshaw
Thanks again to @Bradshaw, @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉
Charabia v0.8.3
Changes
- Remap the char map when lowercasing strings (#234) @Kerollmops
Thanks again to @Kerollmops, @dependabot, @dependabot[bot], and @meili-bors[bot]! 🎉
Charabia v0.8.2
Changes
- Update Lindera to 0.27.0 (#227) @mosuka
- Fix the pre-segmenter when a string starts with an uncategorized character (#231) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉
Charabia v0.8.1
Charabia v0.8.0
Changes
Main Changes
Add options to customize the Tokenizer's segmentation (#215)
A new separators method has been added to the TokenizerBuilder, letting you customize the separators used to segment a text:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure the separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);
```
A new words_dict method has been added to the TokenizerBuilder, letting you override the segmentation of specific words:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure the words dictionary.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
```
This words dictionary overrides the segmentation of the listed words: the tokenizer finds all occurrences of these words before any language-based segmentation. If some of these words also appear in the stop_words list or in the separators list, they will be categorized as TokenKind::StopWord or TokenKind::Separator as well.
Other changes
- Update Lindera to 0.24.0 (#212) @mosuka
- Transform classifier into normalizer (#214) @ManyTheFish
- Handle underscore similarly to dash (#216) @vvv
- Enhance Japanese Tokenization (#218) @mosuka
- Add helper methods (#222) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], @mosuka and @vvv! 🎉
Charabia v0.7.2
Changes
Main Changes
- Split camelCase in Latin segmenter (#181) @goodhoko
- Improve Arabic Normalizer (#204) @DrAliRagab
- Improve Arabic language segmentation (#205) @DrAliRagab
- Enhance Quotation marks support (#211) @ManyTheFish
Misc
- Upgrade Lindera resolving the issue reported in #183 (#195) @mosuka
- Regularly check for kVariant updates (#192) @goodhoko
- Replace Regex with str::find (#196) @goodhoko
- Correct typo (#198) @CaroFG
- Fix compilation with no default features (#202) @akeamc
- Move and update Charabia readme (#199) @ManyTheFish
- Add readme symlink (#206) @ManyTheFish
- fix CI updating cargo version (#208) @ManyTheFish
- Update dependencies (#209) @ManyTheFish
Thanks again to @CaroFG, @DrAliRagab, @ManyTheFish, @akeamc, @dependabot, @dependabot[bot], @goodhoko, and @mosuka! 🎉
Charabia v0.7.1
Changes
- Reduce crate size by compressing dictionaries (#171) @choznerol @ManyTheFish
- Fix script lang serialization (#180) @ManyTheFish
- feat: enable nonspacing-marks normalizer for Greek scripts and introduce Greek normalizer which unifies sigma character (#182) @cymruu
- add tatweel normalizer (#187) @james-2001
- update lindera (#190) @ManyTheFish
- Use irg-kvariant crate in Charabia (#191) @ManyTheFish
Thanks again to @ManyTheFish, @choznerol, @cymruu and @james-2001! 🎉
Charabia v0.7.0
Changes
- Add CI to update the Charabia version in Cargo.toml (#119) @curquiza
- Add dependabot for GHA (#122) @curquiza
- Upgrade ubuntu-18.04 to 20.04 (#125) @curquiza
- Upgrade lindera to 0.16.0 (#126) @mosuka
- Upgrade Whatlang dependency (#142) @Sokom141
- Implement Pinyin normalizer (#143) @crudiedo
- Add NonspacingMark normalizer (#146) @crudiedo
- Separate out FstSegmenter from ThaiSegmenter (#147) @daniel-shuy
- add allow list to tokenizer (#148) @yenwel
- Add korean support (#154) @qbx2
- Add Japanese normalizer to cover Katakana to Hiragana (#149) @choznerol
- Test thai homographs (#155) @Roms1383
- Disable HMM feature of Jieba (#158) @harshalkhachane
- Simplify normalizer implementation (#157) @ManyTheFish
- Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants (#162) @choznerol
- Fix incorrect File::read for kVariants.tsv (#165) @choznerol
- Add a Compatibility Decomposition Normalizer, remove Latin normalizer (#166) @dureuill
- impl name and from_name on enums (#152) @ManyTheFish
Breaking changes ⚠️
- Classify a Token before normalizing it, avoiding false positives in stop words (#169) @ManyTheFish
Thanks again to @ManyTheFish, @Roms1383, @Sokom141, @choznerol, @crudiedo, @curquiza, @daniel-shuy, @dependabot, @dependabot[bot], @dureuill, @harshalkhachane, @mosuka, @qbx2 and @yenwel! 🎉
Charabia v0.6.0
Changes
- Resolving version mismatches occurring in Lindera (#112) @mosuka
- Add Thai segmenter (#114) @aFluffyHotdog
- Optimize Thai segmenter (#115) @ManyTheFish
- Deactivate lowercase normalizer when the Script doesn't contain case modifiers (#116) @ManyTheFish
- Release v0.6.0 (#117) @ManyTheFish
Breaking changes ⚠️
Add option to disable char map creation (#109) @matthias-wright
The Token::original_lengths(..) method, used to find the original index of a character in a normalized string, requires the TokenizerBuilder::create_char_map(..) setting to be set to true to work properly.
Thanks again to @ManyTheFish, @aFluffyHotdog, @matthias-wright and @mosuka! 🎉
Charabia v0.5.1
Changes
- Fix typo in docstring (#106) @matthias-wright
- Update Hebrew segmenter link to unicode-segmentation instead of Jieba (#108) @ManyTheFish
- Specify the exact version of lindera we’re using since they broke the compilation on a minor version (#110) @irevoire
Thanks again to @ManyTheFish, @irevoire and @matthias-wright! 🎉