Releases: meilisearch/charabia
Charabia v0.8.4
Changes
- Update Lindera to v0.27.1, which changes the UniDic download URL (#237) @mosuka
- Implement the CharNormalizer trait on the LowercaseNormalizer struct (#241) @Bradshaw
Thanks again to @Bradshaw, @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉
Charabia v0.8.3
Changes
- Remap the char map when lowercasing strings (#234) @Kerollmops
Thanks again to @Kerollmops, @dependabot, @dependabot[bot], and @meili-bors[bot]! 🎉
Charabia v0.8.2
Changes
- Update Lindera to 0.27.0 (#227) @mosuka
- Fix the pre-segmenter when a string starts with an uncategorized character (#231) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉
Charabia v0.8.1
Charabia v0.8.0
Changes
Main Changes
Add options to customize the Tokenizer's segmentation (#215)
A new separators method has been added to the TokenizerBuilder, letting you customize the separators used to segment a text:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure the separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);
```
A new words_dict method has been added to the TokenizerBuilder, letting you override the segmentation of specific words:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure the words dictionary.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
```
This words dictionary overrides the segmentation of the listed words: the tokenizer finds all occurrences of these words before any language-based segmentation. If some of these words also appear in the stop_words list or in the separators list, they will be categorized as TokenKind::StopWord or TokenKind::Separator as well.
Other changes
- Update Lindera to 0.24.0 (#212) @mosuka
- Transform classifier into normalizer (#214) @ManyTheFish
- Handle underscore similarly to dash (#216) @vvv
- Enhance Japanese Tokenization (#218) @mosuka
- Add helper methods (#222) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], @mosuka and @vvv! 🎉
Charabia v0.7.2
Changes
Main Changes
- Split camelCase in Latin segmenter (#181) @goodhoko
- Improve Arabic Normalizer (#204) @DrAliRagab
- Improve Arabic language segmentation (#205) @DrAliRagab
- Enhance Quotation marks support (#211) @ManyTheFish
Misc
- Upgrade Lindera resolving the issue reported in #183 (#195) @mosuka
- Regularly check for kVariant updates (#192) @goodhoko
- Replace Regex with str::find (#196) @goodhoko
- Correct typo (#198) @CaroFG
- Fix compilation with no default features (#202) @akeamc
- Move and update Charabia readme (#199) @ManyTheFish
- Add readme symlink (#206) @ManyTheFish
- fix CI updating cargo version (#208) @ManyTheFish
- Update dependencies (#209) @ManyTheFish
Thanks again to @CaroFG, @DrAliRagab, @ManyTheFish, @akeamc, @dependabot, @dependabot[bot], @goodhoko, and @mosuka! 🎉
Charabia v0.7.1
Changes
- Reduce crate size by compressing dictionaries (#171) @choznerol @ManyTheFish
- Fix script lang serialization (#180) @ManyTheFish
- feat: enable nonspacing-marks normalizer for Greek scripts and introduce Greek normalizer which unifies sigma character (#182) @cymruu
- add tatweel normalizer (#187) @james-2001
- update lindera (#190) @ManyTheFish
- Use irg-kvariant crate in Charabia (#191) @ManyTheFish
Thanks again to @ManyTheFish, @choznerol, @cymruu and @james-2001! 🎉
Charabia v0.7.0
Changes
- Add CI to update the Charabia version in Cargo.toml (#119) @curquiza
- Add dependabot for GHA (#122) @curquiza
- Upgrade ubuntu-18.04 to 20.04 (#125) @curquiza
- Upgrade lindera to 0.16.0 (#126) @mosuka
- Upgrade Whatlang dependency (#142) @Sokom141
- Implement Pinyin normalizer (#143) @crudiedo
- Add NonspacingMark normalizer (#146) @crudiedo
- Separate out FstSegmenter from ThaiSegmenter (#147) @daniel-shuy
- add allow list to tokenizer (#148) @yenwel
- Add korean support (#154) @qbx2
- Add Japanese normalizer to cover Katakana to Hiragana (#149) @choznerol
- Test thai homographs (#155) @Roms1383
- Disable HMM feature of Jieba (#158) @harshalkhachane
- Simplify normalizer implementation (#157) @ManyTheFish
- Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants (#162) @choznerol
- Fix incorrect File::read for kVariants.tsv (#165) @choznerol
- Add a Compatibility Decomposition Normalizer, remove Latin normalizer (#166) @dureuill
- impl name and from_name on enums (#152) @ManyTheFish
Breaking changes ⚠️
- Classify a Token before normalizing it, avoiding false positives in stop words (#169) @ManyTheFish
Thanks again to @ManyTheFish, @Roms1383, @Sokom141, @choznerol, @crudiedo, @curquiza, @daniel-shuy, @dependabot, @dependabot[bot], @dureuill, @harshalkhachane, @mosuka, @qbx2 and @yenwel! 🎉
Charabia v0.6.0
Changes
- Resolving version mismatches occurring in Lindera (#112) @mosuka
- Add Thai segmenter (#114) @aFluffyHotdog
- Optimize Thai segmenter (#115) @ManyTheFish
- Deactivate lowercase normalizer when the Script doesn't contain case modifiers (#116) @ManyTheFish
- Release v0.6.0 (#117) @ManyTheFish
Breaking changes ⚠️
Add option to disable char map creation (#109) @matthias-wright
The Token::original_lengths(..) method, used to find the original index of a character in a normalized string, requires the TokenizerBuilder::create_char_map(..) setting to be set to true to work properly.
Thanks again to @ManyTheFish, @aFluffyHotdog, @matthias-wright and @mosuka! 🎉
Charabia v0.5.1
Changes
- Fix typo in docstring (#106) @matthias-wright
- Update Hebrew segmenter link to unicode-segmentation instead of Jieba (#108) @ManyTheFish
- Specify the exact version of lindera we’re using since they broke the compilation on a minor version (#110) @irevoire
Thanks again to @ManyTheFish, @irevoire and @matthias-wright! 🎉