Releases: meilisearch/charabia

Charabia v0.8.4

20 Sep 11:23
8ca0156

Changes

  • Update Lindera to v0.27.1 for changing the UniDic download URL (#237) @mosuka
  • Implement the CharNormalizer trait on the LowercaseNormalizer struct (#241) @Bradshaw

Thanks again to @Bradshaw, @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉

Charabia v0.8.3

22 Aug 12:02
de62ab9

Changes

Thanks again to @Kerollmops, @dependabot, @dependabot[bot], and @meili-bors[bot]! 🎉

Charabia v0.8.2

19 Jul 10:08
a191131

Changes

Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], and @mosuka! 🎉

Charabia v0.8.1

29 Jun 14:40
5f8abfe

Changes

Thanks again to @ManyTheFish! 🎉

Charabia v0.8.0

29 Jun 11:36
f3cc03b

Changes

Main Changes

Add options to customize the Tokenizer's segmentation (#215)

A new separators method has been added to the TokenizerBuilder. It lets you customize the separators used to segment a text:

use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure the separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);

A new words_dict method has been added to the TokenizerBuilder. It lets you override the segmentation of specific words:

use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure the words dictionary.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);

The words dictionary overrides the default segmentation for the words it contains:
the tokenizer finds all occurrences of these words before applying any language-based segmentation.
If some of these words are also in the stop_words list or in the separators list,
they are still categorized as TokenKind::StopWord or TokenKind::Separator.
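
To make the order of operations concrete, here is a minimal, stdlib-only sketch of dictionary-first segmentation. It is not charabia's implementation: the Kind enum and the longest_prefix and segment functions are hypothetical names introduced for illustration, stop words are left out, and longest-match is an assumption about how overlapping candidates are resolved.

```rust
/// Kind of a segment in this sketch (hypothetical; not charabia's TokenKind).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Word,
    Separator,
}

/// Return the longest candidate that `text` starts with, if any.
fn longest_prefix<'a>(text: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|c| !c.is_empty() && text.starts_with(*c))
        .max_by_key(|c| c.len())
}

/// Segment `text`, trying dictionary words first and separators second.
fn segment<'a>(text: &'a str, words: &[&str], separators: &[&str]) -> Vec<(&'a str, Kind)> {
    let mut out = Vec::new();
    let mut start = 0; // start of the pending, not yet emitted, plain word
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        let hit = match longest_prefix(rest, words) {
            // a dictionary word that is also a separator keeps the Separator kind
            Some(w) if separators.contains(&w) => Some((w.len(), Kind::Separator)),
            Some(w) => Some((w.len(), Kind::Word)),
            None => longest_prefix(rest, separators).map(|s| (s.len(), Kind::Separator)),
        };
        match hit {
            Some((len, kind)) => {
                // flush the plain word accumulated before this match
                if start < pos {
                    out.push((&text[start..pos], Kind::Word));
                }
                out.push((&text[pos..pos + len], kind));
                pos += len;
                start = pos;
            }
            // no match here: advance by one whole char to stay on a UTF-8 boundary
            None => pos += rest.chars().next().map_or(1, |c| c.len_utf8()),
        }
    }
    if start < text.len() {
        out.push((&text[start..], Kind::Word));
    }
    out
}
```

Called with the words and separators from the examples above, segment keeps "J. R. R." in one piece because the dictionary is consulted before the ". " separator gets a chance to split it.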

Other changes

Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], @mosuka and @vvv! 🎉

Charabia v0.7.2

26 Apr 11:59
cd1ad65

Changes

Main Changes

Misc

Thanks again to @CaroFG, @DrAliRagab, @ManyTheFish, @akeamc, @dependabot, @dependabot[bot], @goodhoko, and @mosuka! 🎉

Charabia v0.7.1

16 Feb 13:18
e2bc8d1

Changes

Thanks again to @ManyTheFish, @choznerol, @cymruu and @james-2001! 🎉

Charabia v0.7.0

14 Dec 14:53
037f912

Changes

Breaking changes ⚠️

  • Classify a Token before normalizing it, to avoid false positives in stop words (#169) @ManyTheFish
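
The following stdlib-only sketch illustrates why the order matters. It does not use charabia: the toy normalize function, the two is_stop_word_* helpers, and the choice of "thé" versus "the" are assumptions made for illustration only.

```rust
use std::collections::HashSet;

/// Toy normalizer (assumption): lowercases and strips a few accents.
fn normalize(word: &str) -> String {
    word.chars()
        .flat_map(char::to_lowercase)
        .map(|c| match c {
            'é' | 'è' | 'ê' => 'e',
            'à' | 'â' => 'a',
            other => other,
        })
        .collect()
}

/// Old order: normalize, then look the result up in the stop words.
fn is_stop_word_normalize_first(word: &str, stop_words: &HashSet<&str>) -> bool {
    stop_words.contains(normalize(word).as_str())
}

/// New order: look the original word up before any normalization.
fn is_stop_word_classify_first(word: &str, stop_words: &HashSet<&str>) -> bool {
    stop_words.contains(word)
}
```

With "the" as the only stop word, the French word "thé" is reported as a stop word by the normalize-first variant, because it normalizes to "the", but not by the classify-first variant.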

Thanks again to @ManyTheFish, @Roms1383, @Sokom141, @choznerol, @crudiedo, @curquiza, @daniel-shuy, @dependabot, @dependabot[bot], @dureuill, @harshalkhachane, @mosuka, @qbx2 and @yenwel! 🎉

Charabia v0.6.0

22 Aug 12:11
faafb12

Changes

Breaking changes ⚠️

Add option to disable char map creation (#109) @matthias-wright

The Token::original_lengths(..) method, used to find the original index of a character in a normalized string, requires the TokenizerBuilder::create_char_map(..) setting to be set to true to work properly.
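
As a rough illustration of what a char map is for, the stdlib-only sketch below records, for each original character, its byte length before and after normalization, then uses that record to map a normalized offset back to the original string. The CharMap alias, normalize_with_map, and original_offset are hypothetical names, and the layout is an assumption rather than charabia's actual data structure.

```rust
/// Hypothetical char map: one `(original_len, normalized_len)` pair, in bytes,
/// per character of the original string.
type CharMap = Vec<(usize, usize)>;

/// Toy normalizer (assumption): lowercases, maps 'é' to 'e' and 'œ' to "oe",
/// and records the char map along the way.
fn normalize_with_map(original: &str) -> (String, CharMap) {
    let mut normalized = String::new();
    let mut map = CharMap::new();
    for c in original.chars() {
        let before = normalized.len();
        match c {
            'œ' | 'Œ' => normalized.push_str("oe"),
            'é' | 'É' => normalized.push('e'),
            other => normalized.extend(other.to_lowercase()),
        }
        map.push((c.len_utf8(), normalized.len() - before));
    }
    (normalized, map)
}

/// Map a byte offset in the normalized string back to a byte offset in the original.
fn original_offset(map: &CharMap, normalized_offset: usize) -> usize {
    let mut norm = 0;
    let mut orig = 0;
    for &(orig_len, norm_len) in map {
        if norm >= normalized_offset {
            break;
        }
        norm += norm_len;
        orig += orig_len;
    }
    orig
}
```

Without the recorded pairs there is no way to recover the original offset once characters change byte length during normalization, which is why the lookup depends on the map having been created.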

Thanks again to @ManyTheFish, @aFluffyHotdog, @matthias-wright and @mosuka! 🎉

Charabia v0.5.1

05 Jul 10:01
17b0f5f

Changes

  • Fix typo in docstring (#106) @matthias-wright
  • Update Hebrew segmenter link to unicode-segmentation instead of Jieba (#108) @ManyTheFish
  • Specify the exact version of lindera we’re using since they broke the compilation on a minor version (#110) @irevoire

Thanks again to @ManyTheFish, @irevoire and @matthias-wright! 🎉