Charabia v0.8.0
Released by meili-bot on 29 Jun 11:36 · 222 commits to main since this release
Changes
Main Changes
Add options to customize the Tokenizer's segmentation (#215)
A new `separators` method has been added to the `TokenizerBuilder`, allowing you to customize the separators used to segment a text:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure the separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);
```
A new `words_dict` method has been added to the `TokenizerBuilder`, allowing you to override the segmentation of specific words:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure the words dictionary.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
```
This words dictionary overrides the segmentation of the dictionary's words: the tokenizer finds all occurrences of these words before any language-based segmentation is applied. If some of the terms are also in the stop_words list or in the separators list, they will still be categorized as `TokenKind::StopWord` or `TokenKind::Separator` respectively.
Other changes
- Update Lindera to 0.24.0 (#212) @mosuka
- Transform classifier into normalizer (#214) @ManyTheFish
- Handle underscore similarly to dash (#216) @vvv
- Enhance Japanese Tokenization (#218) @mosuka
- Add helper methods (#222) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], @mosuka and @vvv! 🎉