Charabia v0.8.0
Released by meili-bot on 29 Jun 11:36 · 222 commits to main since this release
Changes
Main Changes
Add options to customize the Tokenizer's segmentation (#215)
A new `separators` method has been added to the `TokenizerBuilder`, allowing you to customize the separators used to segment a text:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure the separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);
```
A new `words_dict` method has been added to the `TokenizerBuilder`, allowing you to override the segmentation of specific words:
```rust
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure the words dictionary.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
let output: Vec<_> = tokenizer.segment_str(orig).collect();

assert_eq!(
    &output,
    &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
```
This words dictionary overrides the segmentation of the dictionary's words: the tokenizer finds all occurrences of these words before any language-based segmentation is applied. If some of the terms are also in the stop_words list or in the separators list, they will still be categorized as `TokenKind::StopWord` or `TokenKind::Separator` respectively.
Other changes
- Update Lindera to 0.24.0 (#212) @mosuka
- Transform classifier into normalizer (#214) @ManyTheFish
- Handle underscore similarly to dash (#216) @vvv
- Enhance Japanese Tokenization (#218) @mosuka
- Add helper methods (#222) @ManyTheFish
Thanks again to @ManyTheFish, @dependabot, @dependabot[bot], @meili-bors[bot], @mosuka and @vvv! 🎉