Add a `Sentence` struct, replace Vec<Token> with Sentence where possible #54

bminixhofer · 2021-03-25T16:59:21Z

This PR shifts the focus from Vec<Token> to a new Sentence struct and moves the fields on the Token which semantically belonged to the sentence (sentence and tagger) to the sentence. Includes some general improvements like:

Returning an iterator over Sentence / IncompleteSentence instead of a Vec<Vec<Token>> / Vec<Vec<IncompleteToken>>.
Fixing indices on the token (byte_span / char_span) by making them relative to the input text (as opposed to relative to each sentence, as currently the case) and making them a Range instead of (usize, usize).

Future work in this direction would include rethinking the distinction between IncompleteSentence / Sentence and possibly unifying them or removing the need for one of them, but that's out of scope here.

…completeSentence for Vec<IncompleteToken>

bminixhofer · 2021-03-27T16:17:27Z

This is ready for a review @drahnr if you have the time. It turned out more substantial than I initially thought. In addition to the things mentioned above:

there is now Span and Position structs which track the byte and char indices at the same time. (so instead of token.char_span() you now write token.span().char()).
fields on the Suggestion, IncompleteToken and Token are now private and have getters.
there is a private new MatchSentence struct which represents a sentence prepared for matching on the rules. So the special SENT_START token is not visible in the public API anymore and each token in the Sentence and IncompleteSentence corresponds to at least one character in the input text.

There's a big diff because of many trivial changes, most of the interesting ones are in the tokenizer.rs and types.rs. Again, I'd especially appreciate feedback regarding the changes and additions to the public API. Thanks!

drahnr

That looks like it was quite a bit of work, so much appreciated!

A few nitpicks which are intertwined:
.span().char().start() does add quite a bit of visual clutter, I would recommend to go for fn char_range() -> Range<usize> { .. } which allows direct modification of the inner types, so .char_range().start could be used instead and avoid 2x () all over the place.

I think a few redundant calls to .chars().count() could (and eventually should) be avoided.

But again, just nits 👍

build/src/lib.rs

nlprule/src/rule/engine/composition.rs

nlprule/src/rule/engine/mod.rs

nlprule/src/rule/grammar.rs

nlprule/src/tokenizer.rs

bminixhofer · 2021-03-28T16:50:55Z

Thanks a lot for the review!

I'm a bit conflicted regarding .char_range and .byte_range. Adding them would allow us to make .span pub(crate) which avoids the naming problem but it means:

trying to enforce that .char_range and byte_range are never used internally (internally, .span() should be used since it allows e.g. getting the starting byte and char index with .span().start()).
Code duplication by adding a char_range and byte_range method to every struct that has a Span (Suggestion, Sentence, IncompleteSentence, IncompleteToken, Token).
It makes some things which are easy to do with the Span API harder e.g. getting the start byte and char index as Position is .span().start() but would need .char_range().start and .byte_range().start otherwise. This is then easy to pass to e.g. rshift on the Suggestion which needs a Position.

so I think I'll keep it as it is. I'll look over this PR once more tomorrow and merge it then. Also if you haven't seen it already you might be interested in #50 (comment), an update regarding the spellchecking discussion.

…ble (bminixhofer#54) * replace Vec<Token> with new Sentence struct where possible (+ with IncompleteSentence for Vec<IncompleteToken>) * separate match sentence and match graph, reduce dependents on tokenizer * fix missing SENT_START special case, debug impls for WordId, PosId * make MatchSentence private, docs * use new Span struct for byte and char ranges * fix PartialOrd impl on Position, get_token_str -> get_token_ranges

bminixhofer mentioned this pull request Mar 25, 2021

Token as returned by pipe() is relative to the sentence boundaries #53

Closed

bminixhofer added 2 commits March 26, 2021 09:44

replace Vec<Token> with new Sentence struct where possible (+ with In…

41e517f

…completeSentence for Vec<IncompleteToken>

separate match sentence and match graph, reduce dependents on tokenizer

199978f

bminixhofer force-pushed the sentence branch from 528ac1c to 199978f Compare March 26, 2021 08:45

Benjamin Minixhofer and others added 5 commits March 26, 2021 14:40

fix missing SENT_START special case, debug impls for WordId, PosId

5edfa3e

make MatchSentence private, docs

0d278dc

use new Span struct for byte and char ranges

3da9ac4

fix tests

dd38402

fix out of bonds accesses, improve docs and tests

fb8d1ba

bminixhofer marked this pull request as ready for review March 27, 2021 14:40

fix clippy lints

a04594c

one more lint

a6784c4

drahnr approved these changes Mar 27, 2021

View reviewed changes

address review comments

cb1c3c1

bminixhofer linked an issue Mar 28, 2021 that may be closed by this pull request

Token as returned by pipe() is relative to the sentence boundaries #53

Closed

fix PartialOrd impl on Position, get_token_str -> get_token_ranges

faff863

bminixhofer merged commit 17fe2b6 into main Mar 30, 2021

bminixhofer deleted the sentence branch April 1, 2021 17:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a `Sentence` struct, replace Vec<Token> with Sentence where possible #54

Add a `Sentence` struct, replace Vec<Token> with Sentence where possible #54

bminixhofer commented Mar 25, 2021 •

edited

Loading

bminixhofer commented Mar 27, 2021 •

edited

Loading

drahnr left a comment

bminixhofer commented Mar 28, 2021

Add a Sentence struct, replace Vec<Token> with Sentence where possible #54

Add a Sentence struct, replace Vec<Token> with Sentence where possible #54

Conversation

bminixhofer commented Mar 25, 2021 • edited Loading

bminixhofer commented Mar 27, 2021 • edited Loading

drahnr left a comment

Choose a reason for hiding this comment

bminixhofer commented Mar 28, 2021

Add a `Sentence` struct, replace Vec<Token> with Sentence where possible #54

Add a `Sentence` struct, replace Vec<Token> with Sentence where possible #54

bminixhofer commented Mar 25, 2021 •

edited

Loading

bminixhofer commented Mar 27, 2021 •

edited

Loading