Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Sentence struct, replace Vec<Token> with Sentence where possible #54

Merged
merged 11 commits into from
Mar 30, 2021

Conversation

bminixhofer
Copy link
Owner

@bminixhofer bminixhofer commented Mar 25, 2021

This PR shifts the focus from Vec<Token> to a new Sentence struct and moves the fields on the Token which semantically belonged to the sentence (sentence and tagger) to the sentence. Includes some general improvements like:

  • Returning an iterator over Sentence / IncompleteSentence instead of a Vec<Vec<Token>> / Vec<Vec<IncompleteToken>>.
  • Fixing indices on the token (byte_span / char_span) by making them relative to the input text (as opposed to relative to each sentence, as currently the case) and making them a Range instead of (usize, usize).

Future work in this direction would include rethinking the distinction between IncompleteSentence / Sentence and possibly unifying them or removing the need for one of them, but that's out of scope here.

@bminixhofer bminixhofer marked this pull request as ready for review March 27, 2021 14:40
@bminixhofer
Copy link
Owner Author

bminixhofer commented Mar 27, 2021

This is ready for a review @drahnr if you have the time. It turned out more substantial than I initially thought. In addition to the things mentioned above:

  • there is now Span and Position structs which track the byte and char indices at the same time. (so instead of token.char_span() you now write token.span().char()).
  • fields on the Suggestion, IncompleteToken and Token are now private and have getters.
  • there is a private new MatchSentence struct which represents a sentence prepared for matching on the rules. So the special SENT_START token is not visible in the public API anymore and each token in the Sentence and IncompleteSentence corresponds to at least one character in the input text.

There's a big diff because of many trivial changes, most of the interesting ones are in the tokenizer.rs and types.rs. Again, I'd especially appreciate feedback regarding the changes and additions to the public API. Thanks!

Copy link
Contributor

@drahnr drahnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like it was quite a bit of work, so much appreciated!

A few nitpicks which are intertwined:
.span().char().start() does add quite a bit of visual clutter, I would recommend to go for fn char_range() -> Range<usize> { .. } which allows direct modification of the inner types, so .char_range().start could be used instead and avoid 2x () all over the place.

I think a few redundant calls to .chars().count() could (and eventually should) be avoided.

But again, just nits 👍

build/src/lib.rs Show resolved Hide resolved
nlprule/src/rule/engine/composition.rs Show resolved Hide resolved
nlprule/src/rule/engine/mod.rs Outdated Show resolved Hide resolved
nlprule/src/rule/grammar.rs Outdated Show resolved Hide resolved
nlprule/src/tokenizer.rs Outdated Show resolved Hide resolved
nlprule/src/tokenizer.rs Show resolved Hide resolved
nlprule/src/tokenizer.rs Show resolved Hide resolved
nlprule/src/tokenizer.rs Show resolved Hide resolved
@bminixhofer
Copy link
Owner Author

Thanks a lot for the review!

I'm a bit conflicted regarding .char_range and .byte_range. Adding them would allow us to make .span pub(crate) which avoids the naming problem but it means:

  • trying to enforce that .char_range and byte_range are never used internally (internally, .span() should be used since it allows e.g. getting the starting byte and char index with .span().start()).
  • Code duplication by adding a char_range and byte_range method to every struct that has a Span (Suggestion, Sentence, IncompleteSentence, IncompleteToken, Token).
  • It makes some things which are easy to do with the Span API harder e.g. getting the start byte and char index as Position is .span().start() but would need .char_range().start and .byte_range().start otherwise. This is then easy to pass to e.g. rshift on the Suggestion which needs a Position.

so I think I'll keep it as it is. I'll look over this PR once more tomorrow and merge it then. Also if you haven't seen it already you might be interested in #50 (comment), an update regarding the spellchecking discussion.

@bminixhofer bminixhofer linked an issue Mar 28, 2021 that may be closed by this pull request
@bminixhofer bminixhofer merged commit 17fe2b6 into main Mar 30, 2021
@bminixhofer bminixhofer deleted the sentence branch April 1, 2021 17:35
drahnr pushed a commit to drahnr/nlprule that referenced this pull request Apr 7, 2021
…ble (bminixhofer#54)

* replace Vec<Token> with new Sentence struct where possible (+ with IncompleteSentence for Vec<IncompleteToken>)

* separate match sentence and match graph, reduce dependents on tokenizer

* fix missing SENT_START special case, debug impls for WordId, PosId

* make MatchSentence private, docs

* use new Span struct for byte and char ranges

* fix PartialOrd impl on Position, get_token_str -> get_token_ranges
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Token as returned by pipe() is relative to the sentence boundaries
2 participants