Add a `Sentence` struct, replace `Vec<Token>` with `Sentence` where possible #54
…completeSentence for Vec<IncompleteToken>
This is ready for a review @drahnr if you have the time. It turned out more substantial than I initially thought. In addition to the things mentioned above:
There's a big diff because of many trivial changes; most of the interesting ones are in the …
That looks like it was quite a bit of work, so much appreciated!
A few nitpicks which are intertwined:
`.span().char().start()` does add quite a bit of visual clutter. I would recommend going for `fn char_range() -> Range<usize> { .. }`, which allows direct modification of the inner types, so `.char_range().start` could be used instead, avoiding 2x `()` all over the place.
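A minimal sketch of the accessor suggested above. The `Span` and `Token` types here are hypothetical stand-ins, not the crate's actual definitions, and the accessor returns `&Range<usize>` (an assumption, chosen because `Range` is not `Copy`) so that `.start` and `.end` are plain field reads:

```rust
use std::ops::Range;

/// Hypothetical stand-in for the span type discussed in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    bytes: Range<usize>,
    chars: Range<usize>,
}

impl Span {
    /// Borrow the byte range; callers read `.start` / `.end` as fields.
    fn byte_range(&self) -> &Range<usize> {
        &self.bytes
    }

    /// Borrow the char range; callers read `.start` / `.end` as fields.
    fn char_range(&self) -> &Range<usize> {
        &self.chars
    }
}

/// Hypothetical token that only carries its span.
struct Token {
    span: Span,
}

impl Token {
    fn span(&self) -> &Span {
        &self.span
    }
}

fn main() {
    let token = Token {
        span: Span { bytes: 0..5, chars: 0..4 },
    };

    // One pair of parentheses fewer per access than `.span().char().start()`.
    let start = token.span().char_range().start;
    let len = token.span().char_range().len();

    assert_eq!(start, 0);
    assert_eq!(len, 4);
    assert_eq!(token.span().byte_range().end, 5);
}
```

Returning a reference also gives callers everything `Range` already offers (`len()`, `contains()`, `is_empty()`) without extra wrapper methods.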
I think a few redundant calls to `.chars().count()` could (and eventually should) be avoided.
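To illustrate the kind of saving meant here, a sketch under assumptions: both function names are hypothetical and not taken from the PR. The first re-counts chars from the start of the text for every range; the second keeps a running char position and counts each byte only once. It assumes the byte ranges are sorted, non-overlapping, and fall on char boundaries.

```rust
use std::ops::Range;

/// Naive variant: `.chars().count()` rescans the prefix for every range.
fn char_ranges_naive(text: &str, byte_ranges: &[Range<usize>]) -> Vec<Range<usize>> {
    byte_ranges
        .iter()
        .map(|r| {
            let start = text[..r.start].chars().count();
            let end = start + text[r.start..r.end].chars().count();
            start..end
        })
        .collect()
}

/// Single-pass variant: carries the char position forward between ranges.
/// Assumes `byte_ranges` is sorted, non-overlapping, and on char boundaries.
fn char_ranges_single_pass(text: &str, byte_ranges: &[Range<usize>]) -> Vec<Range<usize>> {
    let mut out = Vec::with_capacity(byte_ranges.len());
    let mut byte_pos = 0; // bytes consumed so far
    let mut char_pos = 0; // chars consumed so far

    for r in byte_ranges {
        // Count only the gap since the previous range, not the whole prefix.
        char_pos += text[byte_pos..r.start].chars().count();
        let start = char_pos;
        char_pos += text[r.start..r.end].chars().count();
        byte_pos = r.end;
        out.push(start..char_pos);
    }
    out
}

fn main() {
    // "naïve café": the two accented letters take two bytes each in UTF-8.
    let text = "na\u{ef}ve caf\u{e9}";
    let byte_ranges = vec![0..6, 7..12];

    let expected = vec![0..5, 6..10];
    assert_eq!(char_ranges_naive(text, &byte_ranges), expected);
    assert_eq!(char_ranges_single_pass(text, &byte_ranges), expected);
}
```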
But again, just nits 👍
Thanks a lot for the review! I'm a bit conflicted regarding
so I think I'll keep it as it is. I'll look over this PR once more tomorrow and merge it then. Also, if you haven't seen it already, you might be interested in #50 (comment), an update regarding the spellchecking discussion.
…ble (bminixhofer#54)
* replace Vec<Token> with new Sentence struct where possible (+ with IncompleteSentence for Vec<IncompleteToken>)
* separate match sentence and match graph, reduce dependents on tokenizer
* fix missing SENT_START special case, debug impls for WordId, PosId
* make MatchSentence private, docs
* use new Span struct for byte and char ranges
* fix PartialOrd impl on Position, get_token_str -> get_token_ranges
This PR shifts the focus from `Vec<Token>` to a new `Sentence` struct and moves the fields on the `Token` which semantically belonged to the sentence (`sentence` and `tagger`) to the sentence. Includes some general improvements like:
- … `Sentence` / `IncompleteSentence` instead of a `Vec<Vec<Token>>` / `Vec<Vec<IncompleteToken>>`.
- … (`byte_span` / `char_span`) by making them relative to the input text (as opposed to relative to each sentence, as is currently the case) and making them a `Range` instead of `(usize, usize)`.
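A rough sketch of what spans relative to the input text can look like. All names here (`Span`, `Token`, `Sentence`, `sentence_at`) and the whitespace tokenization are illustrative assumptions, not the crate's actual types or tokenizer. The point is that a token's byte range can index the original input directly, even for a sentence that starts mid-text:

```rust
use std::ops::Range;

/// Hypothetical span: byte and char ranges, both relative to the full input.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    bytes: Range<usize>,
    chars: Range<usize>,
}

/// Hypothetical token borrowing its text from the input.
struct Token<'t> {
    text: &'t str,
    span: Span,
}

/// Hypothetical sentence: owns its tokens and the shared sentence text.
struct Sentence<'t> {
    text: &'t str,
    tokens: Vec<Token<'t>>,
}

/// Splits `input[byte_range]` on whitespace (a simplification) and records
/// every token span relative to `input`, not to the sentence.
fn sentence_at<'t>(input: &'t str, byte_range: Range<usize>) -> Sentence<'t> {
    let text = &input[byte_range.clone()];
    // Char index at which the sentence starts within the input.
    let mut char_pos = input[..byte_range.start].chars().count();
    let mut tokens = Vec::new();
    // (byte, char) start of the word currently being read, if any.
    let mut word_start: Option<(usize, usize)> = None;

    for (i, c) in text.char_indices() {
        let byte_pos = byte_range.start + i;
        if c.is_whitespace() {
            if let Some((b, ch)) = word_start.take() {
                tokens.push(Token {
                    text: &input[b..byte_pos],
                    span: Span { bytes: b..byte_pos, chars: ch..char_pos },
                });
            }
        } else if word_start.is_none() {
            word_start = Some((byte_pos, char_pos));
        }
        char_pos += 1;
    }
    // Flush a word that runs to the end of the sentence.
    if let Some((b, ch)) = word_start {
        tokens.push(Token {
            text: &input[b..byte_range.end],
            span: Span { bytes: b..byte_range.end, chars: ch..char_pos },
        });
    }
    Sentence { text, tokens }
}

fn main() {
    let input = "Hi there. Caf\u{e9} time.";
    // The second sentence starts at byte 10 of the input.
    let sentence = sentence_at(input, 10..21);

    let first = &sentence.tokens[0];
    assert_eq!(first.text, "Caf\u{e9}");
    // The byte range slices the original input with no sentence offset.
    assert_eq!(&input[first.span.bytes.clone()], first.text);
    assert_eq!(first.span.chars, 10..14);
    assert_eq!(sentence.text, "Caf\u{e9} time.");
}
```

With spans relative to each sentence instead, every consumer would have to carry the sentence offset around to map a token back into the input.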
Future work in this direction would include rethinking the distinction between `IncompleteSentence` / `Sentence` and possibly unifying them or removing the need for one of them, but that's out of scope here.