I'm leaning towards segmenting on all unicode space characters.
Pros of segmenting on other unicode space characters:
Users can't accidentally use a wrong space character. If they did, the unsegmented space would end up inside tokens, inflating the label vocabulary and possibly causing a hard-to-interpret memory error because of the size of the softmax. Even if the model fits in memory, it won't converge and users won't know why.
Cons of segmenting on other unicode space characters:
... I guess it means users can't have special spaces inside tokens. Not sure why you'd want to do that though, and a hyphen or another symbol could always be used instead if it's really desired.
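To make the proposal concrete, here's a minimal sketch of splitting on all Unicode space characters (the function name is illustrative, not existing persephone code):

```python
import re

# On Python 3 str patterns, the regex class \s matches Unicode whitespace
# by default, so NO-BREAK SPACE (U+00A0), EM SPACE (U+2003), etc. are covered.
_UNICODE_SPACES = re.compile(r"\s+")

def segment_on_unicode_spaces(utterance: str) -> list:
    """Split an utterance on runs of any Unicode whitespace."""
    return _UNICODE_SPACES.split(utterance.strip())

print(segment_on_unicode_spaces("ka\u00a0pu lalo"))  # ['ka', 'pu', 'lalo']
```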
Note that regardless of the choice, it's important to account for users who want to explicitly predict spaces (in character prediction). That's probably best done with a flag to segment_into_chars() or something similar, which would generate special tokens that represent spaces, such as underscores, for training and decoding; these would then get removed as a postprocessing step. See the sketch below.
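Something like this is what I have in mind; segment_into_chars() is the existing function, but the tag_spaces flag and the underscore token here are hypothetical:

```python
SPACE_TOKEN = "_"  # hypothetical placeholder token for an explicit space

def segment_into_chars(utterance: str, tag_spaces: bool = False) -> str:
    """Sketch: split an utterance into space-delimited characters.

    When the (hypothetical) tag_spaces flag is True, spaces in the
    input are kept as SPACE_TOKEN so the model can learn to predict
    them; decoding would strip SPACE_TOKEN out as postprocessing.
    """
    chars = []
    for char in utterance.strip():
        if char.isspace():
            if tag_spaces:
                chars.append(SPACE_TOKEN)
        else:
            chars.append(char)
    return " ".join(chars)

print(segment_into_chars("ka pu", tag_spaces=True))   # k a _ p u
print(segment_into_chars("ka pu", tag_spaces=False))  # k a p u
```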
From issue #151 you made a good observation about the behavior of segmentation in Corpus.__init__:
> Corpus.__init__() expects that the text was already segmented with spaces and just complains if there is an inconsistency when labels is passed in
So if we decide to handle all space characters, we will have to address the corpus's expectation that whitespace (currently only ASCII spaces, IIRC) does not appear within labels. This could end up being confusing if it's not consistent with the behavior elsewhere.
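For illustration, the check would need to reject any Unicode whitespace inside a label, not just U+0020 (the function name here is hypothetical, not the actual Corpus code):

```python
def check_labels_whitespace_free(labels):
    """Sketch: reject labels containing any Unicode whitespace.

    str.isspace() is True for NO-BREAK SPACE (U+00A0) and friends,
    so this catches more than an ASCII-space check does.
    """
    for label in labels:
        if any(char.isspace() for char in label):
            raise ValueError(
                "Label {!r} contains whitespace, which conflicts with "
                "space-delimited segmentation.".format(label))
```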
Right now Unicode space characters such as 'NO-BREAK SPACE' (U+00A0) don't get split on.
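A quick demonstration of the gap, assuming segmentation splits on the ASCII space only:

```python
text = "ka\u00a0pu"      # two tokens joined by NO-BREAK SPACE (U+00A0)
print(text.split(" "))   # ['ka\xa0pu']   -- splitting on U+0020 misses it
print(text.split())      # ['ka', 'pu']   -- no-argument split covers all Unicode whitespace
```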
I propose we decide on what behavior we expect here and resolve this in PR #213.
I made a test case in https://github.com/persephone-tools/persephone/pull/213/commits/07898f8e9d9d455127a937f03054dc5fd16fac21 that includes some common space characters. I think the right place to deal with this is here; we could alternatively deal with it in the frontend, but then if we wanted to make that work accessible upstream we would effectively have to duplicate the code.