Segmentation of characters with unicode space characters #214

Open
shuttle1987 opened this issue Dec 27, 2018 · 2 comments
@shuttle1987 (Member)

Right now Unicode space characters such as 'NO-BREAK SPACE' (U+00A0) don't get split on.

I propose we decide on what behavior we expect here and resolve this in PR #213.

I made a test case in https://github.com/persephone-tools/persephone/pull/213/commits/07898f8e9d9d455127a937f03054dc5fd16fac21 that has some common space characters. I think the right place to deal with this is here; we could alternatively deal with it in the frontend, but then, if we wanted to make that work accessible upstream, we would effectively have to duplicate the code.
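For reference, Python's built-in `str.split()` already distinguishes these two behaviors: called with no argument it splits on all Unicode whitespace (including NO-BREAK SPACE), whereas `split(" ")` only splits on U+0020. A minimal demonstration:

```python
# NO-BREAK SPACE (U+00A0) inside a string
nbsp_text = "kia\u00a0ora"

tokens_ascii = nbsp_text.split(" ")  # NBSP is NOT treated as a delimiter
tokens_unicode = nbsp_text.split()   # no argument: splits on all Unicode whitespace

print(tokens_ascii)    # ['kia\xa0ora']
print(tokens_unicode)  # ['kia', 'ora']
```

So if the decision is to segment on all Unicode space characters, switching any `split(" ")` calls to a bare `split()` would be one low-cost way to get that behavior.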

@oadams (Collaborator)

oadams commented Dec 29, 2018

I'm leaning towards segmenting on all unicode space characters.

Pros of segmenting on other unicode space characters:

  • Users can't accidentally use a wrong space character, which could otherwise lead to a hard-to-interpret memory error because of the resulting size of the softmax. Even if memory holds, the model won't converge and the user won't know why.

Cons of segmenting on other unicode space characters:

  • ... I guess it means users can't have special spaces inside tokens. It's not clear why you'd want that, though, and a hyphen or another symbol could always be used instead if it's really desired.

Note that regardless of the choice, it's important that users who want to explicitly predict spaces (in character prediction) are accounted for. This is probably best done with a flag to segment_into_chars() or something similar, which would generate special tokens that represent spaces, such as underscores, for training and decoding. These would then get removed as a postprocessing step.
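The flag idea above might look something like the following sketch. This is hypothetical, not the actual persephone API: the `predict_spaces` parameter, the `SPACE_TOKEN` name, and `remove_space_tokens` are all assumptions for illustration.

```python
SPACE_TOKEN = "_"  # hypothetical special token standing in for a space

def segment_into_chars(utterance: str, predict_spaces: bool = False) -> str:
    """Segment an utterance into space-delimited character tokens.

    With predict_spaces=True, each run of (Unicode) whitespace in the
    input becomes a SPACE_TOKEN so the model can predict spaces
    explicitly; otherwise whitespace is simply discarded.
    """
    chars = []
    for word in utterance.split():  # bare split(): all Unicode whitespace
        if chars and predict_spaces:
            chars.append(SPACE_TOKEN)
        chars.extend(word)
    return " ".join(chars)

def remove_space_tokens(hypothesis: str) -> str:
    """Postprocess a decoded hypothesis: SPACE_TOKENs become real spaces."""
    words = hypothesis.split(" " + SPACE_TOKEN + " ")
    return " ".join("".join(w.split()) for w in words)
```

For example, `segment_into_chars("kia ora", predict_spaces=True)` would give `"k i a _ o r a"`, and `remove_space_tokens` maps that decoded form back to `"kia ora"`.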

@shuttle1987 (Member, Author)

From issue #151 you made a good observation about the behavior of segmentation in Corpus.__init__:

Corpus.__init__() expects that the text was already segmented with spaces and just complains if there is an inconsistency when labels is passed in

So if we decide to allow space characters within labels, we will have to address the fact that the corpus currently expects whitespace (only ASCII spaces, IIRC) not to appear inside labels. This could end up being confusing if it isn't consistent with the behavior elsewhere.
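One way to keep that expectation consistent would be an explicit validation step when labels are passed in, so the failure is a clear error rather than a confusing inconsistency later. A hypothetical sketch (the helper name is invented, not persephone code):

```python
def check_label_set(labels):
    """Reject label sets containing whitespace characters.

    Hypothetical validation sketch: since whitespace (any Unicode
    whitespace, not just U+0020) acts as the segmentation delimiter,
    individual labels must not contain it.
    """
    bad = sorted(lab for lab in labels
                 if any(ch.isspace() for ch in lab))
    if bad:
        raise ValueError(
            "Labels may not contain whitespace: {!r}".format(bad))
```

Note that `str.isspace()` is true for NO-BREAK SPACE as well, so this would catch the U+00A0 case from the original report too.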
