Segmentation of characters with unicode space characters #214

Open
shuttle1987 opened this issue Dec 27, 2018 · 2 comments
@shuttle1987 (Member)

Right now Unicode space characters such as 'NO-BREAK SPACE' (U+00A0) don't get split on.

I propose we decide on what behavior we expect here and resolve this in PR #213.

I made a test case in https://github.com/persephone-tools/persephone/pull/213/commits/07898f8e9d9d455127a937f03054dc5fd16fac21 that has some common space characters. I think the right place to deal with this is here; we could alternatively deal with it in the frontend, but then, if we wanted to make that work accessible upstream, we would effectively have to duplicate the code.
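For reference, Python's built-in `str.split()` already distinguishes these two behaviors: called with no argument it splits on all Unicode whitespace (including NO-BREAK SPACE), whereas `split(" ")` only splits on U+0020. A minimal demonstration:

```python
# NO-BREAK SPACE (U+00A0) inside a string
nbsp_text = "kia\u00a0ora"

tokens_ascii = nbsp_text.split(" ")  # NBSP is NOT treated as a delimiter
tokens_unicode = nbsp_text.split()   # no argument: splits on all Unicode whitespace

print(tokens_ascii)    # ['kia\xa0ora']
print(tokens_unicode)  # ['kia', 'ora']
```

So if the decision is to segment on all Unicode space characters, switching any `split(" ")` calls to a bare `split()` would be one low-cost way to get that behavior.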

@oadams (Collaborator)

oadams commented Dec 29, 2018

I'm leaning towards segmenting on all unicode space characters.

Pros of segmenting on other unicode space characters:

  • Users can't accidentally use a wrong space character, which could otherwise lead to a hard-to-interpret memory error because of the resulting size of the softmax. Even if memory holds, the model won't converge and the user won't know why.

Cons of segmenting on other unicode space characters:

  • ... I guess it means users can't have special spaces inside tokens. It's not clear why you'd want that, though, and a hyphen or another symbol could always be used instead if it's really desired.

Note that regardless of the choice, it's important that users who want to explicitly predict spaces (in character prediction) are accounted for. This is probably best done with a flag to segment_into_chars() or something similar, which would generate special tokens that represent spaces, such as underscores, for training and decoding. These would then get removed as a postprocessing step.
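The flag idea above might look something like the following sketch. This is hypothetical, not the actual persephone API: the `predict_spaces` parameter, the `SPACE_TOKEN` name, and `remove_space_tokens` are all assumptions for illustration.

```python
SPACE_TOKEN = "_"  # hypothetical special token standing in for a space

def segment_into_chars(utterance: str, predict_spaces: bool = False) -> str:
    """Segment an utterance into space-delimited character tokens.

    With predict_spaces=True, each run of (Unicode) whitespace in the
    input becomes a SPACE_TOKEN so the model can predict spaces
    explicitly; otherwise whitespace is simply discarded.
    """
    chars = []
    for word in utterance.split():  # bare split(): all Unicode whitespace
        if chars and predict_spaces:
            chars.append(SPACE_TOKEN)
        chars.extend(word)
    return " ".join(chars)

def remove_space_tokens(hypothesis: str) -> str:
    """Postprocess a decoded hypothesis: SPACE_TOKENs become real spaces."""
    words = hypothesis.split(" " + SPACE_TOKEN + " ")
    return " ".join("".join(w.split()) for w in words)
```

For example, `segment_into_chars("kia ora", predict_spaces=True)` would give `"k i a _ o r a"`, and `remove_space_tokens` maps that decoded form back to `"kia ora"`.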

@shuttle1987 (Member, Author)

From issue #151 you made a good observation about the behavior of segmentation in Corpus.__init__:

Corpus.__init__() expects that the text was already segmented with spaces and just complains if there is an inconsistency when labels is passed in

So if we decide to allow space characters within labels, we will have to address the fact that the corpus currently expects whitespace (only ASCII spaces, IIRC) not to appear inside labels. This could end up being confusing if it isn't consistent with the behavior elsewhere.
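One way to keep that expectation consistent would be an explicit validation step when labels are passed in, so the failure is a clear error rather than a confusing inconsistency later. A hypothetical sketch (the helper name is invented, not persephone code):

```python
def check_label_set(labels):
    """Reject label sets containing whitespace characters.

    Hypothetical validation sketch: since whitespace (any Unicode
    whitespace, not just U+0020) acts as the segmentation delimiter,
    individual labels must not contain it.
    """
    bad = sorted(lab for lab in labels
                 if any(ch.isspace() for ch in lab))
    if bad:
        raise ValueError(
            "Labels may not contain whitespace: {!r}".format(bad))
```

Note that `str.isspace()` is true for NO-BREAK SPACE as well, so this would catch the U+00A0 case from the original report too.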
