```js
function show(s) {
  // escape non-ASCII characters as \uXXXX so combining marks are visible in the output
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + ("0000" + c.charCodeAt(0).toString(16)).slice(-4))
}

const nlp = require("compromise/one")
const doc = nlp('Poincare\u0301')
for (const term of doc.json({ offset: true })[0].terms) {
  console.log(show(JSON.stringify(term, null, 2)))
}
```
hey, good catch! Yeah, I agree that compromise should not tokenize these inline unicode forms. Happy to add a guard for this in the next release.
cheers
hey, just double-checking something, your example Poincare\u0301 seems to end in a punctuation-like symbol '́' - which arguably should be considered non-word whitespace, maybe.
Can you generate an example where the NFD character is more word-like? I agree it rubs up against the javascript normalize feature, and maybe our supporting it would just complicate things.
lemme know,
cheers
it's easy enough to normalize the input before passing it into tokenization, but that would then be a design constraint, and as mentioned, there are combining characters that have no single-char NFC form.
normalizing to NFC does work, but not every combining char combination has an NFC form, eg:

```js
'Poincare\u0301 E\u0300\u0304'.normalize('NFC')
```