NFD form combining characters not picked up as part of word #1099

Open
retorquere opened this issue Mar 25, 2024 · 3 comments

Comments

@retorquere

retorquere commented Mar 25, 2024

// escape non-ASCII characters as \uXXXX so combining marks are visible in the output
function show(s) {
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + ("0000" + c.charCodeAt(0).toString(16)).slice(-4))
}
var nlp = require("compromise/one")
var doc = nlp('Poincare\u0301')
for (const term of doc.json({offset:true})[0].terms) {
  console.log(show(JSON.stringify(term, null, 2)))
}

logs

{
  "text": "Poincare",
  "pre": "",
  "post": "\u0301",
  "tags": [],
  "normal": "poincare",
  "index": [
    0,
    0
  ],
  "id": "poincare|002000009",
  "offset": {
    "index": 0,
    "start": 0,
    "length": 8
  }
}

normalizing to NFC does work, but not every combining character combination has a precomposed NFC form (e.g. 'Poincare\u0301 E\u0300\u0304'.normalize('NFC'))
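
For example, reusing the show helper above (the exact escapes here are just to illustrate the point):

console.log(show('Poincare\u0301 E\u0300\u0304'.normalize('NFC')))
// logs: Poincar\u00e9 \u00c8\u0304
// e + \u0301 composes to \u00e9, but grave-then-macron has no precomposed
// form, so a bare combining \u0304 is still left in the NFC output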

@spencermountain
Owner

hey, good catch! Yeah, I agree that compromise should not tokenize these inline unicode forms. happy to add a guard for this, in the next release.
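
Rough idea, just a sketch (not how the tokenizer is actually wired up): widening the word-character class to include Unicode combining marks (\p{M}) would keep the accent glued to the word:

// hypothetical sketch only, not compromise's real tokenizer:
// treat combining marks (\p{M}) as word characters so the accent in
// 'Poincare\u0301' stays with the token instead of landing in post
const tokenize = str => str.match(/[\p{L}\p{M}\p{N}'-]+/gu) || []
console.log(tokenize('Poincare\u0301 was a mathematician'))
// -> [ 'Poincaré', 'was', 'a', 'mathematician' ]  (still in NFD form)
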
cheers

@spencermountain
Owner

hey, just double-checking something: the \u0301 in your example Poincare\u0301 seems to be a punctuation symbol '́', which arguably should be considered non-word whitespace, maybe.

Can you generate an example where the NFD character is more word-like? I agree it rubs up against the JavaScript normalize feature, and maybe our supporting it would just complicate things.
lemme know,
cheers

@spencermountain added hmmm and removed yesss labels Apr 1, 2024
@retorquere
Author

It's just the Combining Acute Accent:

const show = obj => JSON.stringify(obj, null, 2).replace(/[\u007F-\uFFFF]/g, chr => `\\u${(`0000${chr.charCodeAt(0).toString(16)}`).substr(-4)}`)
console.log(show(`e\u0301`.normalize('NFC')))

shows

"\u00e9"

it's easy enough to normalize the input before passing it into tokenization, but that then becomes a design constraint, and as mentioned, some combining sequences have no single-character NFC form.
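
For now the workaround looks something like this (a sketch; it just shifts the normalization onto the caller):

var nlp = require('compromise/one')
// normalize to NFC before tokenizing; as noted above this works when a
// precomposed form exists, but bare combining marks can still survive NFC
var doc = nlp('Poincare\u0301'.normalize('NFC'))
console.log(doc.json({offset: true})[0].terms[0].text) // 'Poincaré'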
