NFD form combining characters not picked up as part of word #1099

retorquere · 2024-03-25T21:56:27Z

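// Escape non-ASCII characters as \uXXXX so combining marks are visible in the log.
function show(s) {
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + ("0000" + c.charCodeAt(0).toString(16)).slice(-4))
}
var nlp = require("compromise/one")
// "e" followed by U+0301 COMBINING ACUTE ACCENT, i.e. the NFD form of "é"
var doc = nlp('Poincare\u0301')
for (const term of doc.json({offset:true})[0].terms) {
  console.log(show(JSON.stringify(term, null, 2)))
}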

logs

{
  "text": "Poincare",
  "pre": "",
  "post": "\u0301",
  "tags": [],
  "normal": "poincare",
  "index": [
    0,
    0
  ],
  "id": "poincare|002000009",
  "offset": {
    "index": 0,
    "start": 0,
    "length": 8
  }
}

Normalizing to NFC does work for this input, but not every combining-character sequence has a precomposed NFC form (e.g. 'Poincare\u0301 E\u0300\u0304'.normalize('NFC') still leaves a combining macron after the È).
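To make that caveat concrete, here is a small check that needs nothing beyond the built-in String.prototype.normalize. The helper name escapeNonAscii is made up for this illustration and is not part of compromise.

```javascript
// Hypothetical helper: escape non-ASCII UTF-16 code units so combining marks show up in output.
function escapeNonAscii(s) {
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + c.charCodeAt(0).toString(16).padStart(4, "0"))
}

// "e" + U+0301 has a precomposed form, so NFC folds it into a single character.
const composable = "Poincare\u0301".normalize("NFC")
// "E" + U+0300 composes to U+00C8, but there is no precomposed "È with macron",
// so U+0304 survives as a separate combining character.
const leftover = "E\u0300\u0304".normalize("NFC")

console.log(escapeNonAscii(composable)) // Poincar\u00e9
console.log(escapeNonAscii(leftover))   // \u00c8\u0304
```

So even after NFC, a tokenizer can still run into combining marks in the middle of a word.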

spencermountain · 2024-03-28T17:05:12Z

hey, good catch! Yeah, I agree that compromise should not tokenize these inline unicode forms. happy to add a guard for this, in the next release.
cheers

spencermountain · 2024-04-01T16:51:00Z

hey, just double-checking something, your example Poincare\u0301 seems to be a punctuation symbol '́' - which arguably should be considered non-word whitespace maybe.

Can you generate an example where the NFD character is more word-like? I agree it rubs up against the javascript normalize feature, and maybe our supporting it would just complicate things.
lemme know,
cheers

retorquere · 2024-04-01T18:10:36Z

It's just the Combining Acute Accent:

const show = obj => JSON.stringify(obj, null, 2).replace(/[\u007F-\uFFFF]/g, chr => `\\u${(`0000${chr.charCodeAt(0).toString(16)}`).substr(-4)}`)
console.log(show(`e\u0301`.normalize('NFC')))

shows

"\u00e9"
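As a quick sanity check on the punctuation-vs-accent question, JavaScript's Unicode property escapes can report which general category U+0301 falls into. This is only a sketch using standard regex features, it says nothing about how compromise classifies characters internally.

```javascript
const accent = "\u0301" // COMBINING ACUTE ACCENT

const isMark = /^\p{M}$/u.test(accent)        // any combining mark category (Mn, Mc, Me)
const isPunctuation = /^\p{P}$/u.test(accent) // any punctuation category
const isWhitespace = /^\s$/u.test(accent)     // whitespace as defined by the regex engine

console.log({ isMark, isPunctuation, isWhitespace })
// { isMark: true, isPunctuation: false, isWhitespace: false }
```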

it's easy enough to normalize the input before passing it into tokenization, but that would then be a design constraint, and as mentioned, there are combining characters that have no single-char NFC form.
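For illustration, one way a word matcher could keep combining marks attached without relying on normalization is to allow \p{M} after the first letter. The function wordsWithMarks below is hypothetical and is not compromise's tokenizer, it is just a minimal sketch of the idea.

```javascript
// Hypothetical helper, not part of compromise: a word-like run is a letter
// followed by any number of further letters or combining marks.
function wordsWithMarks(text) {
  return text.match(/\p{L}[\p{L}\p{M}]*/gu) || []
}

const words = wordsWithMarks("Poincare\u0301 E\u0300\u0304")

console.log(words.length)    // 2
console.log(words[0].length) // 9, the accent stays attached to "Poincare"
console.log(words[1].length) // 3, "E" plus both combining marks
```

This handles both the NFC-composable case and the sequence that has no single-char NFC form, at the cost of treating every mark that follows a letter as part of the word.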

spencermountain added enhancement yesss labels Mar 28, 2024

spencermountain added hmmm and removed yesss labels Apr 1, 2024
