Issue with sentence segmentation offsets #753
Comments
I haven't tested but I think this could be fixed by PR #701
Just tested... PR #701 does not fix it unfortunately, same error.
I've added some tests and the output of the method. I did some debugging to understand where the issue is, and IMHO it is at line ~244 of SentenceUtilities, where the upper bound is increased.
In this specific case, the problem seems to lie in the synchronisation between layout tokens and sentences.
I'm revisiting this issue and the somewhat related PR #701. The comment states that we need to compose the "text" without the forbidden elements (references). However, IMHO we should keep these references in the text as well, run the segmentation and then remove them, shouldn't we? I also notice that while the layout tokens contain all the tokens (including references), the text is a mixture:
(removing comment, it was more for #811!) but it's relevant to the fact that we don't remove references, we just keep track of their positions. The text at this stage is not modified after segmentation. The rest of the method re-injects the tags into the segmented text, but doesn't touch the text.
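To illustrate the idea of keeping references in the text and only tracking their positions, here is a minimal sketch. It is not the code in SentenceUtilities: the class `ForbiddenSpanMerge` and its methods are hypothetical, and it assumes both sentences and forbidden spans are `{start, end}` character offsets (end exclusive) over the same unmodified text. Any sentence boundary that falls strictly inside a forbidden span is dropped by merging the two neighbouring sentences.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): merges sentences split inside a forbidden span. */
final class ForbiddenSpanMerge {

    /** True when pos lies strictly inside one of the forbidden {start, end} spans. */
    static boolean insideForbidden(int pos, List<int[]> forbidden) {
        for (int[] span : forbidden) {
            if (pos > span[0] && pos < span[1]) {
                return true;
            }
        }
        return false;
    }

    /**
     * Sentences must be sorted by start offset. When the previous sentence ends
     * inside a forbidden span, the current sentence is appended to it.
     */
    static List<int[]> merge(List<int[]> sentences, List<int[]> forbidden) {
        List<int[]> result = new ArrayList<>();
        for (int[] sentence : sentences) {
            int[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && insideForbidden(last[1], forbidden)) {
                last[1] = sentence[1]; // extend the previous sentence over this one
            } else {
                result.add(new int[] {sentence[0], sentence[1]}); // copy, keep input untouched
            }
        }
        return result;
    }
}
```

For the text "See Smith et al. 2020 for details. Next." with a forbidden span `{4, 21}` over the reference, a segmenter that wrongly splits after "al." would produce `{0, 16}`, `{17, 34}`, `{35, 40}`; the merge gives `{0, 34}`, `{35, 40}`. The text itself is never changed, so the offsets stay valid for the layout tokens.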
Ok, so the forbidden spans and the text are in sync. I've pushed a test (currently ignored) that reproduces the issue. The problem seems to originate at this point: grobid/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java Line 234 in cdb52ad
The layout token inspection ends up out of sync with the sentences, as you mentioned before. The layout token at index 25 has "superscript" = true (I set it explicitly in the test; without it, it works) and is causing the chain reaction. However, it is inspected only when we reach the sentence with index = 4, the one that appears with incorrect positions. The following is unrelated.
Since I did some swimming in this part of the code, I've checked again with a fresh mind. It seems that the footnote
In the following example
https://arxiv.org/pdf/2103.12028v1.pdf
there are cases of wrong sentence segmentation, with sentence offsets apparently shifted by a few characters, resulting in words being cut. This happens whichever sentence segmenter is selected, OpenNLP or Pragmatic Segmenter:
As it happens with both segmenters, which use different offset calculation methods, it might be due to issues with character encoding.
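A quick way to spot this kind of shift automatically is to check whether a sentence starts or ends in the middle of a word. The sketch below is only an illustration: `SentenceOffsetCheck` and its methods are hypothetical names, not part of GROBID, and it assumes sentence spans are `{start, end}` character offsets (end exclusive) into the paragraph text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): flags sentence spans whose boundaries cut a word. */
final class SentenceOffsetCheck {

    /** True when pos sits between two letter-or-digit characters, i.e. inside a word. */
    static boolean splitsWord(String text, int pos) {
        if (pos <= 0 || pos >= text.length()) {
            return false; // the text boundaries never cut a word
        }
        return Character.isLetterOrDigit(text.charAt(pos - 1))
                && Character.isLetterOrDigit(text.charAt(pos));
    }

    /** Returns the indices of the spans whose start or end offset falls inside a word. */
    static List<Integer> findCutSentences(String text, List<int[]> spans) {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            int start = spans.get(i)[0];
            int end = spans.get(i)[1];
            if (splitsWord(text, start) || splitsWord(text, end)) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }
}
```

With the text "First sentence. Second one.", the spans `{0, 15}` and `{16, 27}` pass the check, while a second span shifted to `{18, 27}` is reported because it starts inside "Second". A check like this could be used in a test to assert that no sentence boundary cuts a word, independently of the segmenter used.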