-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expand scope of arxiv identifier matcher, and fix some training data annotations #858
base: master
Are you sure you want to change the base?
Conversation
The simple part of this is allowing 'arxiv:' in addition to 'arXiv:'. The more complex second part is to conservatively match "old" (pre-2008) style identifiers which do not have a prefix. The conservative matching is because there is less confidence that a string is actually an arxiv identifier without the prefix. Explicit collection prefixes are included (for those that existed pre-2008), internal whitespace is not allowed, and the identifier must be separated from other alphabetic strings.
…fiers These old-style arxiv identifiers have no prefix ("arxiv:"), but are unambiguously arxiv.org identifiers.
@bnewbold Thanks a lot for the improvement ! I think there are small issues in the new regex and I propose some changes in the review after some tests, if you could have a look? |
@kermitt2 thanks for reviewing. I don't see any proposed changes or review notes here on github. Maybe I am missing something? |
static public final Pattern arXivPattern = Pattern | ||
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)"); | ||
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing a bit the regex:
- there is a double
??
in the first component which is a typo I think? - better avoid having the captured parenthesis for the sub-term
(math|hep|astro|cond|gr|nucl|quat|stat|ph...)
, we could add a non captured parenthesis -
- I don't understand the
[^a-zA-Z]
before the(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)
, it makeshep-lat/0509026
failing for example as it expects something before the arXiv identifier?
- I don't understand the
Suggested regex:
(ar[xX]iv\s?(\.org)?\s?\:\s?\d{4}\s?\.\s?\d{4,5}(v\d+)?)|(ar[xX]iv\s?(\.org)?\s?\:\s?[ a-zA-Z\-\.]{3,16}\s?/\s?\d{7}(v\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\-bio|q\-fin)[a-zA-Z\-\.]*/\d{7}(v\d+)?)
In java syntax:
(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)
We then have only one captured group for one arXiv identifier.
static public final Pattern arXivPattern = Pattern | ||
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)"); | ||
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py#L12 we have much more categories possible. Like for the sub-categories, we might want to simply have a free range of characters?
e.g. instead of
(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*
having simply:
[a-zA-Z\\-\\.]*
I am a bit worry then about the robustness of the regex for ill-formed input wrt. catastrophic backtracking as we have seen elsewhere. Maybe best alternative would be to enumerate all the possibilities?
Ok sorry I think I did not submit the review, can you see my comments now? |
The current arxiv.org identifier matcher regex requires an "arXiv:" prefix. This misses some un-ambiguous old-style identifiers in short citations, like these examples;
This PR extends the regex to allow these cases, by more conservatively matching the "section" prefix, and not allowing whitespace, to reduce the risk of matching on random garbled text. The above examples are added as citation training data.
I also noticed a handful of existing training citations which were missing
type="arxiv"
on the<idno>
tag, so I added those.There is some possible ambiguity about
arxiv
vs.arXiv
in the TEI data: GROBID output has the capitalized X, while the training data is lower-case. Also, while all (or almost all) the citation training data has these annotations, none of the header<idno>
annotations do. Not sure if it would be helpful to add those.This is related to internetarchive/fatcat#84 and to #275