** Why are these changes being introduced:
A certain percentage of our search traffic is made up of formal
citations to existing works, in a variety of formats. It would be good
to have a detector that identifies these consistently and extracts them
for further work (reconciliation, re-formatting, etc.).
** Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/tco-97
TCO-96 and TCO-95 also get some benefit from this.
** How does this address that need:
This adds a new Detector::Citation class, with attendant changes to the
seeds and test fixtures. This is a different type of detector than we've
written in the past, using a multi-layer approach. A first routine
compiles a set of small, discrete signals using regexes and counts. A
second routine then assesses those signals and calculates a final score.
Terms which score high enough can have a Detection registered using our
usual workflow.
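
As a rough illustration of the two layers described above, here is a
minimal sketch. Every name and value in it (gather_signals,
citation_score, REQUIRED_SCORE, and the individual checks) is a
hypothetical stand-in, not the actual Detector::Citation code.

```ruby
# Hypothetical names and values throughout; not the real detector.
REQUIRED_SCORE = 3

# First layer: gather small, independent signals about the term.
def gather_signals(term)
  {
    has_year: term.match?(/\(\d{4}\)/),
    has_quoted_title: term.match?(/"[^"]+"/),
    many_commas: term.count(',') >= 2,
    many_words: term.split.length >= 8
  }
end

# Second layer: reduce the signals to a single score.
def citation_score(signals)
  signals.values.count(true)
end

# A term qualifies for a Detection only if it scores high enough.
def citation?(term)
  citation_score(gather_signals(term)) >= REQUIRED_SCORE
end
```

Keeping the layers separate means the signals can be inspected or
re-weighted without touching the final decision.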
The smaller discrete signals were designed after looking over examples
of five different citation formats: MLA, APA, Chicago, Turabian, and
IEEE. Examples of these patterns include formats for volume, issue, page
ranges, quoted titles, and name formatting. These are implemented using
regular expressions.
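
To make the pattern signals more concrete, a sketch along these lines
might look like the following. The constant name, the method name, and
the regexes themselves are assumptions written for illustration; the
patterns in the actual class may differ.

```ruby
# Hypothetical patterns, for illustration only.
CITATION_PATTERNS = {
  volume_issue: /\d+\(\d+\)/,                 # e.g. 12(3)
  pages: /\bpp?\.\s*\d+(?:\s*-\s*\d+)?/i,     # e.g. pp. 45-67
  quoted_title: /"[^"]+"/,                    # e.g. "A Study of Things."
  year_parens: /\(\d{4}\)/                    # e.g. (2019)
}.freeze

# Returns only the patterns that matched, with every match found.
def pattern_hits(term)
  CITATION_PATTERNS.transform_values { |regex| term.scan(regex) }
                   .reject { |_name, hits| hits.empty? }
end
```

A plain keyword search such as "climate change" would return an empty
hash here, while a full APA-style reference would return several keys.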
A second set of discrete signals is generated using counts, by looking
at how many characters, words, and specific symbols are found in the
search string (commas, periods, and other potential separators). Each of
these counts is compared to a threshold value, and every count that
meets its threshold raises the citation score.
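
The count-based signals could be sketched roughly as below. The
threshold values are placeholders chosen for the example (the real ones
still need verifying against search data), and COUNT_THRESHOLDS,
count_signals, and count_score are hypothetical names.

```ruby
# Placeholder thresholds; not the values used by the real detector.
COUNT_THRESHOLDS = {
  characters: 50,
  words: 8,
  commas: 2,
  periods: 2
}.freeze

# Measure the raw counts for a search term.
def count_signals(term)
  {
    characters: term.length,
    words: term.split.length,
    commas: term.count(','),
    periods: term.count('.')
  }
end

# One point for every count that meets or exceeds its threshold.
def count_score(term)
  counts = count_signals(term)
  COUNT_THRESHOLDS.count { |name, minimum| counts[name] >= minimum }
end
```

Holding the thresholds in one hash keeps later tuning to a single place.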
While I feel okay about the overall structure of this detector, the
specific thresholds I'm using probably need to be verified against
real-world data. I have some ideas about how to pursue this in a later
refinement ticket.
** Document any side effects to this change:
* While there are similarities between this detector and the structure
of the StandardIdentifiers detector, I've chosen to vary some parts of
the approach as well (using scan rather than match, for example, or
defining the regexes using a constant). Ultimately I think we should
probably have a standardized approach, but for now I think some
variation might help us compare the two.
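
For reference, the difference between scan and match mentioned above
can be seen in a small sketch. The YEAR constant is a hypothetical
pattern used only for this example.

```ruby
# Hypothetical pattern held in a constant.
YEAR = /\b(?:19|20)\d{2}\b/

term = 'Annual report 1999 and 2004'

# String#match returns MatchData for the first hit only (or nil).
first_only = term.match(YEAR)

# String#scan returns an Array of every hit (empty when none match).
all_hits = term.scan(YEAR)
```

With scan, the number of hits can feed directly into a count-based
signal, which match alone does not provide.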