Add a citation detector #119

Merged
merged 2 commits into from
Oct 11, 2024

Commits on Oct 10, 2024

  1. Adds a citation detector

    ** Why are these changes being introduced:
    
    A certain percentage of our search traffic is made up of formal
    citations to existing works, in a variety of formats. It would be good
    to have a detector to identify these and pluck them out consistently for
    further work (reconciliation, re-formatting, etc.).
    
    ** Relevant ticket(s):
    
    * https://mitlibraries.atlassian.net/browse/tco-97
    
    Also TCO-96 and TCO-95 get some benefit from this.
    
    ** How does this address that need:
    
    This adds a new Detector::Citation class, with attendant changes to the
    seeds and test fixtures. This is a different type of detector than we've
    written in the past, using a multi-layer approach that first compiles
    small, discrete signals using regexes and counts, which are then
    assessed by a second routine that calculates a final score. Terms which
    score high enough can have a Detection registered using our usual
    workflow.
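
The two-layer flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `CitationSketch`, the two patterns, the count thresholds, and the `REQUIRED_SCORE` cutoff are all assumptions made for the example.

```ruby
# Hypothetical two-layer sketch; names, patterns, and thresholds are
# illustrative and not taken from Detector::Citation.
class CitationSketch
  # Layer 1a: regex-based signals (assumed patterns).
  PATTERNS = {
    year_parens: /\(\d{4}\)/,
    pages: /\bpp?\.\s*\d+/
  }.freeze

  # Layer 1b: count-based signals with assumed minimums.
  COUNT_THRESHOLDS = { commas: 2, periods: 2 }.freeze

  # Assumed cutoff for registering a detection.
  REQUIRED_SCORE = 3

  attr_reader :signals, :score

  def initialize(term)
    @signals = collect_signals(term)
    @score = calculate_score
  end

  # True when the term scores high enough to count as a citation.
  def detection?
    score >= REQUIRED_SCORE
  end

  private

  # First layer: gather the small signals without judging them.
  def collect_signals(term)
    found = PATTERNS.transform_values { |regex| term.scan(regex).length }
    found[:commas] = term.count(',')
    found[:periods] = term.count('.')
    found
  end

  # Second layer: one point per matched pattern, one per count that
  # meets its threshold.
  def calculate_score
    pattern_points = PATTERNS.keys.count { |name| signals[name].positive? }
    count_points = COUNT_THRESHOLDS.count { |name, min| signals[name] >= min }
    pattern_points + count_points
  end
end
```

Keeping the signal collection separate from the scoring means the thresholds and weights can be tuned later without touching the patterns themselves.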
    
    The smaller discrete signals were designed after looking over examples
    of five different citation formats: MLA, APA, Chicago, Turabian, and
    IEEE. Examples of these patterns include formats for volume, issue, page
    ranges, quoted titles, and name formatting. These are implemented using
    regular expressions.
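
As a rough sketch of what such pattern signals might look like, the example below defines a few regexes in a constant and applies them with `scan`. The specific patterns and the names `CITATION_PATTERNS` and `pattern_matches` are assumptions for illustration; the regexes in the actual detector may differ.

```ruby
# Illustrative patterns only; not the regexes used by Detector::Citation.
CITATION_PATTERNS = {
  volume: /\bvol\.\s*\d+/i,
  issue: /\bno\.\s*\d+/i,
  pages: /\bpp?\.\s*\d+(?:\s*-\s*\d+)?/i,
  quoted_title: /"[^"]+"/,
  lastname_initial: /\b[A-Z][a-z]+,\s[A-Z]\./
}.freeze

# Returns a hash of pattern name => matched substrings, keeping only the
# patterns that matched at least once.
def pattern_matches(term)
  CITATION_PATTERNS.transform_values { |regex| term.scan(regex) }
                   .reject { |_name, hits| hits.empty? }
end
```

Calling `pattern_matches` on a string such as `Doe, J. "Parsing citations." Journal of Examples, vol. 4, no. 2, pp. 10-20.` would report hits for all five patterns, while an ordinary keyword search would return an empty hash.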
    
    A second set of discrete signals is generated using counts, by looking
    at how many characters, words, and specific symbols are found in the
    search string (commas, periods, and other potential separators). Each of
    these counts is compared to a threshold value, so that if enough of a
    given item appears in the term, the citation score is raised.
    
    While I feel okay about the overall structure of this detector, the
    specific thresholds I'm using probably need to be verified against real
    world data. I have some ideas about how to pursue this in the future, as
    a refinement ticket later on.
    
    ** Document any side effects to this change:
    
    * While there are similarities between this detector and the structure
    of the StandardIdentifiers detector, I've chosen to vary some parts of
    the approach as well (using scan rather than match, for example, or
    defining the regexes using a constant). Ultimately I think we should
    probably have a standardized approach, but for now I think some
    variation might help us compare and contrast between them.
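
For readers comparing the two approaches, the difference between `String#match` and `String#scan` in Ruby can be seen in a small example. The `PAGE_RANGE` pattern and the sample term are assumptions for illustration and do not come from either detector.

```ruby
# Assumed pattern, defined once as a constant.
PAGE_RANGE = /\d+\s*-\s*\d+/

term = 'pp. 10-20 and 35-40'

# match returns a MatchData for the first match only, or nil if none.
first_hit = term.match(PAGE_RANGE)

# scan returns an Array of every match, or an empty Array if none.
all_hits = term.scan(PAGE_RANGE)
```

Because `scan` always returns an array, counting the hits needs no nil check, which may suit a detector that tallies signals.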
    matt-bernhardt committed Oct 10, 2024
    ed39b74

Commits on Oct 11, 2024

  1. fa36a2a