Add a citation detector #119

** Why are these changes being introduced:

A certain percentage of our search traffic is made up of formal citations to existing works, in a variety of formats. It would be good to have a detector to identify these and pluck them out consistently for further work (reconciliation, re-formatting, etc).

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-97

Also TCO-96 and TCO-95 get some benefit from this.

** How does this address that need:

This adds a new Detector::Citation class, with attendant changes to the seeds and test fixtures. This is a different type of detector than we've written in the past. It uses a multi-layer approach: a first pass collects small, discrete signals using regexes and counts, and a second routine then assesses those signals to calculate a final score. Terms which score high enough can have a Detection registered using our usual workflow.

The smaller discrete signals were designed after looking over examples of five different citation formats: MLA, APA, Chicago, Turabian, and IEEE. Examples of these patterns include formats for volume, issue, page ranges, quoted titles, and name formatting. These are implemented using regular expressions.

A second set of discrete signals is generated using counts, by looking at how many characters, words, and specific symbols are found in the search string (commas, periods, and other potential separators). Each of these counts is compared to a threshold value, so that the citation score is raised when enough of them appear in the term.

While I feel okay about the overall structure of this detector, the specific thresholds I'm using probably need to be verified against real-world data. I have some ideas about how to pursue this in a later refinement ticket.
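For readers who want a concrete picture of the two-layer approach described above, here is a minimal sketch in Ruby. It is not the code in this PR: the class name `CitationSketch`, the patterns, the threshold values, and the required score are all assumptions chosen for illustration.

```ruby
# Hypothetical sketch of a two-layer citation detector. Every name, pattern,
# and threshold here is an illustrative assumption, not the PR's actual code.
class CitationSketch
  # Layer 1a: regex signals for common citation fragments.
  PATTERNS = {
    volume: /vol\.\s?\d+/i,
    issue: /no\.\s?\d+/i,
    pages: /pp?\.\s?\d+(?:\s?-\s?\d+)?/i,
    quoted_title: /"[^"]+"/,
    year_parens: /\(\d{4}\)/
  }.freeze

  # Layer 1b: count thresholds (assumed values, not verified against data).
  THRESHOLDS = { characters: 50, words: 8, commas: 2, periods: 2 }.freeze

  # Minimum score needed before a detection would be registered (assumed).
  REQUIRED_SCORE = 4

  attr_reader :signals, :counts, :score

  def initialize(term)
    # Keep only the patterns that matched at least once.
    @signals = PATTERNS.transform_values { |re| term.scan(re) }
                       .reject { |_, hits| hits.empty? }
    @counts = {
      characters: term.length,
      words: term.split.length,
      commas: term.count(','),
      periods: term.count('.')
    }
    @score = calculate_score
  end

  def detection?
    score >= REQUIRED_SCORE
  end

  private

  # Layer 2: one point per matched pattern, plus one per count at or above
  # its threshold.
  def calculate_score
    over_threshold = THRESHOLDS.count { |key, limit| counts[key] >= limit }
    signals.length + over_threshold
  end
end

citation = CitationSketch.new(
  'Smith, J. "A Study of Things." Journal of Stuff, vol. 12, no. 3, 2019, pp. 45-67.'
)
puts citation.detection?

plain = CitationSketch.new('climate change')
puts plain.detection?
```

In this sketch every signal carries equal weight; a real scoring routine might weight signals differently, which is the kind of tuning the threshold verification mentioned above would inform.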
** Document any side effects to this change:

* While there are similarities between this detector and the structure of the StandardIdentifiers detector, I've chosen to vary some parts of the approach as well (using scan rather than match, for example, or defining the regexes using a constant). Ultimately I think we should probably have a standardized approach, but for now I think some variation might help us compare and contrast between them.
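To illustrate the scan-versus-match difference mentioned above, here is a small Ruby sketch. The module name `ScanVsMatch` and the `PAGES` pattern are hypothetical and not taken from either detector; only `String#match` and `String#scan` are standard Ruby.

```ruby
# Illustrative only: the module and pattern are assumptions, not PR code.
module ScanVsMatch
  # Defining the regex as a constant keeps it in one named, reusable place.
  PAGES = /pp?\. ?\d+-\d+/

  # String#match returns MatchData for the first hit only (or nil).
  def self.first_hit(term)
    found = term.match(PAGES)
    found && found[0]
  end

  # String#scan returns an array of every non-overlapping hit.
  def self.all_hits(term)
    term.scan(PAGES)
  end
end

term = 'See pp. 10-12 and pp. 40-45'
puts ScanVsMatch.first_hit(term)
puts ScanVsMatch.all_hits(term).length
```

Because `scan` returns all hits, it suits a detector that counts how often a pattern occurs, while `match` is enough when only presence matters.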

Commits on Oct 11, 2024

Add detection? convenience method

matt-bernhardt committed Oct 11, 2024

fa36a2a

Commits on Oct 10, 2024
