Add a citation detector #119

matt-bernhardt · 2024-10-09T20:02:45Z

This adds a new type of Detector to the application, a Citation detector that is an example of a more complex pattern detector. It starts by performing a series of individual calculations on the term, building up intermediate signals which are then combined in a final step to calculate a single numeric score. If that score is beyond a defined threshold, the detector is able to register a Detection.

Because this Detector does not do anything beyond calculating the score - for example, it does not confirm whether the citation is to a real article, or look up anything based on that detection, I've elected not to incoporate it into the GraphQL response. I'm not sure that any of the output of the model would make sense to send back to a consuming application - the subpatterns, summary, and score values don't seem to have much use outside TACOS. Maybe they shouldn't be instance variables at all?

As part of this work, I've also written a small rake task that would run this process for all Terms in the system, writing a CSV file that could be useful for offline analysis of its operation. I'm happy to include that in this PR if you think it would be helpful, but it isn't needed for normal operations, so I haven't included it here.

Future detectors might be able to make use of the summary counts or some sub-patterns which I've defined in this class. In that case, I think they could be moved to their own classes - but for now I've left them here to make this work simpler.

Developer

Ticket(s)

This will close https://mitlibraries.atlassian.net/browse/TCO-97
It will also contribute towards two other tickets:
https://mitlibraries.atlassian.net/browse/TCO-95 (extraction-style detector)
https://mitlibraries.atlassian.net/browse/TCO-96 (base detector)

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

** Why are these changes being introduced: A certain percentage of our search traffic is made up of formal citations to existing works, in a variety of formats. It would be good to have a detector to identify these and pluck them out consistently for further work (reconciliation, re-formatting, etc) ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-97 Also TCO-96 and TCO-95 get some benefit from this. ** How does this address that need: This adds a new Detector::Citation class, with attendant changes to the seeds and test fixtures. This is a different type of detector than we've written in the past, using a multi-layer approach that first compiles some discrete small information using regexes and counts, which are then assessed by a second routine that calculates a final score. Terms which score high enough can have a Detection registered using our usual workflow. The smaller discrete signals were designed after looking over examples of five different citation formats: MLA, APA, Chicago, Terabian, and IEEE. Examples of these patterns include formats for volume, issue, page ranges, quoted titles, and name formatting. These are implemented using regular expressions. A second set of discrete signals are generated using counts, by looking at how many characters, words, and specific symbols are found in the search string (commas, periods, and other potential separators). Each of these counts are compared to a threshold value, so that if enough of them are in the term then the citation score gets raised. While I feel okay about the overall structure of this detector, the specific thresholds I'm using probably need to be verified against real world data. I have some ideas about how to pursue this in the future, as a refinement ticket later on. ** Document any side effects to this change: * While there are similarities between this detector and the structure of the StandardIdentifiers detector, I've chosen to vary some parts of the approach as well (using scan rather than match, for example, or defining the regexes using a constant). Ultimately I think we should probably have a standardized approach, but for now I think some variation might help us compare and contrast between them.

matt-bernhardt · 2024-10-10T18:43:02Z

Oh, @JPrevost - a question that occurs to me now, but which may just complicate things: even though this citation detector doesn't provide any information to the GraphQL endpoint, should there be a test in the GraphQL controller to confirm that traffic coming in from the API does actually result in this Detector being tripped? That feels like an integration test, rather than a test of the GraphQL endpoint itself?

I'm happy to write up a test suite like this if you'd like.

JPrevost

Overall this looks really good and is very interesting.

I added a couple of comments that I'd like to let you decide how to address (or not) before finalizing the review as they could potentially change enough that I'd want to start my review over. Neither are blocking, so if you want to leave them as-is please let me know and I'll complete the review with the code as-is.

JPrevost · 2024-10-10T19:35:09Z

app/models/detector/citation.rb

+    # @note This method can be called directly via Detector::Citation.new(Term). It is also called indirectly via the
+    #       Detector::Citation.record(Term) instance method. This method can be called directly when a Detection is not
+    #       desired.
+    def initialize(term)


I've been thinking about a future state of having tacos run in "not logging mode". Having our detectors work based on Strings instead of Terms would be key to that potential functionality.

The .record methods could still use the Term as those are explicitly part of our write mode processes.

I'm sharing this just as information on some future direction we may head. If you prefer to leave this initialize method using the Term that is fine. Some of our Detectors use Terms and some use Strings. Eventually I want to normalize that as it is a type of technical debt to have them work differently for now good reason. As we haven't formally picked a path forward, either Term is String is acceptable, but I'm leaning towards proposing we adopt Strings for all Detectors even if we use Terms for record to give us the flexibility checking how TACOS would respond to input without having to store that input. The main use cases I have for that desired feature is using the playground without having to store my test queries as Terms as well as potentially opening up TACOS for others to experiment with without us introducing a privacy concern storing data from other systems not in our control. The latter is not planned so the former is what I'd like to decide how to handle.

I support having a path through the application to allow for querying with strings, but I'd prefer to make that change in a coordinated way. Of the detectors we have now, every example that uses an initialize method (StandardIdentifiers, LCSH, and now Citation) does so based on a Term method - so I'd prefer to keep this consistency. I worry about having a state, even temporarily, where one detector behaves differently from the others.

As part of such a change, I wonder about how to name the methods to clarify what is expected, i.e. something like:

Detector::Citation.new(phrase) # using built-in "new" as a method Detector::Citation.register_term(term) # method named to clarify expected argument, instead of "record"

But, we can have this discussion more thoroughly in whatever ticket centers that refactoring.

Yup, that makes sense to me.

JPrevost · 2024-10-10T19:36:53Z

app/models/detector/citation.rb

+    # @return hash
+    def summarize(term)
+      %w[characters colons commas periods semicolons words].each do |check|
+        @summary[check.to_sym] = send(check, term)


This took me a bit to follow what was going on even with the docs. It's a solid pattern, but we don't reach for it enough for me to have immediately understood it.

Is there a better way to say "call all these methods" ? I was looking for a structure that was efficient without being opaque, so I'm open to hearing that this doesn't quite hit that spot.

No, I think what you are doing here is great. send is meant for this. I just had to look it up as we don't use it much.

JPrevost · 2024-10-10T19:40:22Z

app/models/detector/citation.rb

+    end
+
+    # The record method first runs all of the parsers by running the initialize method. If the resulting score is higher
+    # than the REQUIRED_SCORE value, then a Detection is registered.


It feels a bit odd to me that the record method is where the decision as to whether the input is a citation or not is made.

I wonder if adding a boolean method "citation?" would be valuable. That would give us more flexibility to report that information to UIs that might want it in the future (I'm fine with not returning anything in GraphQL just yet as we learn more about this Detector's usefulness, but having something in GraphQL that indicated we think it was a citation does feel potentially useful)

def citation? @score >= REQUIRED_SCORE end

Then record could leverage that rather than doing that calculation itself.

return unless cit.citation?

Yeah, this feels like a good method. I'll revise the PR to add this, at least. What do you think about detection? as a name, though? That might be shareable across models?

@detection is what I'm toying with for all pattern matches detected as an instance variable. The expectation would be that it would be an array of name/value pairs.

Having a boolean detection? as a convenience method (which on some models would just be the presence of anything in detection but for this could be a subset that match the score threshold) does seem like a good pattern.

I've pushed a commit and tests to add this, and converted the record method to consume it.

matt-bernhardt · 2024-10-11T15:49:29Z

Okay, I think this is ready for review - I've added a commit to implement a convenience method for the activation check. I'd prefer to handle the change from Terms to Strings as input in a coordinated way, leaving the Detectors to behave consistently for now.

JPrevost

Please remember to create the Detector record manually in Stage/Prod when promoting this work.

mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 9, 2024 20:05 Inactive

mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 10, 2024 15:52 Inactive

matt-bernhardt force-pushed the tco-96 branch from d5491db to ed39b74 Compare October 10, 2024 18:23

mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 10, 2024 18:23 Inactive

matt-bernhardt marked this pull request as ready for review October 10, 2024 18:23

matt-bernhardt requested review from JPrevost and jazairi October 10, 2024 18:26

JPrevost reviewed Oct 10, 2024

View reviewed changes

Add detection? convenience method

fa36a2a

mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 11, 2024 15:47 Inactive

JPrevost self-assigned this Oct 11, 2024

JPrevost approved these changes Oct 11, 2024

View reviewed changes

matt-bernhardt merged commit 4225521 into main Oct 11, 2024
3 checks passed

matt-bernhardt deleted the tco-96 branch October 11, 2024 17:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a citation detector #119

Add a citation detector #119

matt-bernhardt commented Oct 9, 2024 •

edited by JPrevost

Loading

matt-bernhardt commented Oct 10, 2024

JPrevost left a comment

JPrevost Oct 10, 2024

matt-bernhardt Oct 11, 2024

JPrevost Oct 11, 2024

JPrevost Oct 10, 2024

matt-bernhardt Oct 11, 2024

JPrevost Oct 11, 2024

JPrevost Oct 10, 2024

matt-bernhardt Oct 11, 2024

JPrevost Oct 11, 2024

matt-bernhardt Oct 11, 2024

matt-bernhardt commented Oct 11, 2024

JPrevost left a comment

Add a citation detector #119

Add a citation detector #119

Conversation

matt-bernhardt commented Oct 9, 2024 • edited by JPrevost Loading

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

matt-bernhardt commented Oct 10, 2024

JPrevost left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matt-bernhardt commented Oct 11, 2024

JPrevost left a comment

Choose a reason for hiding this comment

matt-bernhardt commented Oct 9, 2024 •

edited by JPrevost

Loading