Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a citation detector #119

Merged
merged 2 commits into from
Oct 11, 2024
Merged

Add a citation detector #119

merged 2 commits into from
Oct 11, 2024

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Oct 9, 2024

This adds a new type of Detector to the application, a Citation detector that is an example of a more complex pattern detector. It starts by performing a series of individual calculations on the term, building up intermediate signals which are then combined in a final step to calculate a single numeric score. If that score is beyond a defined threshold, the detector is able to register a Detection.

Because this Detector does not do anything beyond calculating the score - for example, it does not confirm whether the citation is to a real article, or look up anything based on that detection, I've elected not to incoporate it into the GraphQL response. I'm not sure that any of the output of the model would make sense to send back to a consuming application - the subpatterns, summary, and score values don't seem to have much use outside TACOS. Maybe they shouldn't be instance variables at all?

As part of this work, I've also written a small rake task that would run this process for all Terms in the system, writing a CSV file that could be useful for offline analysis of its operation. I'm happy to include that in this PR if you think it would be helpful, but it isn't needed for normal operations, so I haven't included it here.

Future detectors might be able to make use of the summary counts or some sub-patterns which I've defined in this class. In that case, I think they could be moved to their own classes - but for now I've left them here to make this work simpler.

Developer

Ticket(s)

This will close https://mitlibraries.atlassian.net/browse/TCO-97
It will also contribute towards two other tickets:
https://mitlibraries.atlassian.net/browse/TCO-95 (extraction-style detector)
https://mitlibraries.atlassian.net/browse/TCO-96 (base detector)

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 9, 2024 20:05 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 10, 2024 15:52 Inactive
** Why are these changes being introduced:

A certain percentage of our search traffic is made up of formal
citations to existing works, in a variety of formats. It would be good
to have a detector to identify these and pluck them out consistently for
further work (reconciliation, re-formatting, etc)

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-97

Also TCO-96 and TCO-95 get some benefit from this.

** How does this address that need:

This adds a new Detector::Citation class, with attendant changes to the
seeds and test fixtures. This is a different type of detector than we've
written in the past, using a multi-layer approach that first compiles
some discrete small information using regexes and counts, which are then
assessed by a second routine that calculates a final score. Terms which
score high enough can have a Detection registered using our usual
workflow.

The smaller discrete signals were designed after looking over examples
of five different citation formats: MLA, APA, Chicago, Terabian, and
IEEE. Examples of these patterns include formats for volume, issue, page
ranges, quoted titles, and name formatting. These are implemented using
regular expressions.

A second set of discrete signals are generated using counts, by looking
at how many characters, words, and specific symbols are found in the
search string (commas, periods, and other potential separators). Each of
these counts are compared to a threshold value, so that if enough of
them are in the term then the citation score gets raised.

While I feel okay about the overall structure of this detector, the
specific thresholds I'm using probably need to be verified against real
world data. I have some ideas about how to pursue this in the future, as
a refinement ticket later on.

** Document any side effects to this change:

* While there are similarities between this detector and the structure
of the StandardIdentifiers detector, I've chosen to vary some parts of
the approach as well (using scan rather than match, for example, or
defining the regexes using a constant). Ultimately I think we should
probably have a standardized approach, but for now I think some
variation might help us compare and contrast between them.
@matt-bernhardt
Copy link
Member Author

Oh, @JPrevost - a question that occurs to me now, but which may just complicate things: even though this citation detector doesn't provide any information to the GraphQL endpoint, should there be a test in the GraphQL controller to confirm that traffic coming in from the API does actually result in this Detector being tripped? That feels like an integration test, rather than a test of the GraphQL endpoint itself?

I'm happy to write up a test suite like this if you'd like.

Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really good and is very interesting.

I added a couple of comments that I'd like to let you decide how to address (or not) before finalizing the review as they could potentially change enough that I'd want to start my review over. Neither are blocking, so if you want to leave them as-is please let me know and I'll complete the review with the code as-is.

# @note This method can be called directly via Detector::Citation.new(Term). It is also called indirectly via the
# Detector::Citation.record(Term) instance method. This method can be called directly when a Detection is not
# desired.
def initialize(term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about a future state of having tacos run in "not logging mode". Having our detectors work based on Strings instead of Terms would be key to that potential functionality.

The .record methods could still use the Term as those are explicitly part of our write mode processes.

I'm sharing this just as information on some future direction we may head. If you prefer to leave this initialize method using the Term that is fine. Some of our Detectors use Terms and some use Strings. Eventually I want to normalize that as it is a type of technical debt to have them work differently for now good reason. As we haven't formally picked a path forward, either Term is String is acceptable, but I'm leaning towards proposing we adopt Strings for all Detectors even if we use Terms for record to give us the flexibility checking how TACOS would respond to input without having to store that input. The main use cases I have for that desired feature is using the playground without having to store my test queries as Terms as well as potentially opening up TACOS for others to experiment with without us introducing a privacy concern storing data from other systems not in our control. The latter is not planned so the former is what I'd like to decide how to handle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support having a path through the application to allow for querying with strings, but I'd prefer to make that change in a coordinated way. Of the detectors we have now, every example that uses an initialize method (StandardIdentifiers, LCSH, and now Citation) does so based on a Term method - so I'd prefer to keep this consistency. I worry about having a state, even temporarily, where one detector behaves differently from the others.

As part of such a change, I wonder about how to name the methods to clarify what is expected, i.e. something like:

Detector::Citation.new(phrase) # using built-in "new" as a method

Detector::Citation.register_term(term) # method named to clarify expected argument, instead of "record"

But, we can have this discussion more thoroughly in whatever ticket centers that refactoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that makes sense to me.

# @return hash
def summarize(term)
%w[characters colons commas periods semicolons words].each do |check|
@summary[check.to_sym] = send(check, term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This took me a bit to follow what was going on even with the docs. It's a solid pattern, but we don't reach for it enough for me to have immediately understood it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to say "call all these methods" ? I was looking for a structure that was efficient without being opaque, so I'm open to hearing that this doesn't quite hit that spot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think what you are doing here is great. send is meant for this. I just had to look it up as we don't use it much.

end

# The record method first runs all of the parsers by running the initialize method. If the resulting score is higher
# than the REQUIRED_SCORE value, then a Detection is registered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit odd to me that the record method is where the decision as to whether the input is a citation or not is made.

I wonder if adding a boolean method "citation?" would be valuable. That would give us more flexibility to report that information to UIs that might want it in the future (I'm fine with not returning anything in GraphQL just yet as we learn more about this Detector's usefulness, but having something in GraphQL that indicated we think it was a citation does feel potentially useful)

def citation?
  @score >= REQUIRED_SCORE
end

Then record could leverage that rather than doing that calculation itself.

return unless cit.citation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like a good method. I'll revise the PR to add this, at least. What do you think about detection? as a name, though? That might be shareable across models?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@detection is what I'm toying with for all pattern matches detected as an instance variable. The expectation would be that it would be an array of name/value pairs.

Having a boolean detection? as a convenience method (which on some models would just be the presence of anything in detection but for this could be a subset that match the score threshold) does seem like a good pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit and tests to add this, and converted the record method to consume it.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-119 October 11, 2024 15:47 Inactive
@matt-bernhardt
Copy link
Member Author

Okay, I think this is ready for review - I've added a commit to implement a convenience method for the activation check. I'd prefer to handle the change from Terms to Strings as input in a coordinated way, leaving the Detectors to behave consistently for now.

@JPrevost JPrevost self-assigned this Oct 11, 2024
Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember to create the Detector record manually in Stage/Prod when promoting this work.

:shipit:

@matt-bernhardt matt-bernhardt merged commit 4225521 into main Oct 11, 2024
3 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-96 branch October 11, 2024 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants