-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a citation detector #119
Conversation
** Why are these changes being introduced: A certain percentage of our search traffic is made up of formal citations to existing works, in a variety of formats. It would be good to have a detector to identify these and pluck them out consistently for further work (reconciliation, re-formatting, etc) ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-97 Also TCO-96 and TCO-95 get some benefit from this. ** How does this address that need: This adds a new Detector::Citation class, with attendant changes to the seeds and test fixtures. This is a different type of detector than we've written in the past, using a multi-layer approach that first compiles some discrete small information using regexes and counts, which are then assessed by a second routine that calculates a final score. Terms which score high enough can have a Detection registered using our usual workflow. The smaller discrete signals were designed after looking over examples of five different citation formats: MLA, APA, Chicago, Terabian, and IEEE. Examples of these patterns include formats for volume, issue, page ranges, quoted titles, and name formatting. These are implemented using regular expressions. A second set of discrete signals are generated using counts, by looking at how many characters, words, and specific symbols are found in the search string (commas, periods, and other potential separators). Each of these counts are compared to a threshold value, so that if enough of them are in the term then the citation score gets raised. While I feel okay about the overall structure of this detector, the specific thresholds I'm using probably need to be verified against real world data. I have some ideas about how to pursue this in the future, as a refinement ticket later on. ** Document any side effects to this change: * While there are similarities between this detector and the structure of the StandardIdentifiers detector, I've chosen to vary some parts of the approach as well (using scan rather than match, for example, or defining the regexes using a constant). Ultimately I think we should probably have a standardized approach, but for now I think some variation might help us compare and contrast between them.
d5491db
to
ed39b74
Compare
Oh, @JPrevost - a question that occurs to me now, but which may just complicate things: even though this citation detector doesn't provide any information to the GraphQL endpoint, should there be a test in the GraphQL controller to confirm that traffic coming in from the API does actually result in this Detector being tripped? That feels like an integration test, rather than a test of the GraphQL endpoint itself? I'm happy to write up a test suite like this if you'd like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks really good and is very interesting.
I added a couple of comments that I'd like to let you decide how to address (or not) before finalizing the review as they could potentially change enough that I'd want to start my review over. Neither are blocking, so if you want to leave them as-is please let me know and I'll complete the review with the code as-is.
# @note This method can be called directly via Detector::Citation.new(Term). It is also called indirectly via the | ||
# Detector::Citation.record(Term) instance method. This method can be called directly when a Detection is not | ||
# desired. | ||
def initialize(term) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about a future state of having tacos run in "not logging mode". Having our detectors work based on Strings
instead of Terms
would be key to that potential functionality.
The .record
methods could still use the Term
as those are explicitly part of our write mode processes.
I'm sharing this just as information on some future direction we may head. If you prefer to leave this initialize method using the Term that is fine. Some of our Detectors use Terms and some use Strings. Eventually I want to normalize that as it is a type of technical debt to have them work differently for now good reason. As we haven't formally picked a path forward, either Term is String is acceptable, but I'm leaning towards proposing we adopt Strings for all Detectors even if we use Terms for record
to give us the flexibility checking how TACOS would respond to input without having to store that input. The main use cases I have for that desired feature is using the playground without having to store my test queries as Terms as well as potentially opening up TACOS for others to experiment with without us introducing a privacy concern storing data from other systems not in our control. The latter is not planned so the former is what I'd like to decide how to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support having a path through the application to allow for querying with strings, but I'd prefer to make that change in a coordinated way. Of the detectors we have now, every example that uses an initialize
method (StandardIdentifiers, LCSH, and now Citation) does so based on a Term method - so I'd prefer to keep this consistency. I worry about having a state, even temporarily, where one detector behaves differently from the others.
As part of such a change, I wonder about how to name the methods to clarify what is expected, i.e. something like:
Detector::Citation.new(phrase) # using built-in "new" as a method
Detector::Citation.register_term(term) # method named to clarify expected argument, instead of "record"
But, we can have this discussion more thoroughly in whatever ticket centers that refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that makes sense to me.
# @return hash | ||
def summarize(term) | ||
%w[characters colons commas periods semicolons words].each do |check| | ||
@summary[check.to_sym] = send(check, term) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took me a bit to follow what was going on even with the docs. It's a solid pattern, but we don't reach for it enough for me to have immediately understood it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to say "call all these methods" ? I was looking for a structure that was efficient without being opaque, so I'm open to hearing that this doesn't quite hit that spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think what you are doing here is great. send
is meant for this. I just had to look it up as we don't use it much.
end | ||
|
||
# The record method first runs all of the parsers by running the initialize method. If the resulting score is higher | ||
# than the REQUIRED_SCORE value, then a Detection is registered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a bit odd to me that the record method is where the decision as to whether the input is a citation or not is made.
I wonder if adding a boolean method "citation?" would be valuable. That would give us more flexibility to report that information to UIs that might want it in the future (I'm fine with not returning anything in GraphQL just yet as we learn more about this Detector's usefulness, but having something in GraphQL that indicated we think it was a citation does feel potentially useful)
def citation?
@score >= REQUIRED_SCORE
end
Then record could leverage that rather than doing that calculation itself.
return unless cit.citation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this feels like a good method. I'll revise the PR to add this, at least. What do you think about detection?
as a name, though? That might be shareable across models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@detection
is what I'm toying with for all pattern matches detected as an instance variable. The expectation would be that it would be an array of name/value pairs.
Having a boolean detection?
as a convenience method (which on some models would just be the presence of anything in detection
but for this could be a subset that match the score threshold) does seem like a good pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a commit and tests to add this, and converted the record
method to consume it.
Okay, I think this is ready for review - I've added a commit to implement a convenience method for the activation check. I'd prefer to handle the change from Terms to Strings as input in a coordinated way, leaving the Detectors to behave consistently for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remember to create the Detector record manually in Stage/Prod when promoting this work.
This adds a new type of Detector to the application, a Citation detector that is an example of a more complex pattern detector. It starts by performing a series of individual calculations on the term, building up intermediate signals which are then combined in a final step to calculate a single numeric score. If that score is beyond a defined threshold, the detector is able to register a Detection.
Because this Detector does not do anything beyond calculating the score - for example, it does not confirm whether the citation is to a real article, or look up anything based on that detection, I've elected not to incoporate it into the GraphQL response. I'm not sure that any of the output of the model would make sense to send back to a consuming application - the subpatterns, summary, and score values don't seem to have much use outside TACOS. Maybe they shouldn't be instance variables at all?
As part of this work, I've also written a small rake task that would run this process for all Terms in the system, writing a CSV file that could be useful for offline analysis of its operation. I'm happy to include that in this PR if you think it would be helpful, but it isn't needed for normal operations, so I haven't included it here.
Future detectors might be able to make use of the summary counts or some sub-patterns which I've defined in this class. In that case, I think they could be moved to their own classes - but for now I've left them here to make this work simpler.
Developer
Ticket(s)
This will close https://mitlibraries.atlassian.net/browse/TCO-97
It will also contribute towards two other tickets:
https://mitlibraries.atlassian.net/browse/TCO-95 (extraction-style detector)
https://mitlibraries.atlassian.net/browse/TCO-96 (base detector)
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
NO dependencies are updated
NO migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing