Add automatic fingerprinting for Term records #138

Open
wants to merge 3 commits into
base: main
from

Conversation

matt-bernhardt
Member

@matt-bernhardt matt-bernhardt commented Nov 13, 2024

This adds a TermFingerprint model, which allows for clustering of terms by a shared fingerprint. Full details are in the commit message.

Developer

Ticket(s)

This started as an experiment that kept seeming promising, so I didn't create a ticket for it. Now that I'm writing the PR text, I think maybe I missed an opportunity to do this - but am not going to paper over that decision now without checking in with the project team.

Accessibility

  • ANDI or Wave has been run in accordance with our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

** Why are these changes being introduced:

After watching our search traffic for a few months, we have seen some
terms come in that are clearly related:

* 'Scientific American'
* 'scientific american'
* '"Scientific american"'

While there are valid reasons to store the term exactly as the user has
submitted it - there are also good reasons to standardize these values,
to look for clusters and related terms.
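
As a rough illustration of the kind of standardization meant here, the sketch below collapses the three example terms to a single value. The method name `fingerprint` and the specific steps (lowercasing, replacing punctuation with spaces, de-duplicating and sorting tokens) are assumptions made for this example; the actual calculation lives in the TermFingerprint model and may differ.

```ruby
# Hypothetical fingerprint calculation, for illustration only.
# This is NOT the TermFingerprint implementation.
def fingerprint(phrase)
  phrase
    .downcase
    .gsub(/[[:punct:]]/, ' ') # quotes and other punctuation become spaces
    .split                    # split on whitespace, dropping empty tokens
    .uniq
    .sort
    .join(' ')
end

terms = ['Scientific American', 'scientific american', '"Scientific american"']

# All three example terms collapse to one shared value.
shared = terms.map { |term| fingerprint(term) }.uniq
puts shared.inspect
```

Sorting the tokens is a design choice in this sketch: it also groups reordered phrases, which may or may not be wanted for search terms.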

** Relevant ticket(s):

n/a

** How does this address that need:

This defines a TermFingerprint model, which is related to the Term model
via a belongs_to relationship (a Term belongs to its TermFingerprint,
and a TermFingerprint can have many Terms). The migration that defines
this model also populates records for Terms we've already received.

The Term model gets two lifecycle hooks, one to add new fingerprints for
every term, and another to delete a TermFingerprint if no remaining
Terms reference it.
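
The behavior of the two hooks can be sketched with an in-memory stand-in. This is plain Ruby rather than ActiveRecord code: the `TermStore` class, its method names, and the placeholder `normalize` step are all hypothetical and only model the create-on-save and delete-when-orphaned behavior described above.

```ruby
# In-memory stand-in for the Term / TermFingerprint pair (hypothetical).
class TermStore
  attr_reader :fingerprints

  def initialize
    @terms = {}                 # phrase => fingerprint value
    @fingerprints = Hash.new(0) # fingerprint value => number of terms using it
  end

  # Plays the role of the hook that runs when a Term is saved.
  def save(phrase)
    return if @terms.key?(phrase)

    value = normalize(phrase)
    @terms[phrase] = value
    @fingerprints[value] += 1
  end

  # Plays the role of the hook that runs when a Term is destroyed:
  # the fingerprint is removed once no remaining term references it.
  def destroy(phrase)
    value = @terms.delete(phrase)
    return if value.nil?

    @fingerprints[value] -= 1
    @fingerprints.delete(value) if @fingerprints[value].zero?
  end

  private

  # Placeholder normalization (assumption): lowercase and drop double quotes.
  def normalize(phrase)
    phrase.downcase.delete('"').strip
  end
end

store = TermStore.new
store.save('Scientific American')
store.save('scientific american')
store.destroy('Scientific American')
still_present = store.fingerprints.key?('scientific american')
store.destroy('scientific american')
now_empty = store.fingerprints.empty?
```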

Beyond the methods needed to handle these creations and deletions in as
seamless a way as possible, we also add a .cluster method to the Term
model, which will return an array of all related Terms (but not the Term
itself - so a Term that has a unique fingerprint will return an empty
array). This method will be used as part of a future inspection UI.

Building the models in this way will also allow for querying based on
shared fingerprints - for example by adding a :has_many_terms scope
on the TermFingerprint model.
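
The expected results of .cluster, and of a query for shared fingerprints, can be sketched in plain Ruby. The `Term` struct and the helpers `cluster` and `shared_fingerprints` below are hypothetical stand-ins that work on an array rather than the database; they are not the model code in this PR.

```ruby
# Hypothetical stand-in for a Term record and its fingerprint value.
Term = Struct.new(:phrase, :fingerprint)

# Other terms sharing this term's fingerprint, excluding the term itself.
def cluster(term, all_terms)
  all_terms.select { |t| t.fingerprint == term.fingerprint && !t.equal?(term) }
end

# Fingerprint values referenced by more than one term: a plain-Ruby
# analogue of what a :has_many_terms scope might select.
def shared_fingerprints(all_terms)
  all_terms.group_by(&:fingerprint).select { |_, group| group.size > 1 }.keys
end

terms = [
  Term.new('Scientific American', 'american scientific'),
  Term.new('scientific american', 'american scientific'),
  Term.new('nature', 'nature')
]

related = cluster(terms[0], terms).map(&:phrase)
alone = cluster(terms[2], terms)
shared = shared_fingerprints(terms)
```

Here a term with a unique fingerprint gets an empty array back, matching the behavior described for .cluster.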

** Document any side effects to this change:

Two things:

1. A quirk of this implementation is that it is possible to delete a
TermFingerprint record, which does not delete the related Terms (and
thus also SearchEvents, Detections, and Categorizations). The Term will
then have a null fingerprint. The next time the Term is saved, its
fingerprint will be regenerated. No other operation should be impacted
by this arrangement.

For the life of me I can't anticipate why we might delete a fingerprint
- but the relationship needs to be optional to avoid further pain when
working in the console.

2. The process for calculating this TermFingerprint value is similar -
but not identical - to that for handling the SuggestedResource records.
The difference is that the TermFingerprint method removes """
sequences - which might need to be added to SuggestedResource. Both
methods should probably be abstracted out to a shared helper method,
honestly.
I noticed this when running make lint - not sure how it slipped
past my notice when working on the confirmation work.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 November 13, 2024 21:50 Inactive
3 participants