Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LCSH detector #113

Merged
merged 2 commits into from
Oct 4, 2024
Merged

Add LCSH detector #113

merged 2 commits into from
Oct 4, 2024

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Oct 2, 2024

Summary of changes (please refer to commit messages for full details)

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-71

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 18:23 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 18:29 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 19:55 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 20:14 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 20:55 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 22:21 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 22:33 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 22:38 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 2, 2024 22:41 Inactive
@matt-bernhardt matt-bernhardt marked this pull request as ready for review October 3, 2024 14:05
** Why are these changes being introduced:

We have noticed a significant volume of search traffic that looks like
multi-level LCSH headings, like "Geology -- Massachusetts". These likely
come from the Bento UI, which makes subject values like this clickable.

It makes sense to try and write a detector for this pattern, especially
as it would be the first detector which would resolve to an
Informational (or subject-based) search.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-71

** How does this address that need:

This writes a new Detector::Lcsh class, which uses a regex to look for a
' -- ' separator. The class is patterned off of the StandardIdentifier
class. I initially wrote this as part of that class, but this doesn't
really belong there - the pattern isn't an identifier in that sense, and
further work to identify subjects (particularly single-level subjects
like "Geology" rather than just "Geology -- Massachusetts") will go
beyond just using a regex for detections.

Adding this class has follow-on changes to the Term, Metrics, and
GraphQL areas of the application.

Outside the app code, there are a variety of tests, changes to the db
seeds, and a new migration to record the item counts that come from the
metrics work.

There will be a future ticket to look up the detected string in the set
of subject headings, to return more than just a string in the GraphQL.
Right now the GraphQL response is pretty useless, just sending the
search string back. It would be good to include include something else.

** Document any side effects to this change:

The Detectors Type file has its methods alphabetized.

I'm not sure if Detectors::Lcsh should instead be Detectors::LCSH?
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-113 October 4, 2024 19:30 Inactive
@JPrevost JPrevost self-assigned this Oct 4, 2024
# there are regex-able vocabulary quirks which might separate subject values from non-subject values.
def subject_patterns
{
separator: /(.*)\s--\s(.*)/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider a pattern that did not include the spaces in addition to this one? When I look at other LCSH sources, I see things like Zimbabwe--Economic policy which would not be detected by our pattern even though it is an LCSH term copied from their site.

If you considered this but were wary to make such a general detector, that is fine but I did think it would be worth asking :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this, but didn't go that route because I think the instances we're seeing in our traffic all have the separator. I wrote the test explicitly to make sure the other option fails as a way of being clear about what we're aiming for.

That said, I'd also be willing to look back at this in the future to see whether we're missing any relevant traffic.

'orange cats like popcorn',
'hyphenated names like Lin-Manuel Miranda do nothing',
'dashes used as an aside - like this one - do nothing',
'This one should--also not work'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test I'm talking about, FWIW, which explicitly sets out that this pattern is not one we're expecting.

@matt-bernhardt matt-bernhardt merged commit 7d6a2e3 into main Oct 4, 2024
3 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-71-lcsh branch October 4, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants