Implement detector for LCSH values
** Why are these changes being introduced:

We have noticed a significant volume of search traffic that looks like
multi-level LCSH headings, like "Geology -- Massachusetts". These likely
come from the Bento UI, which makes subject values like this clickable.

It makes sense to try and write a detector for this pattern, especially
as it would be the first detector which would resolve to an
Informational (or subject-based) search.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-71

** How does this address that need:

This writes a new Detector::Lcsh class, which uses a regex to look for a
' -- ' separator. The class is patterned after the StandardIdentifiers
class. I initially wrote this as part of that class, but it doesn't
really belong there - the pattern isn't an identifier in that sense, and
further work to identify subjects (particularly single-level subjects
like "Geology" rather than just "Geology -- Massachusetts") will go
beyond just using a regex for detections.
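
As a rough illustration of the separator check described above, here is a
minimal stand-alone sketch. The regex is the one from the commit; the constant
name and the `lcsh_like?` helper are hypothetical and are not part of the
application code. The samples are taken from the tests in this commit.

```ruby
# Hypothetical helper for illustration only; the commit itself keeps this
# logic inside the Detector::Lcsh class.
LCSH_SEPARATOR = /(.*)\s--\s(.*)/

# Returns true when the term contains a ' -- ' style separator
# (two hyphens with whitespace on both sides).
def lcsh_like?(term)
  LCSH_SEPARATOR.match?(term)
end

samples = [
  'Geology -- Massachusetts',                             # multi-level heading
  'Space vehicles -- Materials -- Congresses',            # three levels
  'hyphenated names like Lin-Manuel Miranda do nothing',  # single hyphen
  'This one should--also not work'                        # no surrounding spaces
]

samples.each { |term| puts "#{term.inspect} => #{lcsh_like?(term)}" }
```

Only the first two samples contain the separator, so only they are flagged.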

Adding this class has follow-on changes to the Term, Metrics, and
GraphQL areas of the application.

Outside the app code, there are a variety of tests, changes to the db
seeds, and a new migration to record the item counts that come from the
metrics work.

There will be a future ticket to look up the detected string in the set
of subject headings, to return more than just a string in the GraphQL.
Right now the GraphQL response adds little value, as it just sends the
search string back. It would be good to include something more.

** Document any side effects to this change:

The methods in the DetectorsType file (detectors_type.rb) are now alphabetized.

I'm not sure if Detector::Lcsh should instead be Detector::LCSH?
matt-bernhardt committed Oct 2, 2024
1 parent 862a979 commit ed58d24
Showing 14 changed files with 221 additions and 10 deletions.
17 changes: 11 additions & 6 deletions app/graphql/types/detectors_type.rb
@@ -5,21 +5,26 @@ class DetectorsType < Types::BaseObject
description 'Provides all available search term detectors'

field :journals, [Types::JournalsType], description: 'Information about journals detected in the search term'
field :lcsh, [String], description: 'Library of Congress Subject Heading information'
field :standard_identifiers, [Types::StandardIdentifiersType], description: 'Currently supported: ISBN, ISSN, PMID, DOI'
field :suggested_resources, [Types::SuggestedResourcesType], description: 'Suggested resources detected in the search term'

def standard_identifiers
Detector::StandardIdentifiers.new(@object).identifiers.map do |identifier|
{ kind: identifier.first, value: identifier.last }
end
end

def journals
Detector::Journal.full_term_match(@object).map do |journal|
{ title: journal.name, additional_info: journal.additional_info }
end
end

def lcsh
Detector::Lcsh.new(@object).identifiers.map(&:last)
end

def standard_identifiers
Detector::StandardIdentifiers.new(@object).identifiers.map do |identifier|
{ kind: identifier.first, value: identifier.last }
end
end

def suggested_resources
Detector::SuggestedResource.full_term_match(@object).map do |suggested_resource|
{ title: suggested_resource.title, url: suggested_resource.url }
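
The `lcsh` resolver in this file returns `identifiers.map(&:last)`. A small
plain-Ruby sketch of what that expression does with a hash is shown here; the
sample hash is an assumption about what `Detector::Lcsh#identifiers` holds
after a match, based on the class added in this commit.

```ruby
# Assumed shape of Detector::Lcsh#identifiers after a successful match.
identifiers = { separator: 'Geology -- Massachusetts' }

# Hash#map yields [key, value] pairs, so &:last keeps only the values.
values = identifiers.map(&:last)
p values # => ["Geology -- Massachusetts"]

# With no match the hash is empty and the GraphQL field would be an empty list.
p({}.map(&:last)) # => []
```

For a hash this is equivalent to calling `identifiers.values`.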
62 changes: 62 additions & 0 deletions app/models/detector/lcsh.rb
@@ -0,0 +1,62 @@
# frozen_string_literal: true

class Detector
# Detector::Lcsh is a very rudimentary detector for the separator between levels of a Library of Congress Subject
# Heading (LCSH). These subject headings follow this pattern: "Social security beneficiaries -- United States"
class Lcsh
attr_reader :identifiers

# For now the initialize method just needs to run the pattern checker. A space for future development would be to
# write additional methods to look up the detected LCSH for more information, and to confirm that the phrase is
# actually an LCSH.
def initialize(term)
@identifiers = {}
term_pattern_checker(term)
end

# The record method will consult the set of regex-based detectors that are defined in Detector::Lcsh. Any matches
# will be registered as Detection records.
#
# @note While there is currently only one check within the Detector::Lcsh class, the method is built to anticipate
# additional checks in the future. Every such check would be capable of generating a separate Detection record
# (although a single check finding multiple matches would still only result in one Detection).
#
# @return nil
def self.record(term)
results = Detector::Lcsh.new(term.phrase)

results.identifiers.each_key do
Detection.find_or_create_by(
term:,
detector: Detector.where(name: 'LCSH').first
)
end

nil
end

private

def term_pattern_checker(term)
subject_patterns.each_pair do |type, pattern|
@identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present?
end
end

# This implementation will only detect the first match of a pattern in a long string. For the separator pattern this
# is fine, as we only need to find one (and finding multiples wouldn't change the outcome). If a pattern does come
# along where match counts matter, this should be reconsidered.
def match(pattern, term)
pattern.match(term).to_s.strip
end

# subject_patterns are regex patterns that can be applied to indicate whether a search string is looking for an LCSH
# string. At the moment there is only one - for the separator character " -- " - but others might be possible if
# there are regex-able vocabulary quirks which might separate subject values from non-subject values.
def subject_patterns
{
separator: /(.*)\s--\s(.*)/
}
end
end
end
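
To show how the pattern check behaves outside of Rails, here is a simplified
stand-in for the class above. It is a sketch under stated assumptions: the
`LcshSketch` name and `PATTERNS` constant are hypothetical, `empty?` replaces
the Rails-only `present?`, and the `record` method is left out because it
needs the database.

```ruby
# Simplified, Rails-free stand-in for Detector::Lcsh (illustrative names only).
class LcshSketch
  PATTERNS = { separator: /(.*)\s--\s(.*)/ }.freeze

  attr_reader :identifiers

  def initialize(term)
    @identifiers = {}
    PATTERNS.each_pair do |type, pattern|
      # Regexp#match returns nil when nothing matches, and nil.to_s is "",
      # so a miss shows up here as an empty string.
      found = pattern.match(term).to_s.strip
      @identifiers[type] = found unless found.empty?
    end
  end
end

p LcshSketch.new('Space vehicles -- Materials -- Congresses').identifiers
p LcshSketch.new('orange cats like popcorn').identifiers
```

Because both capture groups are greedy, a heading with several separators is
still recorded as one match covering the whole string, which is consistent with
the comment on the `match` method. Calling the matcher once per pattern, as
this sketch does, avoids running the regex twice.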
18 changes: 15 additions & 3 deletions app/models/metrics/algorithms.rb
@@ -9,6 +9,7 @@
# doi :integer
# issn :integer
# isbn :integer
# lcsh :integer
# pmid :integer
# unmatched :integer
# created_at :datetime not null
@@ -48,7 +49,7 @@ def generate(month = nil)
count_matches(SearchEvent.includes(:term))
end
Metrics::Algorithms.create(month:, doi: matches[:doi], issn: matches[:issn], isbn: matches[:isbn],
pmid: matches[:pmid], journal_exact: matches[:journal_exact],
lcsh: matches[:lcsh], pmid: matches[:pmid], journal_exact: matches[:journal_exact],
suggested_resource_exact: matches[:suggested_resource_exact],
unmatched: matches[:unmatched])
end
@@ -79,8 +80,19 @@ def event_matches(event, matches)
ids = match_standard_identifiers(event, matches)
journal_exact = process_journals(event, matches)
suggested_resource_exact = process_suggested_resources(event, matches)
lcshs = match_lcsh(event, matches)

matches[:unmatched] += 1 if ids.identifiers.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
matches[:unmatched] += 1 if ids.identifiers.blank? && lcshs.identifiers.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
end

def match_lcsh(event, matches)
known_ids = %i[separator]
ids = Detector::Lcsh.new(event.term.phrase)

known_ids.each do |id|
matches[:lcsh] += 1 if ids.identifiers[id].present?
end
ids
end

# Checks for StandardIdentifiers matches
@@ -89,7 +101,7 @@ def event_matches(event, matches)
# @param matches [Hash] a Hash that keeps track of how many of each algorithm we match
# @return [Array] an array of matched StandardIdentifiers
def match_standard_identifiers(event, matches)
known_ids = %i[unmatched pmid isbn issn doi]
known_ids = %i[unmatched doi isbn issn pmid]
ids = Detector::StandardIdentifiers.new(event.term.phrase)

known_ids.each do |id|
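
The counting done by `match_lcsh` can be sketched in isolation as follows.
This is a deliberately reduced illustration: the `count_lcsh_matches` helper is
hypothetical, it takes plain strings instead of SearchEvent records, and it
treats every non-LCSH phrase as unmatched, whereas the real `event_matches`
only counts a phrase as unmatched when none of the detectors fire.

```ruby
# Hypothetical, database-free tally of LCSH-looking phrases.
SEPARATOR_PATTERN = /(.*)\s--\s(.*)/

def count_lcsh_matches(phrases)
  matches = Hash.new(0) # missing keys start at zero, so += 1 is safe

  phrases.each do |phrase|
    if SEPARATOR_PATTERN.match?(phrase)
      matches[:lcsh] += 1
    else
      # Simplification: the application also consults the other detectors.
      matches[:unmatched] += 1
    end
  end

  matches
end

tally = count_lcsh_matches(
  ['Geology -- Massachusetts', 'Maryland -- Geography', 'orange cats like popcorn']
)
p tally
```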
1 change: 1 addition & 0 deletions app/models/term.rb
@@ -24,6 +24,7 @@ class Term < ApplicationRecord
def record_detections
Detector::StandardIdentifiers.record(self)
Detector::Journal.record(self)
Detector::Lcsh.record(self)
Detector::SuggestedResource.record(self)

nil
5 changes: 5 additions & 0 deletions db/migrate/20241001205152_add_lcsh_to_metrics_algorithm.rb
@@ -0,0 +1,5 @@
class AddLcshToMetricsAlgorithm < ActiveRecord::Migration[7.1]
def change
add_column :metrics_algorithms, :lcsh, :integer
end
end
3 changes: 2 additions & 1 deletion db/schema.rb

Generated file; diff not rendered by default.

6 changes: 6 additions & 0 deletions db/seeds.rb
@@ -28,6 +28,7 @@
Detector.find_or_create_by(name: 'DOI')
Detector.find_or_create_by(name: 'ISBN')
Detector.find_or_create_by(name: 'ISSN')
Detector.find_or_create_by(name: 'LCSH')
Detector.find_or_create_by(name: 'PMID')
Detector.find_or_create_by(name: 'Journal')
Detector.find_or_create_by(name: 'SuggestedResource')
@@ -48,6 +49,11 @@
category: Category.find_by(name: 'Transactional'),
confidence: 0.6
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'LCSH'),
category: Category.find_by(name: 'Informational'),
confidence: 0.7
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'PMID'),
category: Category.find_by(name: 'Transactional'),
29 changes: 29 additions & 0 deletions test/controllers/graphql_controller_test.rb
@@ -111,6 +111,20 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
json['data']['logSearchEvent']['detectors']['suggestedResources'].first['url']
end

test 'search event query can return detected library of congress subject headings' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "Maryland -- Geography") {
detectors {
lcsh
}
}
}' }
json = response.parsed_body

assert_equal 'Maryland -- Geography',
json['data']['logSearchEvent']['detectors']['lcsh'].first
end

test 'search event query can return phrase from logged term' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
@@ -170,6 +184,21 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_in_delta 0.95, json['data']['logSearchEvent']['categories'].first['confidence']
end

test 'term lookup query can return detected library of congress subject headings' do
post '/graphql', params: { query: '{
lookupTerm(searchTerm: "Geology -- Massachusetts") {
detectors {
lcsh
}
}
}' }

json = response.parsed_body

assert_equal('Geology -- Massachusetts',
json['data']['lookupTerm']['detectors']['lcsh'].first)
end

test 'term lookup query can return categorization details for searches that trip a detector' do
post '/graphql', params: { query: '{
lookupTerm(searchTerm: "10.1016/j.physio.2010.12.004") {
5 changes: 5 additions & 0 deletions test/fixtures/detector_categories.yml
@@ -33,3 +33,8 @@ five:
detector: journal
category: transactional
confidence: 0.5

six:
detector: lcsh
category: informational
confidence: 0.7
3 changes: 3 additions & 0 deletions test/fixtures/detectors.yml
@@ -16,6 +16,9 @@ isbn:
issn:
name: 'ISSN'

lcsh:
name: 'LCSH'

pmid:
name: 'PMID'

7 changes: 7 additions & 0 deletions test/fixtures/search_events.yml
@@ -31,6 +31,13 @@ current_month_doi:
current_month_isbn:
term: isbn_9781319145446
source: test
current_month_lcsh:
term: lcsh
source: test
old_month_lcsh:
term: lcsh
source: test
created_at: <%= 1.year.ago %>
current_month_nature_medicine:
term: journal_nature_medicine
source: test
3 changes: 3 additions & 0 deletions test/fixtures/terms.yml
@@ -17,6 +17,9 @@ hi:
pmid_38908367:
phrase: 'TERT activation targets DNA methylation and multiple aging hallmarks. Shim HS, et al. Cell. 2024. PMID: 38908367'

lcsh:
phrase: 'Geology -- Massachusetts'

issn_1075_8623:
phrase: 1075-8623

53 changes: 53 additions & 0 deletions test/models/detector/lcsh_test.rb
@@ -0,0 +1,53 @@
# frozen_string_literal: true

require 'test_helper'

class Detector
class LcshTest < ActiveSupport::TestCase
test 'lcsh detector activates when a separator is found' do
true_samples = [
'Geology -- Massachusetts',
'Space vehicles -- Materials -- Congresses'
]

true_samples.each do |term|
actual = Detector::Lcsh.new(term).identifiers

assert_includes(actual, :separator)
end
end

test 'lcsh detector does nothing in most cases' do
false_samples = [
'orange cats like popcorn',
'hyphenated names like Lin-Manuel Miranda do nothing',
'dashes used as an aside - like this one - do nothing',
'This one should--also not work'
]

false_samples.each do |term|
actual = Detector::Lcsh.new(term).identifiers

assert_not_includes(actual, :separator)
end
end

test 'record method does relevant work' do
detection_count = Detection.count
t = terms('lcsh')

Detector::Lcsh.record(t)

assert_equal(detection_count + 1, Detection.count)
end

test 'record does nothing when not needed' do
detection_count = Detection.count
t = terms('isbn_9781319145446')

Detector::Lcsh.record(t)

assert_equal(detection_count, Detection.count)
end
end
end
19 changes: 19 additions & 0 deletions test/models/metrics/algorithms_test.rb
@@ -9,6 +9,7 @@
# doi :integer
# issn :integer
# isbn :integer
# lcsh :integer
# pmid :integer
# unmatched :integer
# created_at :datetime not null
@@ -38,6 +39,12 @@ class Algorithms < ActiveSupport::TestCase
assert_equal 1, aggregate.isbn
end

test 'lcsh counts are included in monthly aggregation' do
aggregate = Metrics::Algorithms.new.generate(DateTime.now)

assert_equal 1, aggregate.lcsh
end

test 'pmids counts are included in monthly aggregation' do
aggregate = Metrics::Algorithms.new.generate(DateTime.now)

@@ -93,6 +100,11 @@ class Algorithms < ActiveSupport::TestCase
SearchEvent.create(term: terms(:isbn_9781319145446), source: 'test')
end

lcsh_expected_count = rand(1...100)
lcsh_expected_count.times do
SearchEvent.create(term: terms(:lcsh), source: 'test')
end

pmid_expected_count = rand(1...100)
pmid_expected_count.times do
SearchEvent.create(term: terms(:pmid_38908367), source: 'test')
@@ -108,6 +120,7 @@ class Algorithms < ActiveSupport::TestCase
assert_equal doi_expected_count, aggregate.doi
assert_equal issn_expected_count, aggregate.issn
assert_equal isbn_expected_count, aggregate.isbn
assert_equal lcsh_expected_count, aggregate.lcsh
assert_equal pmid_expected_count, aggregate.pmid
assert_equal unmatched_expected_count, aggregate.unmatched
end
@@ -131,6 +144,12 @@ class Algorithms < ActiveSupport::TestCase
assert_equal 1, aggregate.isbn
end

test 'lcsh counts are included in total aggregation' do
aggregate = Metrics::Algorithms.new.generate

assert_equal 2, aggregate.lcsh
end

test 'pmids counts are included in total aggregation' do
aggregate = Metrics::Algorithms.new.generate
