Add LCSH detector #113

Merged 2 commits on Oct 4, 2024
17 changes: 11 additions & 6 deletions app/graphql/types/detectors_type.rb
@@ -5,21 +5,26 @@ class DetectorsType < Types::BaseObject
description 'Provides all available search term detectors'

field :journals, [Types::JournalsType], description: 'Information about journals detected in the search term'
field :lcsh, [String], description: 'Library of Congress Subject Heading information'
field :standard_identifiers, [Types::StandardIdentifiersType], description: 'Currently supported: ISBN, ISSN, PMID, DOI'
field :suggested_resources, [Types::SuggestedResourcesType], description: 'Suggested resources detected in the search term'

def standard_identifiers
Detector::StandardIdentifiers.new(@object).identifiers.map do |identifier|
{ kind: identifier.first, value: identifier.last }
end
end

def journals
Detector::Journal.full_term_match(@object).map do |journal|
{ title: journal.name, additional_info: journal.additional_info }
end
end

def lcsh
Detector::Lcsh.new(@object).identifiers.map(&:last)
end

def standard_identifiers
Detector::StandardIdentifiers.new(@object).identifiers.map do |identifier|
{ kind: identifier.first, value: identifier.last }
end
end

def suggested_resources
Detector::SuggestedResource.full_term_match(@object).map do |suggested_resource|
{ title: suggested_resource.title, url: suggested_resource.url }
63 changes: 63 additions & 0 deletions app/models/detector/lcsh.rb
@@ -0,0 +1,63 @@
# frozen_string_literal: true

class Detector
# Detector::Lcsh is a very rudimentary detector for the separator between levels of a Library of Congress Subject
# Heading (LCSH). These subject headings follow this pattern: "Social security beneficiaries -- United States"
class Lcsh
attr_reader :identifiers

# For now the initialize method just needs to run the pattern checker. A space for future development would be to
# write additional methods to look up the detected LCSH for more information, and to confirm that the phrase is
# actually an LCSH.
def initialize(term)
@identifiers = {}
term_pattern_checker(term)
end

# The record method will consult the set of regex-based detectors that are defined in Detector::Lcsh. Any matches
# will be registered as Detection records.
#
# @note While there is currently only one check within the Detector::Lcsh class, the method is built to anticipate
# additional checks in the future. Every such check would be capable of generating a separate Detection record
# (although a single check finding multiple matches would still only result in one Detection).
#
# @return nil
def self.record(term)
results = Detector::Lcsh.new(term.phrase)

results.identifiers.each_key do
Detection.find_or_create_by(
term:,
detector: Detector.where(name: 'LCSH').first,
detector_version: ENV.fetch('DETECTOR_VERSION', 'unset')
)
end

nil
end

private

def term_pattern_checker(term)
subject_patterns.each_pair do |type, pattern|
@identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present?
end
end

# This implementation will only detect the first match of a pattern in a long string. For the separator pattern this
# is fine, as we only need to find one (and finding multiples wouldn't change the outcome). If a pattern does come
# along where match counts matter, this should be reconsidered.
def match(pattern, term)
pattern.match(term).to_s.strip
end

# subject_patterns are regex patterns that can be applied to indicate whether a search string is looking for an LCSH
# string. At the moment there is only one - for the separator character " -- " - but others might be possible if
# there are regex-able vocabulary quirks which might separate subject values from non-subject values.
def subject_patterns
{
separator: /(.*)\s--\s(.*)/
Member

Did you consider a pattern that did not include the spaces in addition to this one? When I look at other LCSH sources, I see things like Zimbabwe--Economic policy which would not be detected by our pattern even though it is an LCSH term copied from their site.

If you considered this but were wary of making such a general detector, that is fine, but I thought it was worth asking :)

Member Author

I did consider this, but didn't go that route because I think the instances we're seeing in our traffic all have the separator. I wrote the test explicitly to make sure the other option fails as a way of being clear about what we're aiming for.

That said, I'd also be willing to look back at this in the future to see whether we're missing any relevant traffic.
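
To make the trade-off in this thread concrete, here is a minimal sketch that runs the merged pattern next to a looser one. `STRICT_SEPARATOR` copies the regex from this PR; `LOOSE_SEPARATOR` is a hypothetical alternative that is not part of the PR and only illustrates what "no spaces required" might look like.

```ruby
# STRICT_SEPARATOR copies the pattern merged in this PR.
STRICT_SEPARATOR = /(.*)\s--\s(.*)/

# LOOSE_SEPARATOR is a hypothetical alternative (not in this PR) that makes the
# whitespace around the double hyphen optional.
LOOSE_SEPARATOR = /(.*)\s*--\s*(.*)/

samples = [
  'Geology -- Massachusetts',
  'Zimbabwe--Economic policy',
  'This one should--also not work'
]

# Build a hash of term => which patterns fire for it.
results = samples.to_h do |term|
  [term, { strict: STRICT_SEPARATOR.match?(term), loose: LOOSE_SEPARATOR.match?(term) }]
end

results.each { |term, outcome| puts "#{term.inspect}: #{outcome}" }
```

Under these assumptions the loose pattern picks up `Zimbabwe--Economic policy`, but it also fires on ordinary prose that happens to contain a double hyphen, which is the false positive the negative test in this PR guards against.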

}
end
end
end
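
As a rough illustration of the first-match behavior described in the comments above, the sketch below reproduces the pattern check outside of Rails. `lcsh_identifiers` is a hypothetical helper, not a method from this PR, and it uses plain Ruby `empty?` where the class uses ActiveSupport's `present?`.

```ruby
# Pattern copied from Detector::Lcsh#subject_patterns in this PR.
SUBJECT_PATTERNS = { separator: /(.*)\s--\s(.*)/ }.freeze

# Hypothetical stand-in for Detector::Lcsh#term_pattern_checker. It returns a
# hash of pattern name => matched text, skipping patterns that did not match.
def lcsh_identifiers(term)
  SUBJECT_PATTERNS.each_with_object({}) do |(type, pattern), found|
    # Regexp#match returns nil on failure; nil.to_s is "", so strip is safe.
    matched = pattern.match(term).to_s.strip
    found[type] = matched unless matched.empty?
  end
end

p lcsh_identifiers('Space vehicles -- Materials -- Congresses')
p lcsh_identifiers('orange cats like popcorn')
```

Because both `(.*)` groups are greedy, a single-line phrase with several separators still produces one match that spans the whole phrase, which is consistent with the GraphQL tests in this PR expecting the full search term back in the `lcsh` field.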
21 changes: 19 additions & 2 deletions app/models/metrics/algorithms.rb
@@ -9,6 +9,7 @@
# doi :integer
# issn :integer
# isbn :integer
# lcsh :integer
# pmid :integer
# unmatched :integer
# created_at :datetime not null
@@ -48,7 +49,7 @@ def generate(month = nil)
count_matches(SearchEvent.includes(:term))
end
Metrics::Algorithms.create(month:, doi: matches[:doi], issn: matches[:issn], isbn: matches[:isbn],
pmid: matches[:pmid], journal_exact: matches[:journal_exact],
lcsh: matches[:lcsh], pmid: matches[:pmid], journal_exact: matches[:journal_exact],
suggested_resource_exact: matches[:suggested_resource_exact],
unmatched: matches[:unmatched])
end
@@ -79,8 +80,24 @@ def event_matches(event, matches)
ids = match_standard_identifiers(event, matches)
journal_exact = process_journals(event, matches)
suggested_resource_exact = process_suggested_resources(event, matches)
lcshs = match_lcsh(event, matches)

matches[:unmatched] += 1 if ids.identifiers.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
matches[:unmatched] += 1 if ids.identifiers.blank? && lcshs.identifiers.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
end

# Checks for LCSH matches
#
# @param event [SearchEvent] an individual search event to check for matches
# @param matches [Hash] a Hash that keeps track of how many of each algorithm we match
# @return [Detector::Lcsh] the detector instance, whose identifiers hash holds any matched LCSH sub-patterns
def match_lcsh(event, matches)
known_ids = %i[separator]
ids = Detector::Lcsh.new(event.term.phrase)

known_ids.each do |id|
matches[:lcsh] += 1 if ids.identifiers[id].present?
end
ids
end

# Checks for StandardIdentifier matches
1 change: 1 addition & 0 deletions app/models/term.rb
@@ -24,6 +24,7 @@ class Term < ApplicationRecord
def record_detections
Detector::StandardIdentifiers.record(self)
Detector::Journal.record(self)
Detector::Lcsh.record(self)
Detector::SuggestedResource.record(self)

nil
5 changes: 5 additions & 0 deletions db/migrate/20241001205152_add_lcsh_to_metrics_algorithm.rb
@@ -0,0 +1,5 @@
class AddLcshToMetricsAlgorithm < ActiveRecord::Migration[7.1]
def change
add_column :metrics_algorithms, :lcsh, :integer
end
end
3 changes: 2 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions db/seeds.rb
@@ -28,6 +28,7 @@
Detector.find_or_create_by(name: 'DOI')
Detector.find_or_create_by(name: 'ISBN')
Detector.find_or_create_by(name: 'ISSN')
Detector.find_or_create_by(name: 'LCSH')
Detector.find_or_create_by(name: 'PMID')
Detector.find_or_create_by(name: 'Journal')
Detector.find_or_create_by(name: 'SuggestedResource')
@@ -48,6 +49,11 @@
category: Category.find_by(name: 'Transactional'),
confidence: 0.6
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'LCSH'),
category: Category.find_by(name: 'Informational'),
confidence: 0.7
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'PMID'),
category: Category.find_by(name: 'Transactional'),
5 changes: 5 additions & 0 deletions docs/reference/classes.md
@@ -80,6 +80,9 @@ classDiagram
DetectorJournal: partial_term_match()
DetectorJournal: record()

class DetectorLcsh
DetectorLcsh: record()

class DetectorStandardIdentifier
DetectorStandardIdentifier: record()

@@ -105,6 +108,7 @@ classDiagram
namespace Detectors {
class Detector
class DetectorJournal["Detector::Journal"]
class DetectorLcsh["Detector::Lcsh"]
class DetectorStandardIdentifier["Detector::StandardIdentifiers"]
class DetectorSuggestedResource["Detector::SuggestedResource"]
}
@@ -116,6 +120,7 @@ classDiagram
style DetectorCategory fill:#000,stroke:#fc8d62,color:#fc8d62
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62
style DetectorJournal fill:#000,stroke:#fc8d62,color:#fc8d62
style DetectorLcsh fill:#000,stroke:#fc8d62,color:#fc8d62
style DetectorStandardIdentifier fill:#000,stroke:#fc8d62,color:#fc8d62
style DetectorSuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62

29 changes: 29 additions & 0 deletions test/controllers/graphql_controller_test.rb
@@ -111,6 +111,20 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
json['data']['logSearchEvent']['detectors']['suggestedResources'].first['url']
end

test 'search event query can return detected library of congress subject headings' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "Maryland -- Geography") {
detectors {
lcsh
}
}
}' }
json = response.parsed_body

assert_equal 'Maryland -- Geography',
json['data']['logSearchEvent']['detectors']['lcsh'].first
end

test 'search event query can return phrase from logged term' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
@@ -170,6 +184,21 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_in_delta 0.95, json['data']['logSearchEvent']['categories'].first['confidence']
end

test 'term lookup query can return detected library of congress subject headings' do
post '/graphql', params: { query: '{
lookupTerm(searchTerm: "Geology -- Massachusetts") {
detectors {
lcsh
}
}
}' }

json = response.parsed_body

assert_equal('Geology -- Massachusetts',
json['data']['lookupTerm']['detectors']['lcsh'].first)
end

test 'term lookup query can return categorization details for searches that trip a detector' do
post '/graphql', params: { query: '{
lookupTerm(searchTerm: "10.1016/j.physio.2010.12.004") {
5 changes: 5 additions & 0 deletions test/fixtures/detector_categories.yml
@@ -33,3 +33,8 @@ five:
detector: journal
category: transactional
confidence: 0.5

six:
detector: lcsh
category: informational
confidence: 0.7
3 changes: 3 additions & 0 deletions test/fixtures/detectors.yml
@@ -16,6 +16,9 @@ isbn:
issn:
name: 'ISSN'

lcsh:
name: 'LCSH'

pmid:
name: 'PMID'

7 changes: 7 additions & 0 deletions test/fixtures/search_events.yml
@@ -31,6 +31,13 @@ current_month_doi:
current_month_isbn:
term: isbn_9781319145446
source: test
current_month_lcsh:
term: lcsh
source: test
old_month_lcsh:
term: lcsh
source: test
created_at: <%= 1.year.ago %>
current_month_nature_medicine:
term: journal_nature_medicine
source: test
3 changes: 3 additions & 0 deletions test/fixtures/terms.yml
@@ -17,6 +17,9 @@ hi:
pmid_38908367:
phrase: 'TERT activation targets DNA methylation and multiple aging hallmarks. Shim HS, et al. Cell. 2024. PMID: 38908367'

lcsh:
phrase: 'Geology -- Massachusetts'

issn_1075_8623:
phrase: 1075-8623

72 changes: 72 additions & 0 deletions test/models/detector/lcsh_test.rb
@@ -0,0 +1,72 @@
# frozen_string_literal: true

require 'test_helper'

class Detector
class LcshTest < ActiveSupport::TestCase
test 'lcsh detector activates when a separator is found' do
true_samples = [
'Geology -- Massachusetts',
'Space vehicles -- Materials -- Congresses'
]

true_samples.each do |term|
actual = Detector::Lcsh.new(term).identifiers

assert_includes(actual, :separator)
end
end

test 'lcsh detector does nothing in most cases' do
false_samples = [
'orange cats like popcorn',
'hyphenated names like Lin-Manuel Miranda do nothing',
'dashes used as an aside - like this one - do nothing',
'This one should--also not work'
Member Author

This is the test I'm talking about, FWIW, which explicitly sets out that this pattern is not one we're expecting.

]

false_samples.each do |term|
actual = Detector::Lcsh.new(term).identifiers

assert_not_includes(actual, :separator)
end
end

test 'record method does relevant work' do
detection_count = Detection.count
t = terms('lcsh')

Detector::Lcsh.record(t)

assert_equal(detection_count + 1, Detection.count)
end

test 'record does nothing when not needed' do
detection_count = Detection.count
t = terms('isbn_9781319145446')

Detector::Lcsh.record(t)

assert_equal(detection_count, Detection.count)
end

test 'record respects changes to the DETECTOR_VERSION value' do
# Create a relevant detection
Detector::Lcsh.record(terms('lcsh'))

detection_count = Detection.count

# Calling the record method again doesn't do anything, but does not error.
Detector::Lcsh.record(terms('lcsh'))

assert_equal(detection_count, Detection.count)

# Calling the record method after DETECTOR_VERSION is incremented results in a new Detection
ClimateControl.modify DETECTOR_VERSION: 'updated' do
Detector::Lcsh.record(terms('lcsh'))

assert_equal detection_count + 1, Detection.count
end
end
end
end