Merge pull request #113 from MITLibraries/tco-71-lcsh
Add LCSH detector
Showing 15 changed files with 250 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# frozen_string_literal: true | ||
|
||
class Detector | ||
# Detector::Lcsh is a very rudimentary detector for the separator between levels of a Library of Congress Subject | |
# Heading (LCSH). These subject headings follow this pattern: "Social security beneficiaries -- United States" | ||
class Lcsh | ||
attr_reader :identifiers | ||
|
||
# For now the initialize method just needs to run the pattern checker. A space for future development would be to | ||
# write additional methods to look up the detected LCSH for more information, and to confirm that the phrase is | ||
# actually an LCSH. | ||
def initialize(term) | ||
@identifiers = {} | ||
term_pattern_checker(term) | ||
end | ||
|
||
# The record method will consult the set of regex-based detectors that are defined in Detector::Lcsh. Any matches | ||
# will be registered as Detection records. | ||
# | ||
# @note While there is currently only one check within the Detector::Lcsh class, the method is built to anticipate | |
# additional checks in the future. Every such check would be capable of generating a separate Detection record | ||
# (although a single check finding multiple matches would still only result in one Detection). | ||
# | ||
# @return nil | ||
def self.record(term) | ||
results = Detector::Lcsh.new(term.phrase) | ||
|
||
results.identifiers.each_key do | ||
Detection.find_or_create_by( | ||
term:, | ||
detector: Detector.where(name: 'LCSH').first, | ||
detector_version: ENV.fetch('DETECTOR_VERSION', 'unset') | ||
) | ||
end | ||
|
||
nil | ||
end | ||
|
||
private | ||
|
||
def term_pattern_checker(term) | ||
subject_patterns.each_pair do |type, pattern| | ||
@identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present? | ||
end | ||
end | ||
|
||
# This implementation will only detect the first match of a pattern in a long string. For the separator pattern this | ||
# is fine, as we only need to find one (and finding multiples wouldn't change the outcome). If a pattern does come | ||
# along where match counts matter, this should be reconsidered. | ||
def match(pattern, term) | ||
pattern.match(term).to_s.strip | ||
end | ||
|
||
# subject_patterns are regex patterns that can be applied to indicate whether a search string is looking for an LCSH | ||
# string. At the moment there is only one - for the separator character " -- " - but others might be possible if | ||
# there are regex-able vocabulary quirks which might separate subject values from non-subject values. | ||
def subject_patterns | ||
{ | ||
separator: /(.*)\s--\s(.*)/ | ||
} | ||
end | ||
end | ||
end |
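The separator check above can be tried outside Rails. The following is a minimal sketch, not the committed code: `lcsh_identifiers` is a hypothetical helper, and `empty?` stands in for ActiveSupport's `present?` so the snippet runs in plain Ruby. The pattern is the same one defined in `subject_patterns`.

```ruby
# Hypothetical helper (not part of the commit): builds the same kind of
# identifiers hash that Detector::Lcsh#term_pattern_checker populates.
def lcsh_identifiers(phrase)
  # Same pattern as subject_patterns[:separator]: "--" with whitespace on both sides.
  separator = /(.*)\s--\s(.*)/

  # MatchData#to_s returns the whole match; nil.to_s is "" when nothing matches.
  matched = separator.match(phrase).to_s.strip
  matched.empty? ? {} : { separator: matched }
end

samples = [
  'Geology -- Massachusetts',
  'Space vehicles -- Materials -- Congresses',
  'This one should--also not work',
  'dashes used as an aside - like this one - do nothing'
]

samples.each do |phrase|
  detected = lcsh_identifiers(phrase).key?(:separator)
  puts "#{phrase.inspect}: #{detected ? 'separator found' : 'no match'}"
end
```

Because both `(.*)` groups are greedy, a phrase with several separators still produces a single match covering the whole line, which is why one check yields at most one identifier.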
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
class AddLcshToMetricsAlgorithm < ActiveRecord::Migration[7.1] | ||
def change | ||
add_column :metrics_algorithms, :lcsh, :integer | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,9 @@ isbn: | |
issn: | ||
name: 'ISSN' | ||
|
||
lcsh: | ||
name: 'LCSH' | ||
|
||
pmid: | ||
name: 'PMID' | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# frozen_string_literal: true | ||
|
||
require 'test_helper' | ||
|
||
class Detector | ||
class LcshTest < ActiveSupport::TestCase | ||
test 'lcsh detector activates when a separator is found' do | ||
true_samples = [ | ||
'Geology -- Massachusetts', | ||
'Space vehicles -- Materials -- Congresses' | ||
] | ||
|
||
true_samples.each do |term| | ||
actual = Detector::Lcsh.new(term).identifiers | ||
|
||
assert_includes(actual, :separator) | ||
end | ||
end | ||
|
||
test 'lcsh detector does nothing in most cases' do | ||
false_samples = [ | ||
'orange cats like popcorn', | ||
'hyphenated names like Lin-Manuel Miranda do nothing', | ||
'dashes used as an aside - like this one - do nothing', | ||
'This one should--also not work' | ||
] | ||
|
||
false_samples.each do |term| | ||
actual = Detector::Lcsh.new(term).identifiers | ||
|
||
assert_not_includes(actual, :separator) | ||
end | ||
end | ||
|
||
test 'record method does relevant work' do | ||
detection_count = Detection.count | ||
t = terms('lcsh') | ||
|
||
Detector::Lcsh.record(t) | ||
|
||
assert_equal(detection_count + 1, Detection.count) | ||
end | ||
|
||
test 'record does nothing when not needed' do | ||
detection_count = Detection.count | ||
t = terms('isbn_9781319145446') | ||
|
||
Detector::Lcsh.record(t) | ||
|
||
assert_equal(detection_count, Detection.count) | ||
end | ||
|
||
test 'record respects changes to the DETECTOR_VERSION value' do | ||
# Create a relevant detection | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
detection_count = Detection.count | ||
|
||
# Calling the record method again doesn't do anything, but does not error. | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
assert_equal(detection_count, Detection.count) | ||
|
||
# Calling the record method after DETECTOR_VERSION is changed results in a new Detection | |
ClimateControl.modify DETECTOR_VERSION: 'updated' do | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
assert_equal detection_count + 1, Detection.count | ||
end | ||
end | ||
end | ||
end |
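The last test relies on `find_or_create_by` treating `detector_version` as part of the lookup key. The sketch below illustrates that behaviour in plain Ruby under stated assumptions: a `Set` stands in for the `Detection` table, and `record_lcsh` is a hypothetical substitute for `Detector::Lcsh.record`, not the application's method.

```ruby
require 'set'

# Hypothetical stand-in for Detector::Lcsh.record. `detections` is a Set playing
# the role of the Detection table: adding a tuple that is already present is a
# no-op, which mirrors find_or_create_by.
def record_lcsh(detections, phrase)
  return nil unless phrase.match?(/(.*)\s--\s(.*)/)

  detections.add([phrase, 'LCSH', ENV.fetch('DETECTOR_VERSION', 'unset')])
  nil
end

detections = Set.new

ENV['DETECTOR_VERSION'] = 'v1'
record_lcsh(detections, 'Geology -- Massachusetts')
record_lcsh(detections, 'Geology -- Massachusetts') # same version: nothing new
record_lcsh(detections, '9781319145446')            # no separator: nothing recorded

ENV['DETECTOR_VERSION'] = 'v2'
record_lcsh(detections, 'Geology -- Massachusetts') # new version: a second entry

puts detections.size
```

Keying on the version means a repeated run is harmless, while a change to the detector's logic (signalled by a new `DETECTOR_VERSION`) produces fresh records that can be compared against the old ones.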