-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LCSH detector #113
Merged
Merged
Add LCSH detector #113
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# frozen_string_literal: true | ||
|
||
class Detector | ||
# Detector::LCSH is a very rudimentary detector for the separator between levels of a Library of Congress Subject | ||
# Heading (LCSH). These subject headings follow this pattern: "Social security beneficiaries -- United States" | ||
class Lcsh | ||
attr_reader :identifiers | ||
|
||
# For now the initialize method just needs to run the pattern checker. A space for future development would be to | ||
# write additional methods to look up the detected LCSH for more information, and to confirm that the phrase is | ||
# actually an LCSH. | ||
def initialize(term) | ||
@identifiers = {} | ||
term_pattern_checker(term) | ||
end | ||
|
||
# The record method will consult the set of regex-based detectors that are defined in Detector::Lcsh. Any matches | ||
# will be registered as Detection records. | ||
# | ||
# @note While there is currently only one check within the Detector::Lcsh class, the method is build to anticipate | ||
# additional checks in the future. Every such check would be capable of generating a separate Detection record | ||
# (although a single check finding multiple matches would still only result in one Detection). | ||
# | ||
# @return nil | ||
def self.record(term) | ||
results = Detector::Lcsh.new(term.phrase) | ||
|
||
results.identifiers.each_key do | ||
Detection.find_or_create_by( | ||
term:, | ||
detector: Detector.where(name: 'LCSH').first, | ||
detector_version: ENV.fetch('DETECTOR_VERSION', 'unset') | ||
) | ||
end | ||
|
||
nil | ||
end | ||
|
||
private | ||
|
||
def term_pattern_checker(term) | ||
subject_patterns.each_pair do |type, pattern| | ||
@identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present? | ||
end | ||
end | ||
|
||
# This implementation will only detect the first match of a pattern in a long string. For the separator pattern this | ||
# is fine, as we only need to find one (and finding multiples wouldn't change the outcome). If a pattern does come | ||
# along where match counts matter, this should be reconsidered. | ||
def match(pattern, term) | ||
pattern.match(term).to_s.strip | ||
end | ||
|
||
# subject_patterns are regex patterns that can be applied to indicate whether a search string is looking for an LCSH | ||
# string. At the moment there is only one - for the separator character " -- " - but others might be possible if | ||
# there are regex-able vocabulary quirks which might separate subject values from non-subject values. | ||
def subject_patterns | ||
{ | ||
separator: /(.*)\s--\s(.*)/ | ||
} | ||
end | ||
end | ||
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
class AddLcshToMetricsAlgorithm < ActiveRecord::Migration[7.1] | ||
def change | ||
add_column :metrics_algorithms, :lcsh, :integer | ||
end | ||
end |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,9 @@ isbn: | |
issn: | ||
name: 'ISSN' | ||
|
||
lcsh: | ||
name: 'LCSH' | ||
|
||
pmid: | ||
name: 'PMID' | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# frozen_string_literal: true | ||
|
||
require 'test_helper' | ||
|
||
class Detector | ||
class LcshTest < ActiveSupport::TestCase | ||
test 'lcsh detector activates when a separator is found' do | ||
true_samples = [ | ||
'Geology -- Massachusetts', | ||
'Space vehicles -- Materials -- Congresses' | ||
] | ||
|
||
true_samples.each do |term| | ||
actual = Detector::Lcsh.new(term).identifiers | ||
|
||
assert_includes(actual, :separator) | ||
end | ||
end | ||
|
||
test 'lcsh detector does nothing in most cases' do | ||
false_samples = [ | ||
'orange cats like popcorn', | ||
'hyphenated names like Lin-Manuel Miranda do nothing', | ||
'dashes used as an aside - like this one - do nothing', | ||
'This one should--also not work' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the test I'm talking about, FWIW, which explicitly sets out that this pattern is not one we're expecting. |
||
] | ||
|
||
false_samples.each do |term| | ||
actual = Detector::Lcsh.new(term).identifiers | ||
|
||
assert_not_includes(actual, :separator) | ||
end | ||
end | ||
|
||
test 'record method does relevant work' do | ||
detection_count = Detection.count | ||
t = terms('lcsh') | ||
|
||
Detector::Lcsh.record(t) | ||
|
||
assert_equal(detection_count + 1, Detection.count) | ||
end | ||
|
||
test 'record does nothing when not needed' do | ||
detection_count = Detection.count | ||
t = terms('isbn_9781319145446') | ||
|
||
Detector::Lcsh.record(t) | ||
|
||
assert_equal(detection_count, Detection.count) | ||
end | ||
|
||
test 'record respects changes to the DETECTOR_VERSION value' do | ||
# Create a relevant detection | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
detection_count = Detection.count | ||
|
||
# Calling the record method again doesn't do anything, but does not error. | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
assert_equal(detection_count, Detection.count) | ||
|
||
# Calling the record method after DETECTOR_VERSION is incremented results in a new Detection | ||
ClimateControl.modify DETECTOR_VERSION: 'updated' do | ||
Detector::Lcsh.record(terms('lcsh')) | ||
|
||
assert_equal detection_count + 1, Detection.count | ||
end | ||
end | ||
end | ||
end |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider a pattern that did not include the spaces in addition to this one? When I look at other LCSH sources, I see things like
Zimbabwe--Economic policy
which would not be detected by our pattern even though it is an LCSH term copied from their site.If you considered this but were wary to make such a general detector, that is fine but I did think it would be worth asking :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider this, but didn't go that route because I think the instances we're seeing in our traffic all have the separator. I wrote the test explicitly to make sure the other option fails as a way of being clear about what we're aiming for.
That said, I'd also be willing to look back at this in the future to see whether we're missing any relevant traffic.