Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a citation detector #119

Merged
merged 2 commits into from
Oct 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
163 changes: 163 additions & 0 deletions app/models/detector/citation.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
# frozen_string_literal: true

class Detector
# Detector::Citation attempts to identify citations based on the prevalence of individual sub-patterns. It is not
# targeted at a particular citation format, but was designed based on characteristics of five formats: APA, MLA,
# Chicago, Terabian, and IEEE.
#
# It receives a Term object, which is parsed in various ways en route to calculating a final score. Terms with a
# higher score are more citation-like, while a score of 0 indicates a Term that has no hallmarks of being a citation.
# Terms whose score is higher than the REQUIRED_SCORE value can be registered as a Detection.
class Citation
attr_reader :score, :subpatterns, :summary

# Citation patterns are regular expressions which attempt to identify structures that are part of many citations.
# This object is used as part of the pattern_checker method. Some of these patterns may get promoted to the Detector
# model if they prove useful beyond a Citation context.
CITATION_PATTERNS = {
apa_volume_issue: /\d+\(\d+\)/,
no: /no\.\s\d+/,
pages: /\d+-+\d+/,
pp: /pp\.\s\d+/,
vol: /vol\.\s\d+/,
year_parens: /\(\d{4}\)/,
brackets: /\[.*?\]/,
lastnames: /[A-Z][a-z]+[.,]/,
quotes: /".*?"/
}.freeze

# The required score value is the threshold needed for a Term to be officially recorded with a Detection.
REQUIRED_SCORE = 6

# Summary thresholds are used by the calculate_score method. This class counts the number of occurrences of specific
# characters in the @summary instance variable. The thresholds here determine whether any of those counts are high
# enough to contribute to the Term's citation score.
SUMMARY_THRESHOLDS = {
characters: 25,
colons: 2,
commas: 2,
periods: 2,
semicolons: 2,
words: 5
}.freeze

# Detection? is a convenience method to check whether the calculated @score is high enough to qualify as a citation.
#
# @return boolean
def detection?
@score >= REQUIRED_SCORE
end

# The initializer handles the parsing of a Term object, and subsequent population of the @subpatterns, @summary,
# and @score instance variables. @subpatterns contains all the citation components which have been flagged by the
# CITATION_PATTERNS hash. @summary contains counts of how often certain characters or words appear in the Term.
# Finally, the @score value is a summary of how many elements in the subpatterns or summary report were detected.
#
# @note This method can be called directly via Detector::Citation.new(Term). It is also called indirectly via the
# Detector::Citation.record(Term) instance method. This method can be called directly when a Detection is not
# desired.
def initialize(term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about a future state of having tacos run in "not logging mode". Having our detectors work based on Strings instead of Terms would be key to that potential functionality.

The .record methods could still use the Term as those are explicitly part of our write mode processes.

I'm sharing this just as information on some future direction we may head. If you prefer to leave this initialize method using the Term that is fine. Some of our Detectors use Terms and some use Strings. Eventually I want to normalize that as it is a type of technical debt to have them work differently for now good reason. As we haven't formally picked a path forward, either Term is String is acceptable, but I'm leaning towards proposing we adopt Strings for all Detectors even if we use Terms for record to give us the flexibility checking how TACOS would respond to input without having to store that input. The main use cases I have for that desired feature is using the playground without having to store my test queries as Terms as well as potentially opening up TACOS for others to experiment with without us introducing a privacy concern storing data from other systems not in our control. The latter is not planned so the former is what I'd like to decide how to handle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support having a path through the application to allow for querying with strings, but I'd prefer to make that change in a coordinated way. Of the detectors we have now, every example that uses an initialize method (StandardIdentifiers, LCSH, and now Citation) does so based on a Term method - so I'd prefer to keep this consistency. I worry about having a state, even temporarily, where one detector behaves differently from the others.

As part of such a change, I wonder about how to name the methods to clarify what is expected, i.e. something like:

Detector::Citation.new(phrase) # using built-in "new" as a method

Detector::Citation.register_term(term) # method named to clarify expected argument, instead of "record"

But, we can have this discussion more thoroughly in whatever ticket centers that refactoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that makes sense to me.

@subpatterns = {}
@summary = {}
pattern_checker(term.phrase)
summarize(term.phrase)
@score = calculate_score
end

# The record method first runs all of the parsers by running the initialize method. If the resulting score is higher
# than the REQUIRED_SCORE value, then a Detection is registered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit odd to me that the record method is where the decision as to whether the input is a citation or not is made.

I wonder if adding a boolean method "citation?" would be valuable. That would give us more flexibility to report that information to UIs that might want it in the future (I'm fine with not returning anything in GraphQL just yet as we learn more about this Detector's usefulness, but having something in GraphQL that indicated we think it was a citation does feel potentially useful)

def citation?
  @score >= REQUIRED_SCORE
end

Then record could leverage that rather than doing that calculation itself.

return unless cit.citation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like a good method. I'll revise the PR to add this, at least. What do you think about detection? as a name, though? That might be shareable across models?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@detection is what I'm toying with for all pattern matches detected as an instance variable. The expectation would be that it would be an array of name/value pairs.

Having a boolean detection? as a convenience method (which on some models would just be the presence of anything in detection but for this could be a subset that match the score threshold) does seem like a good pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit and tests to add this, and converted the record method to consume it.

#
# @return nil
def self.record(term)
cit = Detector::Citation.new(term)
return unless cit.detection?

Detection.find_or_create_by(
term:,
detector: Detector.where(name: 'Citation').first,
detector_version: ENV.fetch('DETECTOR_VERSION', 'unset')
)

nil
end

private

# This combines the two reports generated by the Citation detector (subpatterns and summary), and calculates the
# final score value from their contents.
#
# Any detected subpattern is counted toward the score (multiple detections do not get counted twice). For example,
# if the brackets pattern finds two matches, it still only adds one to the final score.
#
# For the summary report, each value is compared with a threshold value in the SUMMARY_THRESHOLDS constant. The
# number of values which meet or exceed their threshold are added to the score. As an example, if a search term has
# five words, this value is compared to the word threshold (also five). Because the threshold is met, the score gets
# incremented by one.
#
# @return integer
def calculate_score
summary_score = @summary.count do |key, value|
SUMMARY_THRESHOLDS.key?(key) && value >= SUMMARY_THRESHOLDS[key]
end

summary_score + @subpatterns.length
end

# This calculates the number of characters in the search term. It is called by the summarize method.
def characters(term)
term.length
end

# This counts the number of colons that appear in the search term, because they tend to appear more often in
# citations than in other searches. It is called by the summarize method.
def colons(term)
term.count(':')
end

# This counts the number of commas in the search term. It is called by the summarize method.
def commas(term)
term.count(',')
end

# This builds one of the two main components of the Citation detector - the subpattern report. It uses each of the
# regular expressions in the CITATION_PATTERNS constant, extracting all matches using the scan method.
#
# @return hash
def pattern_checker(term)
CITATION_PATTERNS.each_pair do |type, pattern|
@subpatterns[type.to_sym] = scan(pattern, term) if scan(pattern, term).present?
end
end

# This counts the number of periods in the search term. It is called by the summarize method.
def periods(term)
term.count('.')
end

# This is a convenience method for the scan method, which is used by pattern_checker.
def scan(pattern, term)
term.scan(pattern).map(&:strip)
end

# This counts the semicolons in the search term. It is called by the summarize method.
def semicolons(term)
term.count(';')
end

# This builds one of the two main components of the Citation detector - the summary report. It calls each of the
# methods in the first line - which all return integers - and puts the result as a key-value pair in the @summary
# instance variable.
#
# @return hash
def summarize(term)
%w[characters colons commas periods semicolons words].each do |check|
@summary[check.to_sym] = send(check, term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This took me a bit to follow what was going on even with the docs. It's a solid pattern, but we don't reach for it enough for me to have immediately understood it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to say "call all these methods" ? I was looking for a structure that was efficient without being opaque, so I'm open to hearing that this doesn't quite hit that spot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think what you are doing here is great. send is meant for this. I just had to look it up as we don't use it much.

end
end

# This counts the number of words in the search term. It is called by the summarize method.
def words(term)
term.split.length
end
end
end
1 change: 1 addition & 0 deletions app/models/term.rb
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ class Term < ApplicationRecord
#
# @return nil
def record_detections
Detector::Citation.record(self)
Detector::StandardIdentifiers.record(self)
Detector::Journal.record(self)
Detector::Lcsh.record(self)
Expand Down
6 changes: 6 additions & 0 deletions db/seeds.rb
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,14 @@
Detector.find_or_create_by(name: 'PMID')
Detector.find_or_create_by(name: 'Journal')
Detector.find_or_create_by(name: 'SuggestedResource')
Detector.find_or_create_by(name: 'Citation')

# DetectorCategories
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'Citation'),
category: Category.find_by(name: 'Transactional'),
confidence: 0.3
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'DOI'),
category: Category.find_by(name: 'Transactional'),
Expand Down
5 changes: 5 additions & 0 deletions test/fixtures/detector_categories.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,3 +38,8 @@ six:
detector: lcsh
category: informational
confidence: 0.7

seven:
detector: citation
category: transactional
confidence: 0.3
3 changes: 3 additions & 0 deletions test/fixtures/detectors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@
# created_at :datetime not null
# updated_at :datetime not null
#
citation:
name: 'Citation'

doi:
name: 'DOI'

Expand Down
3 changes: 3 additions & 0 deletions test/fixtures/terms.yml
Original file line number Diff line number Diff line change
Expand Up @@ -37,3 +37,6 @@ suggested_resource_jstor:

multiple_detections:
phrase: 'Environmental and Health Impacts of Air Pollution: A Review. Frontiers in Public Health. PMID: 32154200. DOI: 10.3389/fpubh.2020.00014'

citation:
phrase: "A. Altun, &quot;Understanding hypertext in the context of reading on the web: Language learners' experience,&quot; Current Issues in Education, vol. 6, no. 12, July, 2005. [Online serial]. Available: http://cie.ed.asu.edu/volume6/number12/. [Accessed Dec. 2, 2007]."
Loading