As described in #282, several recent volumes exhibit a problem where certain gaps—namely, a horizontal line under a segment of text that represents a word 'omitted' or 'to be filled in' as on a form—are omitted from TEI deliveries from our typesetter. The lines are present in the PDF but not in the TEI.
An omission like this is fiendishly difficult to detect.
That PR found that this omission is commonly accompanied by a telltale sign: a space preceding a punctuation character. It added Schematron rules to flag such cases. But these rules also produce false positives (sometimes simply typos) and aren't guaranteed to catch every omitted gap.
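To make the heuristic concrete, here is a minimal Python sketch of the same idea, not the Schematron rules themselves. The set of punctuation characters and the function name `find_space_before_punctuation` are assumptions for illustration; the actual rules may cover a different set.

```python
import re

# Assumption: "punctuation" here means these common marks.
# The real Schematron rules may test a different set of characters.
SPACE_BEFORE_PUNCT = re.compile(r"\s+[,.;:!?]")


def find_space_before_punctuation(text):
    """Return (offset, matched_text) pairs where whitespace precedes punctuation."""
    return [(m.start(), m.group()) for m in SPACE_BEFORE_PUNCT.finditer(text)]


if __name__ == "__main__":
    # A dropped blank often leaves a stray space behind, as after "by" here.
    print(find_space_before_punctuation("signed by , on behalf of"))
    # An ordinary typo is flagged too, which is the false-positive problem.
    print(find_space_before_punctuation("The rate was 3 .5 percent"))
```

A check like this only sees the symptom. A blank that was dropped without leaving a stray space goes unnoticed, which is why a second source of evidence is useful.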
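As an alternative to a page-by-page review, a post in the DH Slack alerted me to a utility, pdfplumber, described as follows:

"Plumb a PDF for detailed information about each text character, rectangle, and line... Works best on machine-generated, rather than scanned, PDFs."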
One of the objects that pdfplumber reports on is "lines". Running the utility on a volume known to have blanks, I was happy to find that pdfplumber identifies these lines, or rather all lines in our volumes: lines beneath running heads, footnote separators, and underlined text in table headings. The common feature of the gap lines we're looking for is that they all appear to have a length of "30". I ran the utility on all volumes with PDFs and wrote an XQuery report to reveal the instances.
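For reference, the filtering step might look roughly like the following in Python. This is a sketch under assumptions, not the script actually used: it assumes "length" means the horizontal extent (`x1 - x0`) of a pdfplumber line object in PDF points, and the helper names, the `0.5` tolerance, and the per-page count are hypothetical.

```python
def line_length(line):
    """Horizontal extent of a pdfplumber line object, in PDF points."""
    return abs(line["x1"] - line["x0"])


def is_candidate_gap(line, target=30.0, tolerance=0.5):
    """True for a horizontal line whose length is close to `target`.

    Assumption: the tolerance is illustrative; whether 30 is a reliable
    value for gap lines still needs to be confirmed.
    """
    is_horizontal = abs(line["bottom"] - line["top"]) <= tolerance
    return is_horizontal and abs(line_length(line) - target) <= tolerance


def pages_with_gap_lines(pdf_path, target=30.0, tolerance=0.5):
    """Map page number -> count of candidate gap lines on that page."""
    # pdfplumber is third-party; it is imported here so the helpers
    # above can be used and tested without it installed.
    import pdfplumber

    hits = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            matches = [ln for ln in page.lines
                       if is_candidate_gap(ln, target, tolerance)]
            if matches:
                hits[page.page_number] = len(matches)
    return hits
```

Calling `pages_with_gap_lines("volume.pdf")` (a placeholder filename) would return the pages worth checking against the TEI. Matching within a tolerance rather than on exact equality is a deliberate choice, since coordinates extracted from a PDF are floating-point values.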
Selecting a volume, the report shows each page where a matching line was detected, alongside the corresponding TEI, to help us determine whether the TEI needs to be fixed.
Further testing will be needed to confirm if we can count on the value of "30" for the length of lines. But this appears to be a promising approach for identifying these gaps.
As with the FRUS XPath Explorer, the tool can craft links that open oXygen to the exact location of the page shown, to facilitate editing of the source TEI document.