Add guidelines for handling page/line breaks within/between words #275

Open
joewiz opened this issue Apr 28, 2020 · 1 comment

Comments

@joewiz
Member

joewiz commented Apr 28, 2020

  • prevent inadvertent merging of two words on either end of a page/line break
  • clarify encoding practice to prevent these errors from being introduced
  • consider adding break=yes|no attribute (see https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.breaking.html)
  • add schema checks, raise errors when pb/lb is incorrectly placed (e.g., in cells; as first/last element in divs; etc.)
@joewiz
Member Author

joewiz commented Dec 23, 2020

As developed in recent commits, we now have better (hopefully unambiguous) guidelines for pb and lb elements:

  • Purpose:
    • To facilitate retrieval, transformation, and analysis of encoded texts and associated page images, we capture all page breaks and certain line breaks in FRUS TEI using <pb> (page beginning) and <lb> (line beginning) elements. To ensure high fidelity between scanned pages and digital text and maximize TEI XML consistency and legibility, these elements should be arranged as follows:
  • Where to place the elements:
    • A page beginning element should be placed either (1) immediately before the division or text-bearing block element in which a new page's content begins or (2) at the point in the main stream of text (as opposed to within footnotes) where the new page begins. Record the explicit page number of the new page in the @n attribute (use square brackets if implicit or empty square brackets if the page number is not part of a page stream), the 4-digit padded page scan sequence in the @facs attribute, and the @xml:id as pg_ plus the page number (or pg-seq-____, replacing the underscores with the value of the @facs attribute). (Note: @break=yes|no is under consideration.)
    • A line beginning element is used only when a new line has semantic bearing, such as separating affiliations within closers or lines within a table cell that would otherwise run together and lose their meaning as distinct lines. The element should be placed precisely where the new line begins. In legacy FRUS TEI, the element has sometimes been placed after the complete word that is split, but this should be avoided for new text, since the whitespace guidelines allow us to distinguish between breaks within a word and breaks between two words. No attributes are used. (Note: @break=yes|no is under consideration.)
    • Page and line beginning elements should not be placed as the first or last node in a text-bearing element. Instead, they should be placed immediately before or after the text-bearing element.
    • In tables, page beginning elements should be placed between rows, not within or between cells.
  • How to use whitespace around the elements:
    • A line or page beginning element should have whitespace either (1) both preceding and following it, when the break occurs between words, or (2) neither before nor after it, when the break occurs in the middle of a word.
    • In the first case, strictly speaking, whitespace on either side would suffice, but having it on both sides aids legibility.
  • Schema checks that flag related issues:
    • "A line or page break element should have whitespace both before and after, or neither when the break splits a word"
    • "An lb or pb may not be the first or last element in a div, head, p, table, cell, list, item, quote, signed, or frus:attachment"
  • A transformation scenario, "Fix pb/lb placement; fix whitespace issues", which among other features:
    • Adds the missing whitespace on the other side of pb or lb elements that already have whitespace on one side
    • Moves leading/trailing pb/lb elements inside table cells out into the parent row
    • Moves leading/trailing pb/lb elements inside text-bearing elements before or after the element
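
To make the whitespace rule concrete, here is a minimal Python sketch of the same test that the first schema check describes. It is not the actual Schematron rule from the FRUS schema. The function name `classify_breaks`, the regular expression, and the sample sentence (including its `@n`, `@facs`, and `@xml:id` values) are all assumptions made for illustration. It scans serialized XML text rather than a parsed tree, so it would miss tags whose attribute values contain `>`.

```python
import re

# Matches empty-element pb/lb tags such as <pb n="12"/> or <lb/>.
# (Illustrative pattern; assumes no ">" inside attribute values.)
BREAK_TAG = re.compile(r"<(pb|lb)\b[^>]*/>")


def classify_breaks(xml_text):
    """Classify each pb/lb by the whitespace immediately around it."""
    results = []
    for m in BREAK_TAG.finditer(xml_text):
        # Character just before and just after the tag ("" at string edges).
        before = xml_text[m.start() - 1] if m.start() > 0 else ""
        after = xml_text[m.end()] if m.end() < len(xml_text) else ""
        space_before = before.isspace()
        space_after = after.isspace()
        if space_before and space_after:
            status = "between words"   # case (1) in the guidelines
        elif not space_before and not space_after:
            status = "within word"     # case (2) in the guidelines
        else:
            status = "inconsistent"    # what the schema check would flag
        results.append((m.group(1), status))
    return results


# Hypothetical sample: one break between words, one inside a word,
# and one with whitespace on only one side.
sample = (
    '<p>He met the <pb n="12" facs="0034" xml:id="pg_12"/> Ambassador, '
    'the Secre<lb/>tary, and the Min<lb/> ister.</p>'
)

for tag, status in classify_breaks(sample):
    print(tag, status)
# pb between words
# lb within word
# lb inconsistent
```

Only the third break would be reported. The check cannot tell whether a "within word" break truly splits a word or is a legacy break between two words that lost its whitespace, which is the problem the to-do list below takes up.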

To do:

  • Adapt this prose to frus.odd
  • Schema checks relating to pb/lb elements in the middle of a word. There are a lot of pb/lb that need whitespace to conform to the principle above, but we can't force whitespace in bulk, lest we introduce whitespace within a word. Perhaps we should insert @break="no" attributes into pb/lb elements when we are positive they split a single word, and then we could insert whitespace around the others. Further analysis is needed, but the goal is to avoid manual review if possible, since there are hundreds of thousands of these elements in the corpus.
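
One possible way to approach the `@break="no"` idea without manual review is sketched below. This heuristic is an assumption and not something already decided in this issue: it treats a break as "positively" splitting a word only when the fragments on both sides, joined together, appear in a supplied word list. The function name `mark_confirmed_splits`, the pattern, and the tiny word list are hypothetical, and a real run would need a vetted lexicon plus handling for false positives (for example, two adjacent words that happen to join into a third).

```python
import re

# Letters directly touching a pb/lb on both sides, with no whitespace.
# (Illustrative pattern; assumes no ">" inside attribute values.)
SPLIT_CANDIDATE = re.compile(r"([A-Za-z]+)<(pb|lb)\b([^>]*?)/>([A-Za-z]+)")


def mark_confirmed_splits(xml_text, known_words):
    """Add break="no" where the joined fragments form a known word."""
    def repl(m):
        head, tag, attrs, tail = m.groups()
        # Leave the tag alone if it is already marked or the join is unknown.
        if "break=" in attrs or (head + tail).lower() not in known_words:
            return m.group(0)
        return f'{head}<{tag}{attrs.rstrip()} break="no"/>{tail}'
    return SPLIT_CANDIDATE.sub(repl, xml_text)


# Hypothetical word list and sample text.
known_words = {"secretary", "ambassador"}
sample = 'the Secre<lb/>tary met<pb n="5"/>with the Ambassador'

print(mark_confirmed_splits(sample, known_words))
# the Secre<lb break="no"/>tary met<pb n="5"/>with the Ambassador
```

In this sketch, "Secre" plus "tary" is confirmed and marked, while "met" plus "with" is left untouched so that a later pass could add whitespace around it once it is safe to do so.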

As we work with these new schema and transformation facilities, please share comments or concerns, so we can refine them and put the resulting guidance into frus.odd.
