docs: integrate extraneous text on positions somewhere in the docs #1

Open
claymcleod opened this issue Oct 9, 2024 · 0 comments
claymcleod commented Oct 9, 2024

As I was writing and editing the section on positions, I decided to remove quite a bit of content from the initial draft of the crate documentation. This content is useful in scope, but it may not be 100% correct as included here (my approach when writing the crate was to write down my own understanding of certain things and then do quite a bit of research afterwards to ensure my statements were true). In the case of this content, some thoughts are even outright incomplete.

//! #### Extra
//!
//! * The notation 0-based and 1-based is not good because it takes away focus
//!   from the part of the intuition that actually matters (interbase vs
//!   in-base). Further, the assignment of interbase indices starting at 0 and
//!   in-base positions starting at 1 is an informed, yet arbitrary, decision
//!   that is simply held together by convention.
//! * When you break the model down into nucleotide slots and space slots, it's
//!   clear that distinctions between "fully-closed" and "half-open" intervals
//!   are artifacts of trying to make space + nucleotide slots fit into one
//!   concept, and they are only helpful in the context of resolving the
//!   interval to a nucleotide sequence.
//!     * The situation for in-base coordinates is quite clear: they include
//!       both the start and end nucleotide, and there is no compelling reason
//!       to do otherwise.
//!     * In a strict sense, whether interbase intervals are closed or open is
//!       undefined: this is largely due to the fact that we mostly care about
//!       resolving to nucleotide sequences and the spaces are filtered out, so
//!       we don't care if the last space is included or not (it has no bearing
//!       on what nucleotides are included in the sequence).
//! * Because interbase coordinates do not actually represent nucleotides, they
//!   must be resolved in a left-leaning or right-leaning manner (which is an
//!   assumption that should be explicitly denoted). In the UCSC model, they are
//!   silently assumed to be right leaning, but the interval could just as well
//!   be resolved by being left leaning and inclusive. This assumption is never
//!   mentioned, which is a problem.
//!
//! #### Appendix
//!
//! Below are some scattered thoughts—many of which were written during the
//! original authorship of this documentation—that are related to genomic
//! coordinate systems. They are all tangentially related to the content above
//! but, during the editorial phase, were deemed to not fit the flow and purpose
//! of the documentation above. That said, they are still useful for individuals
//! who really want to dig into coordinate systems or design their own, so they
//! will remain included here. **You can skip this section if you're just
//! interested in learning about the crate.**
//!
//! ##### Intuitions That Arise From Transforming In-Base To Interbase
//!
//! Let's review the diagram of the conceptual model of coordinate systems.
//!
//! ```text
//! ========================== seq0 =========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
//! ║[--1--]║[--2--]║[--3--]║[--4--]║[--5--]║[--6--]║[--7--]║ In-base Positions
//! 0       1       2       3       4       5       6       7 Interbase Positions
//! ```
//!
//! ```text
//! ========================== seq0 ===========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║       ║       ║       ║       ║       ║       ║       ║   In-base Positions
//! 0------>1------>2------>3------>4------>5------>6------>7-- Interbase Positions
//! ```
//!
//! Looking at this diagram and considering the ranges that represent the entire
//! sequence of nucleotides within both coordinate systems, it's clear that the
//! transformation from an in-base coordinate system to an interbase coordinate
//! system can be broken down into two main steps:
//!
//! * The range is shifted forward by half a step (one slot), and
//! * A whole step is prepended to the coordinate system.
//!
//! Another way of conceptualizing this is that the start and end of the in-base
//! range are expanded outward by one half step (one slot), though this
//! intuition doesn't naturally explain why the in-base coordinate system starts
//! at `1` and the interbase starts at `0`.
//!
//! This leads to a couple of key points:
//!
//! * Under both intuitions, the result is that the size of the range increases
//!   by one whole step. This is the key reason why the length of the range is
//!   "easier" to calculate in the interbase system (the length of the range is
//!   simply `end - start`, and you don't have to remember to add `1` to the
//!   result).
//! * If one chooses to combine the space slots and the nucleotide slots
//!   together as a single numbered position (as is often done in practice),
//!   each representation contains both a space slot and a nucleotide slot, but
//!   the order is switched:
//!     * The space slot is _before_ the nucleotide slot within the interbase
//!       system, but
//!     * The space slot is _after_ the nucleotide slot in the in-base system.
//! * The above point is, at least in part, the reason why combining the two
//!   slots together within a singular, numerical position often proves to be
//!   difficult to reason about: the rules change between the two coordinate
//!   systems, and one has to keep that accounting in mind.
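
To make the length and conversion points above concrete, here is a minimal sketch of the two interval kinds. The `InBase` and `Interbase` types and their methods are hypothetical names chosen for illustration, not the crate's API, and the sketch assumes `1 <= start <= end` for in-base intervals and `start <= end` for interbase intervals.

```rust
/// Closed in-base interval: both ends name nucleotides, numbered from 1.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InBase {
    start: u64,
    end: u64,
}

/// Interbase interval: both ends name the spaces between nucleotides,
/// numbered from 0. (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Interbase {
    start: u64,
    end: u64,
}

impl InBase {
    /// Both end nucleotides are included, hence the `+ 1`.
    fn len(&self) -> u64 {
        self.end - self.start + 1
    }

    /// Moves the start back to the space before the first nucleotide. The end
    /// keeps its number, since space `n` sits directly after nucleotide `n`.
    /// Returns `None` if the interval violates `1 <= start <= end`.
    fn to_interbase(self) -> Option<Interbase> {
        if self.start == 0 || self.start > self.end {
            return None;
        }
        Some(Interbase { start: self.start - 1, end: self.end })
    }
}

impl Interbase {
    /// No correction needed: the length is simply `end - start`.
    fn len(&self) -> u64 {
        self.end - self.start
    }

    /// Returns `None` for an empty (or inverted) interval, which covers no
    /// nucleotides and so has no closed in-base equivalent.
    fn to_in_base(self) -> Option<InBase> {
        if self.start >= self.end {
            return None;
        }
        Some(InBase { start: self.start + 1, end: self.end })
    }
}

fn main() {
    let seq0 = "GATATGA";

    // The whole sequence: in-base 1..=7 corresponds to interbase 0..7.
    let whole = InBase { start: 1, end: 7 };
    let whole_ib = whole.to_interbase().expect("valid in-base interval");
    assert_eq!(whole_ib, Interbase { start: 0, end: 7 });
    assert_eq!(whole.len(), whole_ib.len());

    // Nucleotides 3 through 5 of seq0.
    let ib = InBase { start: 3, end: 5 }.to_interbase().expect("valid");
    // Interbase positions line up with Rust's half-open slice ranges.
    let bases = &seq0[ib.start as usize..ib.end as usize];
    println!("{:?} -> {}", ib, bases);
}
```

Under these assumptions, the interbase form can be used directly as a slice range over the sequence, whereas the in-base form needs the `- 1` adjustment on the start first.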
@claymcleod claymcleod changed the title docs: integrate the following text on positions wherever is prudent docs: integrate extraneous text on positions somewhere in the docs Oct 9, 2024