docs: resolve typos, only include one space after a period + remove trailing whitespace #536

Merged on Aug 28, 2024 (2 commits)
30 changes: 15 additions & 15 deletions docs/source/appendices/ga4gh_identifiers.rst
@@ -5,7 +5,7 @@ GA4GH Computed Identifier Alignment

This appendix describes alignment on standard practices for
serializing data, computing digests on serialized data, and
constructing CURIE identifiers from the digests. Essentially, it is a
generalization of the :ref:`computed-identifiers` section.

This mechanism for generating identifiers has been in place
@@ -18,23 +18,23 @@ The GA4GH mission entails structuring, connecting, and sharing data
reliably. A key component of this effort is to be able to *identify*
entities, that is, to associate identifiers with entities. Ideally,
there will be exactly one identifier for each entity, and one entity
for each identifier. Traditionally, identifiers are assigned to
entities, which means that disconnected groups must coordinate on
identifier assignment.

The computed identifier scheme used in VRS computes identifiers
from the data itself. Because identifiers depend on the data, groups
that independently generate the same variation will generate the same
computed identifier for that entity, thereby obviating centralized
identifier systems and enabling identifiers to be used in isolated
settings such as clinical labs.
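
The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: the digest shown is SHA-512 truncated to 24 bytes and base64url-encoded, the sorted-key JSON dump merely stands in for the canonical serialization, and ``computed_identifier`` is a hypothetical helper name.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def computed_identifier(obj: dict, type_prefix: str) -> str:
    """Build a CURIE-style identifier from the data itself (hypothetical helper)."""
    # Assumption: sorted-key, whitespace-free JSON stands in for the
    # canonical serialization; the real rules are more detailed.
    serialized = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return f"ga4gh:{type_prefix}.{sha512t24u(serialized.encode('utf-8'))}"


# Two groups that build the same object independently get the same identifier,
# regardless of the order in which they populated the keys.
a = computed_identifier({"type": "Example", "value": 1}, "VA")
b = computed_identifier({"value": 1, "type": "Example"}, "VA")
print(a == b)
```

Because 24 bytes is a multiple of 3, the base64url string needs no padding characters.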

The computed identifier mechanism is broadly applicable and useful to
the entire GA4GH ecosystem. Adopting a common identifier scheme will
make interoperability of GA4GH entities more obvious to consumers,
will enable the entire organization to share common entity definitions
(such as sequence identifiers), and will enable all GA4GH products to
share tooling that manipulates identified data. In short, it provides
an important consistency within the GA4GH ecosystem.

Here we detail alignment between VRS and other GA4GH products to work
@@ -70,7 +70,7 @@ reference:
GA4GH Digest Keys
#################
When creating computed identifiers from objects, VRS uses a custom schema
attribute, ``ga4ghDigest``, that contains the keys used for filtering out
properties. For example, the Allele JSON Schema:

.. parsed-literal::
@@ -95,8 +95,8 @@ properties. For example, the Allele JSON Schema:

.. note::

The `ga4ghDigest` property names are currently being aligned with the Sequence
Collections effort (see `SeqCol#84 <https://github.com/ga4gh/refget/issues/84>`_)
and may potentially change.
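
As a rough sketch of what key-based filtering means in practice, the snippet below keeps only the listed properties of an object before it is serialized. The key list and the object contents are made-up placeholders, and ``digest_view`` is a hypothetical helper, not a function from any VRS library.

```python
def digest_view(obj: dict, digest_keys: list) -> dict:
    """Return only the properties named in digest_keys, in sorted key order."""
    return {key: obj[key] for key in sorted(digest_keys) if key in obj}


# Placeholder object: "label" is descriptive and should not affect the digest.
example = {
    "type": "Allele",
    "label": "a human-readable note",
    "location": "location-placeholder",
    "state": "state-placeholder",
}

# Assumed key list for illustration; the real keys come from the schema.
print(digest_view(example, ["location", "state", "type"]))
```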

GA4GH Type Prefixes
@@ -114,9 +114,9 @@ We use the following guidelines for type prefixes:

* Prefixes SHOULD be short, approximately 2-4 characters.
* Prefixes SHOULD be used only for concrete classes, not abstract parent classes.
* Prefixes SHOULD be used only for stand-alone classes (e.g. :ref:`Variation`, :ref:`Location`),
not classes that require additional context to be meaningful (e.g. :ref:`Range`, :ref:`SequenceExpression`)
or are primarily used for adding descriptive context to external data types (e.g. :ref:`SequenceReference`).
* A prefix MUST map 1:1 with a schema.

Administration
6 changes: 3 additions & 3 deletions docs/source/appendices/glossary.rst
@@ -12,10 +12,10 @@ Glossary
data.

digest, ga4gh_digest
A digest is a digital fingerprint of a block of binary data. A
digest is always the same size, regardless of the size of the
input data. It is statistically extremely unlikely for two
fingerprints to match when the underlying data are distinct.

identifiable object
An identifiable object in VRS is any data structure for
37 changes: 18 additions & 19 deletions docs/source/appendices/truncated_digest_collision_analysis.rst
@@ -12,7 +12,7 @@ of truncation length.
<https://github.com/biocommons/biocommons.seqrepo/blob/master/docs/Truncated%20Digest%20Collision%20Analysis.ipynb>`__
in `Python SeqRepo library
<https://github.com/biocommons/biocommons.seqrepo>`__ for code and
updates. A fuller explanation is given in [Hart2020]_.


Conclusions
@@ -30,11 +30,11 @@ Conclusions
import hashlib
import math
import timeit

from IPython.display import display, Markdown

from ga4gh.vrs.extras.utils import _format_time

algorithms = {'sha512', 'sha1', 'sha256', 'md5', 'sha224', 'sha384'}


@@ -49,16 +49,16 @@ basis for the Truncated Digest.
def blob(l):
    """return binary blob of length l (POSIX only)"""
    return open("/dev/urandom", "rb").read(l)

def digest(alg, blob):
    md = hashlib.new(alg)
    md.update(blob)
    return md.digest()

def magic_run1(alg, blob):
    t = %timeit -o digest(alg, blob)
    return t

def magic_tfmt(t):
    """format TimeitResult for table"""
    return "{a} ± {s} ([{b}, {w}])".format(
@@ -159,15 +159,15 @@ in a corpus is difficult. Instead, we first seek to solve for
the digests are unique). Because there are only two outcomes,
:math:`P + P' = 1` or, equivalently, :math:`P = 1 - P'`.

For a corpus of size :math:`m=1`, the probability that the digests of
all :math:`m=1` messages are unique is (trivially) 1:

.. math:: P' = s/s = 1

because there are :math:`s` ways to choose the first digest from among
:math:`s` possible values without a collision.

For a corpus of size :math:`m=2`, the probability that the digests of
all :math:`m=2` messages are unique is:

.. math:: P' = 1 \times (\frac{s-1}{s})
@@ -211,7 +211,7 @@ The Taylor series expansion of the exponential function is
.. math:: e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...

For :math:`|x| \ll 1`, the expansion is dominated by the first terms and
therefore :math:`e^x \approx 1 + x`.

In the above expression for :math:`P'`, note that the product term
:math:`(s-i)/s` is equivalent to :math:`1-i/s`. Combining this with the
@@ -270,13 +270,13 @@ collisions.
- Assumptions
- Source/Comparison
* - exact
- :math:`\prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s}`
- :math:`1-P'`
- :math:`1 \le m\le s`
- [1]
* - Taylor approximation on #1
- :math:`e^{-m(m-1)/2s}`
- :math:`1-P'`
- :math:`m \ll s`
- [1]
* - Taylor approximation on #2
@@ -286,7 +286,7 @@ collisions.
- [1]
* - Large square approximation
- :math:`1 - \frac{m^2}{2s}`
- :math:`\frac{m^2}{2s}`
- (same)
- [2] (where :math:`s=2^n`)
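
To get a feel for how close these expressions are, the rows above can be evaluated numerically. The following is a small sketch; the function names and the choice of ``m`` and ``s`` are illustrative assumptions, picked so that ``m`` is much smaller than ``s``.

```python
import math


def p_unique_exact(m: int, s: int) -> float:
    """Exact P': product of (s - i) / s for i = 0 .. m-1."""
    p = 1.0
    for i in range(m):
        p *= (s - i) / s
    return p


def p_unique_taylor(m: int, s: int) -> float:
    """Taylor approximation of P': exp(-m(m-1) / 2s), valid for m << s."""
    return math.exp(-m * (m - 1) / (2 * s))


def p_collision_square(m: int, s: int) -> float:
    """Large square approximation of P: m^2 / 2s."""
    return m**2 / (2 * s)


m, s = 10, 2**20  # illustrative corpus size and digest space
print("exact P         :", 1 - p_unique_exact(m, s))
print("Taylor P        :", 1 - p_unique_taylor(m, s))
print("large square P  :", p_collision_square(m, s))
```

The large square form replaces :math:`m(m-1)` with :math:`m^2`, so it slightly overestimates the collision probability for small :math:`m`.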

@@ -347,20 +347,20 @@ This equation is not used further in this analysis.

def b2B3(b):
    """Convert bits b to bytes, rounded up to a multiple of 3

    We round to a multiple of 3 because the intent is to use Base64 encoding, which is
    most efficient when the input byte length is a multiple of 3. (Otherwise, the resulting
    string is padded with characters that provide no information.)

    """
    return math.ceil(b/8/3) * 3

def B(P, m):
    """return the number of bits needed to achieve a collision probability
    P for m messages

    Assumes m << 2^b.

    """
    b = math.log2(m**2 / P) - 1
    if b < 5 + math.log2(m):
@@ -417,4 +417,3 @@ digest length (bytes) required for expected collision probability :math:`P` over
| 1e+ | 39 | 39 | 36 | 36 | 33 | 33 | 30 | 30 | 30 | 27 | 27 |
| 30 | | | | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
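
The arithmetic behind such a table can be sketched as follows. Starting from :math:`P \approx m^2/2s` with :math:`s = 2^b`, solving for :math:`b` gives :math:`b = \log_2(m^2/P) - 1`. The helper names are hypothetical, and the guard for small ``b`` in the analysis code is not reproduced here because it is not shown in full.

```python
import math


def bits_needed(P: float, m: float) -> float:
    """Digest length in bits for collision probability P over m messages.

    Derived from P ~= m^2 / (2 * 2^b); assumes m << 2^b.
    """
    return math.log2(m**2 / P) - 1


def bytes_multiple_of_3(bits: float) -> int:
    """Convert bits to bytes, rounded up to a multiple of 3 (Base64-friendly)."""
    return math.ceil(bits / 8 / 3) * 3


# Illustrative inputs: one million messages, target probability 1e-15.
bits = bits_needed(1e-15, 1e6)
print(round(bits, 1), "bits ->", bytes_multiple_of_3(bits), "bytes")
```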

12 changes: 6 additions & 6 deletions docs/source/concepts/MolecularVariation/Adjacency.rst
@@ -7,9 +7,9 @@ Adjacency

The Adjacency class was added in v2 to describe structural variation.

The adjacency class is a core concept for structural variation, representing the junction point of
two adjoined molecules. This class can be used on its own (e.g. for junctions of chimeric transcript fusions)
or in higher order structures such as :ref:`DerivativeMolecule` to represent molecules derived from multiple
adjacencies (e.g. for translocations).

Definition and Information Model
@@ -28,7 +28,7 @@ of the provided :ref:`SequenceReference`. These types of adjacencies are common
can be found, for example, on either end of a chromosomal inversion.

To represent this, the :ref:`SequenceLocation` used by each partner of the adjacency is defined using
only one of the `start` or `end` attributes. Defining the location by `start` means that the sequence content
extends right (increases) on the :ref:`SequenceReference`, and defining the location by `end` means that the
sequence content extends left (decreases) on the :ref:`SequenceReference`.
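
A minimal sketch of this rule, using a hypothetical ``PartnerLocation`` data class rather than the actual VRS model classes: exactly one of ``start`` or ``end`` is set, and which one is set determines the direction in which the sequence content extends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartnerLocation:
    """Hypothetical stand-in for one partner's location in an adjacency."""
    sequence: str
    start: Optional[int] = None
    end: Optional[int] = None


def extension_direction(loc: PartnerLocation) -> str:
    """Return 'right' for a start-defined location, 'left' for an end-defined one."""
    if (loc.start is None) == (loc.end is None):
        raise ValueError("exactly one of start or end must be defined")
    return "right" if loc.start is not None else "left"


print(extension_direction(PartnerLocation("seq-placeholder", start=100)))
print(extension_direction(PartnerLocation("seq-placeholder", end=100)))
```

Comparing against ``None`` rather than relying on truthiness keeps a coordinate of ``0`` valid.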

@@ -41,18 +41,18 @@ sequence content extends left (decreases) on the :ref:`SequenceReference`.
.. figure:: ../../images/ex_revcomp_breakpoint.png

**An example Adjacency with a reverse complement partner.** The chromosome 1 sequence extends left from
position 1:87337011 and so is defined by the location `start`. The chromosome 10 sequence *also* extends left
from position 10:36119127 and so is *also* defined by the location `start`. Reading left-to-right along this
adjacency one would expect reference sequence up to the adjacency and reverse complement sequence following.

Normalization
#############

Conventions for ordering sequences and handling ambiguous sequence Adjacencies are described in
:ref:`adjacency-normalization`.

Linker Sequences
################

Intervening sequences between adjoined sequences in an adjacency are called *linker sequences* and may be specified
with a :ref:`SequenceExpression`.
10 changes: 5 additions & 5 deletions docs/source/concepts/MolecularVariation/CisPhasedBlock.rst
@@ -4,7 +4,7 @@ Cis-Phased Block
!!!!!!!!!!!!!!!!

The Cis-Phased Block is a set of Alleles that are found *in-cis*: occurring
on the same physical molecule. The `CisPhasedBlock` structure is useful for
representing genetic *Haplotypes*, which are commonly described with respect
to locations on a gene, a set of nearby genes, or other physically proximal
genetic markers that tend to be transmitted together. Unlike haplotypes, the
@@ -13,14 +13,14 @@ genetic markers that tend to be transmitted together. Unlike haplotypes, the
.. admonition:: New in v2

In VRS v1, a class with the same computational use as the `CisPhasedBlock`
was defined and named the `Haplotype` class. This term is not used to describe
this concept in v2, as the use of the `Haplotype` name created confusion in the
community, due to the additional semantics of the term around genetic linkage
and ancestry. In practice, implementations transitioning from v1 to v2 should
find the `CisPhasedBlock` able to accommodate the same information content
from v1 `Haplotypes`.

Definition and Information Model
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.. include:: ../../def/vrs/CisPhasedBlock.rst
7 changes: 3 additions & 4 deletions docs/source/concepts/MolecularVariation/index.rst
@@ -3,9 +3,9 @@
Molecular Variation
!!!!!!!!!!!!!!!!!!!

VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide
variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence
rearrangements that form derivative molecules.

Collectively, these types of variation are called molecular variation.

@@ -21,4 +21,3 @@ Collectively, these types of variation are called molecular variation.
CisPhasedBlock
Terminus
DerivativeMolecule

8 changes: 4 additions & 4 deletions docs/source/concepts/SystemicVariation/index.rst
@@ -3,9 +3,9 @@
Systemic Variation
!!!!!!!!!!!!!!!!!!

VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide
variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence
rearrangements that form derivative molecules.

Collectively, these types of variation are called molecular variation.

@@ -15,5 +15,5 @@ Collectively, these types of variation are called molecular variation.

.. toctree::
:titlesonly:

CopyNumber
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -69,7 +69,7 @@ def _parse_release_as_version(rls):

# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
19 changes: 9 additions & 10 deletions docs/source/conventions/normalization.rst
@@ -10,20 +10,20 @@ of variation across systems.

In the sequencing community, "normalization" refers to the process of
converting a given sequence variant into a canonical form, typically
by left- or right-shuffling insertion/deletion variants. VRS
normalization extends this concept to all classes of VRS Variation
objects.

Implementations MUST provide a normalize function that accepts *any*
Variation object and returns a normalized Variation. Guidelines for
these functions are below.


General Normalization Rules
@@@@@@@@@@@@@@@@@@@@@@@@@@@

* Object types that do not have explicit VRS normalization rules below
are returned as-is. That is, all types of Variation MUST be
supported, even if such objects are unchanged.
* VRS normalization functions are idempotent: Normalizing a
previously-normalized object returns an equivalent object.
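
The two rules above (pass-through for types without a rule, and idempotence) can be illustrated with a toy dispatcher. Everything in this sketch is hypothetical: the registry, the ``ToySequence`` type, and its upper-casing rule exist only to show the shape of such a function, not how any VRS implementation normalizes.

```python
from typing import Callable, Dict

# Hypothetical registry of per-type rules; not part of any VRS library.
_RULES: Dict[str, Callable[[dict], dict]] = {}


def rule_for(type_name: str):
    """Register a normalization rule for one object type."""
    def decorator(func):
        _RULES[type_name] = func
        return func
    return decorator


@rule_for("ToySequence")
def _normalize_toy(obj: dict) -> dict:
    # Toy rule for illustration only: the canonical form is upper case.
    return {**obj, "sequence": obj["sequence"].upper()}


def normalize(obj: dict) -> dict:
    """Apply the rule for obj's type, or return obj unchanged if none exists."""
    rule = _RULES.get(obj.get("type"))
    return obj if rule is None else rule(obj)


once = normalize({"type": "ToySequence", "sequence": "acgt"})
print(normalize(once) == once)  # idempotence check for the toy rule
```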
@@ -148,9 +148,9 @@ the following normalization rules apply:

#. If the Allele is an ambiguous insertion, determine if it is reference derived.

i. Determine the greatest factor `d` of the `seed length` such that `d` is less than or equal to the
length of the modified `reference sequence`, and there exists a subsequence of length `d`
derived from the modified `reference sequence` that can be circularly expanded to recreate
the modified `alternate sequence`.

#. If a valid factor `d` is found, the insertion is reference-derived.
@@ -176,9 +176,9 @@ the following normalization rules apply:

.. .. figure:: ../images/normalize.png

.. A demonstration of fully justifying an insertion allele.

.. Reproduced from [2]_

**References**

@@ -207,12 +207,11 @@ Adjacency Normalization
when sequence on either side of the adjacency is homologous. This is addressed through expanding
the region on both sides. Precise algorithm to be described.

When expressed on a double-stranded nucleic acid molecule, an adjacency can be represented in a forward
or reverse orientation. To ensure uniqueness of a computed identifier for these concepts, we require
a convention for determining the preferred orientation of such adjacencies. The conventional orientation
will be selected by meeting the following ordered criteria.

1. The first of the adjoined sequences MUST have a forward orientation (location defined by `end`).
2. The adjoined sequence accessions are equal or in ascending lexicographical order.
3. The defined adjoined sequence coordinates are in ascending numerical order.
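
One possible reading of these criteria, written as a checker, is sketched below. It is not the normative algorithm: the ``Partner`` class is a hypothetical stand-in, and applying criterion 3 only when both accessions are equal is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partner:
    """Hypothetical stand-in for one adjoined sequence location."""
    accession: str
    start: Optional[int] = None
    end: Optional[int] = None

    @property
    def coordinate(self) -> int:
        """The single defined coordinate (start or end)."""
        return self.start if self.start is not None else self.end


def is_conventional(first: Partner, second: Partner) -> bool:
    """Check the three ordered criteria against a pair of partners."""
    # 1. The first partner has a forward orientation (defined by `end`).
    if first.end is None or first.start is not None:
        return False
    # 2. Accessions are equal or in ascending lexicographical order.
    if first.accession > second.accession:
        return False
    # 3. Coordinates ascend. Assumption: only compared on the same accession.
    if first.accession == second.accession and first.coordinate > second.coordinate:
        return False
    return True


print(is_conventional(Partner("acc-A", end=100), Partner("acc-B", start=50)))
```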
