diff --git a/docs/source/appendices/ga4gh_identifiers.rst b/docs/source/appendices/ga4gh_identifiers.rst index 05ebe438..0a03e5b1 100644 --- a/docs/source/appendices/ga4gh_identifiers.rst +++ b/docs/source/appendices/ga4gh_identifiers.rst @@ -5,7 +5,7 @@ GA4GH Computed Identifier Alignment This appendix describes alignment on standard practices for serializing data, computing digests on serialized data, and -constructing CURIE identifiers from the digests. Essentially, it is a +constructing CURIE identifiers from the digests. Essentially, it is a generalization of the :ref:`computed-identifiers` section. This mechanism for generating identifiers has been in place @@ -18,23 +18,23 @@ The GA4GH mission entails structuring, connecting, and sharing data reliably. A key component of this effort is to be able to *identify* entities, that is, to associate identifiers with entities. Ideally, there will be exactly one identifier for each entity, and one entity -for each identifier. Traditionally, identifiers are assigned to +for each identifier. Traditionally, identifiers are assigned to entities, which means that disconnected groups must coordinate on identifier assignment. -The computed identifier scheme used in VRS computes identifiers -from the data itself. Because identifers depend on the data, groups -that independently generate the same variation will generate the same -computed identifier for that entity, thereby obviating centralized -identifier systems and enabling identifiers to be used in isolated -settings such as clinical labs. +The computed identifier scheme used in VRS computes identifiers +from the data itself. Because identifiers depend on the data, groups +that independently generate the same variation will generate the same +computed identifier for that entity, thereby obviating centralized +identifier systems and enabling identifiers to be used in isolated +settings such as clinical labs. 
The computed identifier mechanism is broadly applicable and useful to -the entire GA4GH ecosystem. Adopting a common identifier scheme will +the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to -share tooling that manipulate identified data. In short, it provides +share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem. Here we detail alignment between VRS and other GA4GH products to work @@ -70,7 +70,7 @@ reference: GA4GH Digest Keys ################# When creating computed identifiers from objects, VRS uses a custom schema -attribute, ``ga4ghDigest``, that contains the keys used for filtering out +attribute, ``ga4ghDigest``, that contains the keys used for filtering out properties. For example, the Allele JSON Schema: .. parsed-literal:: @@ -95,8 +95,8 @@ properties. For example, the Allele JSON Schema: .. note:: - The `ga4ghDigest` property names are currently being aligned with the Sequence - Collections effort (see `SeqCol#84 `_) + The `ga4ghDigest` property names are currently being aligned with the Sequence + Collections effort (see `SeqCol#84 `_) and may potentially change. GA4GH Type Prefixes @@ -114,9 +114,9 @@ We use the following guidelines for type prefixes: * Prefixes SHOULD be short, approximately 2-4 characters. * Prefixes SHOULD be used only for concrete classes, not abstract parent classes. -* Prefixes SHOULD be used only for stand-alone classes (e.g. :ref:`Variation`, :ref:`Location`), +* Prefixes SHOULD be used only for stand-alone classes (e.g. :ref:`Variation`, :ref:`Location`), not classes that require additional context to be meaningful (e.g. 
:ref:`Range`, :ref:`SequenceExpression`) - or are primarily used for adding descriptive context to external data types (e.g. :ref:`SequenceReference`) + or are primarily used for adding descriptive context to external data types (e.g. :ref:`SequenceReference`) * A prefix MUST map 1:1 with a schema. Administration diff --git a/docs/source/appendices/glossary.rst b/docs/source/appendices/glossary.rst index 3a0d12d5..5340afd0 100644 --- a/docs/source/appendices/glossary.rst +++ b/docs/source/appendices/glossary.rst @@ -12,10 +12,10 @@ Glossary data. digest, ga4gh_digest - A digest is a digital fingerprint of a block of binary data. A + A digest is a digital fingerprint of a block of binary data. A digest is always the same size, regardless of the size of the - input data. It is statistically extremely unlikely for two - fingerprints to match when the underlying data are distinct. + input data. It is statistically extremely unlikely for two + fingerprints to match when the underlying data are distinct. identifiable object An identifiable object in VRS is any data structure for diff --git a/docs/source/appendices/truncated_digest_collision_analysis.rst b/docs/source/appendices/truncated_digest_collision_analysis.rst index 7e9d8e1a..3898482a 100644 --- a/docs/source/appendices/truncated_digest_collision_analysis.rst +++ b/docs/source/appendices/truncated_digest_collision_analysis.rst @@ -12,7 +12,7 @@ of truncation length. `__ in `Python SeqRepo library `__ for code and - updates. A fuller explanation is given in [Hart2020]_. + updates. A fuller explanation is given in [Hart2020]_. Conclusions @@ -30,11 +30,11 @@ Conclusions import hashlib import math import timeit - + from IPython.display import display, Markdown - + from ga4gh.vrs.extras.utils import _format_time - + algorithms = {'sha512', 'sha1', 'sha256', 'md5', 'sha224', 'sha384'} @@ -49,16 +49,16 @@ basis for the Truncated Digest. 
def blob(l): """return binary blob of length l (POSIX only)""" return open("/dev/urandom", "rb").read(l) - + def digest(alg, blob): md = hashlib.new(alg) md.update(blob) return md.digest() - + def magic_run1(alg, blob): t = %timeit -o digest(alg, blob) return t - + def magic_tfmt(t): """format TimeitResult for table""" return "{a} ± {s} ([{b}, {w}])".format( @@ -159,7 +159,7 @@ in a corpus is difficult. Instead, we first seek to solve for the digests are unique). Because there are only two outcomes, :math:`P + P' = 1` or, equivalently, :math:`P = 1 - P'`. -For a corpus of size :math:`m=1`, the probabability that the digests of +For a corpus of size :math:`m=1`, the probability that the digests of all :math:`m=1` messages are unique is (trivially) 1: .. math:: P' = s/s = 1 @@ -167,7 +167,7 @@ all :math:`m=1` messages are unique is (trivially) 1: because there are :math:`s` ways to choose the first digest from among :math:`s` possible values without a collision. -For a corpus of size :math:`m=2`, the probabability that the digests of +For a corpus of size :math:`m=2`, the probability that the digests of all :math:`m=2` messages are unique is: .. math:: P' = 1 \times (\frac{s-1}{s}) @@ -211,7 +211,7 @@ The Taylor series expansion of the exponential function is .. math:: e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ... For :math:`|x| \ll 1`, the expansion is dominated by the first terms and -therecore :math:`e^x \approx 1 + x`. +therefore :math:`e^x \approx 1 + x`. In the above expression for :math:`P'`, note that the product term :math:`(s-i)/s` is equivalent to :math:`1-i/s`. Combining this with the @@ -270,13 +270,13 @@ collisions. 
- Assumptions - Source/Comparison * - exact - - :math:`\prod_\nolimits{i=0}^{m-1} \frac{(s-i)}{s}` + - :math:`\prod_\nolimits{i=0}^{m-1} \frac{(s-i)}{s}` - :math:`1-P'` - :math:`1 \le m\le s` - [1] * - Taylor approximation on #1 - :math:`e^{-m(m-1)/2s}` - - :math:`1-P'` + - :math:`1-P'` - :math:`m \ll s` - [1] * - Taylor approximation on #2 @@ -286,7 +286,7 @@ collisions. - [1] * - Large square approximation - :math:`1 - \frac{m^2}{2s}` - - :math:`\frac{m^2}{2s}` + - :math:`\frac{m^2}{2s}` - (same) - [2] (where :math:`s=2^n`) @@ -347,20 +347,20 @@ This equation is not used further in this analysis. def b2B3(b): """Convert bits b to Bytes, rounded up modulo 3 - + We report modulo 3 because the intent will be to use Base64 encoding, which is most efficient when inputs have a byte length modulo 3. (Otherwise, the resulting string is padded with characters that provide no information.) - + """ return math.ceil(b/8/3) * 3 - + def B(P, m): """return the number of bits needed to achieve a collision probability P for m messages - + Assumes m << 2^b. - + """ b = math.log2(m**2 / P) - 1 if b < 5 + math.log2(m): @@ -417,4 +417,3 @@ digest length (bytes) required for expected collision probability :math:`P` over | 1e+ | 39 | 39 | 36 | 36 | 33 | 33 | 30 | 30 | 30 | 27 | 27 | | 30 | | | | | | | | | | | | +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ - diff --git a/docs/source/concepts/MolecularVariation/Adjacency.rst b/docs/source/concepts/MolecularVariation/Adjacency.rst index ddb44251..cb941cec 100644 --- a/docs/source/concepts/MolecularVariation/Adjacency.rst +++ b/docs/source/concepts/MolecularVariation/Adjacency.rst @@ -7,9 +7,9 @@ Adjacency The Adjacency class was added in v2 to describe structural variation. -The adjacency class is a core concept for structural variation, representing the junction point of +The adjacency class is a core concept for structural variation, representing the junction point of two adjoined molecules. 
This class can be used on its own (e.g. for junctions of chimeric transcript fusions) -or in higher order structures such as :ref:`DerivativMolecule` to represent molecules derived from multiple +or in higher order structures such as :ref:`DerivativeMolecule` to represent molecules derived from multiple adjacencies (e.g. for translocations). Definition and Information Model @@ -28,7 +28,7 @@ of the provided :ref:`SequenceReference`. These types of adjacencies are common can be found, for example, on either end of a chromosomal inversion. To represent this, the :ref:`SequenceLocation` used by each partner of the adjacency is defined using -only one of the `start` or `end` attributes. Defining the location by `start` means that the sequence content +only one of the `start` or `end` attributes. Defining the location by `start` means that the sequence content extends right (increases) on the :ref:`SequenceReference`, and defining the location by `end` means that the sequence content extends left (decreases) on the :ref:`SequenceReference`. @@ -41,18 +41,18 @@ sequence content extends left (decreases) on the :ref:`SequenceReference`. .. figure:: ../../images/ex_revcomp_breakpoint.png **An example Adjacency with a reverse complement partner.** The chromosome 1 sequence extends left from - position 1:87337011 and so is defined by the location `start`. The chromosome 10 sequence *also* extends left + position 1:87337011 and so is defined by the location `start`. The chromosome 10 sequence *also* extends left from position 10:36119127 and so is *also* defined by the location `start`. Reading left-to-right along this adjacency one would expect reference sequence up to the adjacency and reverse complement sequence following. Normalization ############# -Conventions for ordering sequences and handling ambiguous sequence Adjacencies are described in +Conventions for ordering sequences and handling ambiguous sequence Adjacencies are described in :ref:`adjacency-normalization`. 
Linker Sequences ################ -Intervening sequences between adjoined sequences in an adjacency are called *linker sequences* and may be specified +Intervening sequences between adjoined sequences in an adjacency are called *linker sequences* and may be specified with a :ref:`SequenceExpression`. diff --git a/docs/source/concepts/MolecularVariation/CisPhasedBlock.rst b/docs/source/concepts/MolecularVariation/CisPhasedBlock.rst index 590121c9..d6a98629 100644 --- a/docs/source/concepts/MolecularVariation/CisPhasedBlock.rst +++ b/docs/source/concepts/MolecularVariation/CisPhasedBlock.rst @@ -4,7 +4,7 @@ Cis-Phased Block !!!!!!!!!!!!!!!! The Cis-Phased Block is a set of Alleles that are found *in-cis*: occurring -on the same physical molecule. The `CisPhasedBlock` structure is useful for +on the same physical molecule. The `CisPhasedBlock` structure is useful for representing genetic *Haplotypes*, which are commonly described with respect to locations on a gene, a set of nearby genes, or other physically proximal genetic markers that tend to be transmitted together. Unlike haplotypes, the @@ -13,14 +13,14 @@ genetic markers that tend to be transmitted together. Unlike haplotypes, the .. admonition:: New in v2 In VRS v1, a class with the same computational use as the `CisPhasedBlock` - was defined and named the `Haplotype` class. This term is not used to describe + was defined and named the `Haplotype` class. This term is not used to describe this concept in v2, as the use of the `Haplotype` name created confusion in the - community, due to the additional semantics of the term around genetic linkage - and ancestry. In practice, implmentations transitioning from v1 to v2 should + community, due to the additional semantics of the term around genetic linkage + and ancestry. In practice, implementations transitioning from v1 to v2 should find the `CisPhasedBlock` able to accommodate the same information content from v1 `Haplotypes`. 
Definition and Information Model @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -.. include:: ../../def/vrs/CisPhasedBlock.rst \ No newline at end of file +.. include:: ../../def/vrs/CisPhasedBlock.rst diff --git a/docs/source/concepts/MolecularVariation/index.rst b/docs/source/concepts/MolecularVariation/index.rst index 17c41b83..836a3cae 100644 --- a/docs/source/concepts/MolecularVariation/index.rst +++ b/docs/source/concepts/MolecularVariation/index.rst @@ -3,9 +3,9 @@ Molecular Variation !!!!!!!!!!!!!!!!!!! -VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide -variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence -rearrangments that form derivative molecules. +VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide +variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence +rearrangements that form derivative molecules. Collectively, these types of variation are called molecular variation. @@ -21,4 +21,3 @@ Collectively, these types of variation are called molecular variation. CisPhasedBlock Terminus DerivativeMolecule - \ No newline at end of file diff --git a/docs/source/concepts/SystemicVariation/index.rst b/docs/source/concepts/SystemicVariation/index.rst index 0f4cf40a..fed6c01c 100644 --- a/docs/source/concepts/SystemicVariation/index.rst +++ b/docs/source/concepts/SystemicVariation/index.rst @@ -3,9 +3,9 @@ Systemic Variation !!!!!!!!!!!!!!!!!! -VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide -variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence -rearrangments that form derivative molecules. 
+VRS currently covers many classes of variation that are defined on a contiguous molecule such as single nucleotide +variants (SNVs), multi-nucleotide variants (MNVs), indels, repeats, haplotypes, breakpoints, and sequence +rearrangements that form derivative molecules. Collectively, these types of variation are called molecular variation. @@ -15,5 +15,5 @@ Collectively, these types of variation are called molecular variation. .. toctree:: :titlesonly: - + CopyNumber diff --git a/docs/source/conf.py b/docs/source/conf.py index c62e560d..f2a99dae 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -69,7 +69,7 @@ def _parse_release_as_version(rls): # -- Options for HTML output ------------------------------------------------- -# The theme to use for HTML and HTML Help pages. See the documentation for +# The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # html_theme = 'sphinx_rtd_theme' diff --git a/docs/source/conventions/computed_identifiers.rst b/docs/source/conventions/computed_identifiers.rst index 436b5cc1..af53211a 100644 --- a/docs/source/conventions/computed_identifiers.rst +++ b/docs/source/conventions/computed_identifiers.rst @@ -31,7 +31,7 @@ A VRS Computed Identifier for a VRS concept is computed as follows: a rationale. The following diagram depicts the operations necessary to generate a -computed identifier. These operations are described in detail in the +computed identifier. These operations are described in detail in the subsequent sections. @@ -41,17 +41,17 @@ subsequent sections. Serialization, Digest, and Computed Identifier Operations Entities are shown in gray boxes. Functions are denoted by bold - italics. The yellow, green, and blue boxes, corresponding to the + italics. The yellow, green, and blue boxes, corresponding to the ``sha512t24u``, ``ga4gh_digest``, and ``ga4gh_identify`` functions - respectively, depict the dependencies among functions. 
``SHA512`` + respectively, depict the dependencies among functions. ``SHA512`` is `SHA-512`_ truncated to 24 bytes (192 bits), using the SHA-512 - initialization vector. base64url_ is the official name of the + initialization vector. base64url_ is the official name of the variant of `Base64`_ encoding that uses a URL-safe character set. [`figure source `__] .. note:: Most implementation users will need only the - ``ga4gh_identify`` function. We describe the + ``ga4gh_identify`` function. We describe the ``ga4gh_serialize``, ``ga4gh_digest``, and ``sha512t24u`` functions here primarily for implementers. @@ -63,29 +63,22 @@ Implementations MUST adhere to the following requirements: * Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH - Computed Identifiers. Implementations MUST NOT use any other + Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier. -* When computing identifiers, implementations MUST ensure that each - nested :ref:`Ga4ghIdentifiableObject` is referenced with a GA4GH +* When computing identifiers, implementations MUST ensure that each + nested :ref:`Ga4ghIdentifiableObject` is referenced with a GA4GH Computed Identifier. -.. note:: The GA4GH schema MAY be used with identifiers from any - namespace. For example, a SequenceLocation may be defined - using a `sequence_id` = ``refseq:NC_000019.10``. However, - an implementation of the Computed Identifier algorithm MUST - first translate sequence accessions to GA4GH RefGet ``SQ`` - accessions to be compliant with this specification. - .. 
admonition:: New in v2 - + In VRS v2, all objects now inherit from :ref:`Entity`, providing a - means by which common expressions and accessions for VRS objects can - be provided in other fields as decorative metadata, alongside object + means by which common expressions and accessions for VRS objects can + be provided in other fields as decorative metadata, alongside object IDs. Implementations may freely implement such fields without impacting computed identifiers. Implementations are therefore encouraged (but not - required) to use the ``ID`` field strictly for computed identifiers and + required) to use the *id* field strictly for computed identifiers and use decorative fields for alternate accessions, to reduce computational complexity. @@ -95,10 +88,10 @@ Digest Serialization @@@@@@@@@@@@@@@@@@@@ Digest serialization converts a VRS object into a binary representation -in preparation for computing a digest of the object. The Digest +in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will -also be identical. |VRS| provides validation tests to ensure +also be identical. |VRS| provides validation tests to ensure compliance. VRS uses the JSON Canonicalization Scheme (`RFC 8785`_) to @@ -106,7 +99,7 @@ serialize JSON data, and includes additional preprocessing steps to ensure computed digests are not impacted by decorative metadata. .. admonition:: New in V2 - + Beginning in VRS v2, object value data and descriptive metadata may be passed in the same object, providing a means for sharing commonly expected annotations (e.g. a "Ref Allele") on VRS objects. 
Read @@ -132,7 +125,7 @@ The second step is to JSON serialize the message content following the * exclude insignificant whitespace, as defined in `RFC8785§3.2.1 `__ * order all keys by Unicode Character Set values - * use predefined JSON control character codes when available, + * use predefined JSON control character codes when available, as defined in `RFC8785§3.2.2.1 `__ The criteria for the digest serialization method was that it must be @@ -145,39 +138,45 @@ language. .. code:: ipython3 - allele = models.Allele(location=models.SequenceLocation( - sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", - interval=simple_interval), - state=models.SequenceState(sequence="T")) + allele = models.Allele( + location=models.SequenceLocation( + end=44908822, + start=44908821, + sequenceReference=models.SequenceReference( + refgetAccession="SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl" + ) + ), + state=models.LiteralSequenceExpression(sequence=models.SequenceString("T")) + ) ga4gh_serialize(allele) Gives the following *binary* (UTF-8 encoded) data: .. parsed-literal:: - {"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"} + {"location":"wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"} For comparison, here is one of many possible JSON serializations of the same object: .. code:: ipython3 - allele.for_json() + allele.model_dump(exclude_none=True) .. 
parsed-literal:: { "location": { - "interval": { - "end": 44908822, + "type": "SequenceLocation", + "sequenceReference": { + "type": "SequenceReference", + "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl" + }, "start": 44908821, - "type": "SimpleInterval" - }, - "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", - "type": "SequenceLocation" + "end": 44908822 }, "state": { - "sequence": "T", - "type": "SequenceState" + "type": "LiteralSequenceExpression", + "sequence": "T" }, "type": "Allele" } @@ -190,7 +189,7 @@ Truncated Digest (sha512t24u) @@@@@@@@@@@@@@@@@@@@@@@@@@@@@ The sha512t24u truncated digest algorithm [Hart2020]_ computes an ASCII digest -from binary data. The method uses two well-established standard +from binary data. The method uses two well-established standard algorithms, the `SHA-512`_ hash function, which generates a binary digest from binary data, and a URL-safe variant of `Base64`_ encoding, which encodes binary data using printable characters. @@ -199,7 +198,7 @@ Computing the sha512t24u truncated digest for binary data consists of three steps: 1. Compute the `SHA-512`_ digest of a binary data. -2. Truncate the digest to the left-most 24 bytes (192 bits). See +2. Truncate the digest to the left-most 24 bytes (192 bits). See :ref:`truncated-digest-collision-analysis` for the rationale for 24 bytes. 3. Encode the truncated digest as a base64url_ ASCII string. @@ -254,11 +253,11 @@ Type prefixes used by VRS are: SL, SequenceLocation SQ, Sequence (`RefGet `_) -For example, the identifer for the allele example under :ref:`digest-serialization` gives: +For example, the identifier for the allele example under :ref:`digest-serialization` gives: .. parsed-literal:: - ga4gh\:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH\_ + ga4gh\:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt\_ References @@ -270,4 +269,4 @@ References e0239883. `doi:10.1371/journal.pone.0239883 `__ -.. _RFC 8785: https://datatracker.ietf.org/doc/html/rfc8785 \ No newline at end of file +.. 
_RFC 8785: https://datatracker.ietf.org/doc/html/rfc8785 diff --git a/docs/source/conventions/example.rst b/docs/source/conventions/example.rst index 7dfb3345..65e1e181 100644 --- a/docs/source/conventions/example.rst +++ b/docs/source/conventions/example.rst @@ -4,7 +4,7 @@ Example !!!!!!! This section provides a complete, language-neutral example of -essential features of VRS. In this example, we will translate an +essential features of VRS. In this example, we will translate an HGVS-formatted variant, ``NC_000019.10:g.44908822C>T``, into its VRS format and assign a globally unique identifier. @@ -19,66 +19,42 @@ reference nucleotide ``C`` to ``T``. In VRS, a contiguous change is represented using an :ref:`allele` object, which is composed of a :ref:`Location ` and of the -:ref:`State ` at that location. Location and State are +:ref:`State ` at that location. Location and State are abstract concepts: VRS is designed to accommodate many kinds of -Locations based on sequence position, gene names, cytogentic bands, or +Locations based on sequence position, gene names, cytogenetic bands, or other ways of describing locations. Similarly, State may refer to a specific sequence change, a contiguous repeated sequence, or a sequence derived from another source. In this example, we will use a :ref:`SequenceLocation`, which is -composed of a sequence identifier and a :ref:`SequenceInterval`. +composed of a :ref:`SequenceReference` and start and end coordinates. -In VRS, all identifiers are a |CURIE|. Therefore, NC_000013.11 MUST be -written as the string ``refseq:NC_000019.10`` to make explicit that -this sequence is from `RefSeq -`__ . VRS does not restrict -which data sources may be used, but does recommend using prefixes from -`identifiers.org `_. +In VRS, the :ref:`SequenceReference` object's *refgetAccession* +attribute MUST use a `GA4GH RefGet +`_ identifier. +Therefore, ``NC_000019.10`` MUST be written as the string +``SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl``. 
-VRS uses :ref:`inter-residue-coordinates-design`. Inter-residue -coordinates *always* use intervals to refer to sequence spans. For +VRS uses :ref:`inter-residue-coordinates-design`. Inter-residue +coordinates *always* use intervals to refer to sequence spans. For the purposes of this example, inter-residue coordinates *look* like the -more familiar 0-based, right-open numbering system. (Please read +more familiar 0-based, right-open numbering system. (Please read about :ref:`inter-residue-coordinates-design` if you are interested in the significant advantages of this design choice over other coordinate systems.) -The :ref:`SequenceInterval` for the position ``44908822`` is +The :ref:`SequenceLocation` for the position ``44908822`` is: .. code-block:: json { - "end": { - "type": "Number", - "value": 44908822 + "type": "SequenceLocation", + "sequenceReference": { + "type": "SequenceReference", + "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl" }, - "start": { - "type": "Number", - "value": 44908821 - }, - "type": "SequenceInterval" - } - -The :ref:`SequenceLocation` is constructed from a sequence identifier -and the above interval. - -.. 
code-block:: json - - { - "interval": { - "end": { - "type": "Number", - "value": 44908822 - }, - "start": { - "type": "Number", - "value": 44908821 - }, - "type": "SequenceInterval" - }, - "sequence_id": "refseq:NC_000019.10", - "type": "SequenceLocation" + "start": 44908821, + "end": 44908822 } A :ref:`LiteralSequenceExpression` object consists simply of the replacement sequence, as follows: @@ -98,19 +74,13 @@ LiteralSequenceExpressions respectively: { "location": { - "interval": { - "end": { - "type": "Number", - "value": 44908822 - }, - "start": { - "type": "Number", - "value": 44908821 - }, - "type": "SequenceInterval" + "type": "SequenceLocation", + "sequenceReference": { + "type": "SequenceReference", + "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl" }, - "sequence_id": "refseq:NC_000019.10", - "type": "SequenceLocation" + "start": 44908821, + "end": 44908822 }, "state": { "sequence": "T", @@ -125,14 +95,14 @@ VRS JSON Schema. .. note:: VRS is verbose! The goal of VRS is to provide an extensible framework for representation of sequence variation in - computers. VRS objects are readily parsable and have precise + computers. VRS objects are readily parsable and have precise meaning, but are often larger than other representations and - are typically less readable by humans. This tradeoff is + are typically less readable by humans. This tradeoff is intentional! -Generate a computed identifer +Generate a computed identifier @@@@@@@@@@@@@@@@@@@@@@@@@@@@@ A key feature of VRS is an easily-implemented algorithm to @@ -145,67 +115,36 @@ labs). The VRS computed identifier procedure requires that all nested :term:`identifiable objects ` are expressed using -computed identifiers. Using GA4GH sequence identifiers collapses +computed identifiers. Using GA4GH sequence identifiers collapses differences between alleles due to trivial differences in reference -naming. 
The same variation reported on NC_000019.10, CM000681.2, GRCh38:19, GRCh38.p13:19 would appear to be distinct variation; using -a digest identifer will ensure that variation is reported on a single -sequence identifier. Furthermore, using digest-based sequence +a digest identifier will ensure that variation is reported on a single +sequence identifier. Furthermore, using digest-based sequence identifiers enables the use of custom reference sequences. .. important:: VRS permits the use of conventional sequence accessions - from RefSeq, Ensemble, or other sources. However, when - generating copmuted identifiers, implementations MUST - use GA4GH-sequence accessions. - -In this example, the sequence identifier ``refseq:NC_000019.10`` MUST -be transformed into digest-based identifer -``ga4gh:GS.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl`` as described in -:ref:`computed-identifiers`. In practice, implmentations should -precompute sequence digests or should use an existing service that -does so. (See :ref:`required-data` for a description of data that are -needed to implement VRS.) Subsitituing the GA4GH sequence identifier -into the Allele's ``location.sequence_id`` attribute gives: - -.. code-block:: json - - { - "location": { - "interval": { - "end": { - "type": "Number", - "value": 44908822 - }, - "start": { - "type": "Number", - "value": 44908821 - }, - "type": "SequenceInterval" - }, - "sequence_id": "ga4gh:GS.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", - "type": "SequenceLocation" - }, - "state": { - "sequence": "T", - "type": "LiteralSequenceExpression" - }, - "type": "Allele" - } - + from RefSeq, Ensembl, or other sources by annotating the + :ref:`SequenceReference` object's *id* attribute. When + generating computed identifiers, the + :ref:`SequenceReference` object's *refgetAccession* + attribute MUST use a `GA4GH RefGet + `_ + identifier. The first step in constructing a computed identifier is to create a -binary digest serialization of the Allele. 
Details are provided in -:ref:`computed-identifiers`. For this example, the *binary* (ASCII +binary digest serialization of the Allele. Details are provided in +:ref:`computed-identifiers`. For this example, the *binary* (ASCII encoded) object looks like: .. code-block:: text - - {"location":"esDSArZQC-Sx-96ZZzHnzAVNOc439oE5","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"} + + {"location":"wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"} .. important:: The GA4GH binary digest serialization process imposes constraints that guarantee that different implementations will generate the same binary "blob" - for a given object. Do not confuse binary digest + for a given object. Do not confuse binary digest serialization with JSON serialization, which is used elsewhere with VRS schema. @@ -213,7 +152,7 @@ The GA4GH digest for the above blob is computed using the first 192 bits (24 bytes) of the `SHA-512`_ digest, `base64url`_ encoded. Conceptually, the function is ``base64url( sha512( blob )[:24] )``. In this example, the value returned is -``_YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc``. +``0AePZIWZUNsUlQTamyLrjm2HWUw2opLt``. A GA4GH Computed Identifier has the form:: @@ -222,12 +161,12 @@ A GA4GH Computed Identifier has the form:: The ``type_prefix`` for a VRS Allele is ``VA``, which results in the following computed identifier for our example:: - ga4gh:VA._YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc + ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt -Importantly, GA4GH computed identifers may be used literally (without +Importantly, GA4GH computed identifiers may be used literally (without escaping) in URIs. -Variation and Location objects contain an OPTIONAL ``_id`` attribute +Variation and Location objects contain an OPTIONAL *id* attribute which implementations may use to store any CURIE-formatted identifier. 
*If* an implementation returns a computed identifier with objects, the object might look like the following: @@ -235,36 +174,29 @@ object might look like the following: .. code-block:: json { - "_id": "ga4gh:VA._YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc", + "id": "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt", "location": { - "interval": { - "end": { - "type": "Number", - "value": 44908822 - }, - "start": { - "type": "Number", - "value": 44908821 - }, - "type": "SequenceInterval" + "type": "SequenceLocation", + "sequenceReference": { + "type": "SequenceReference", + "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl" }, - "sequence_id": "refseq:NC_000019.10", - "type": "SequenceLocation" + "start": 44908821, + "end": 44908822 }, "state": { - "sequence": "T", - "type": "LiteralSequenceExpression" - }, - "type": "Allele" + "type": "LiteralSequenceExpression", + "sequence": "T" + } } This example provides a full VRS-compliant Allele with a computed identifier. -.. note:: The ``_id`` attribute is optional. If it is used, the value - MUST be a CURIE, but it does NOT need to be a GA4GH Computed - Identifier. Applications MAY choose to implement their own - identifier scheme for private or public use. For example, - the above ``_id`` could be a serial number assigned by an +.. note:: The *id* attribute is optional. If it is used, the value + MUST be a string, but it does NOT need to be a GA4GH Computed + Identifier. Applications MAY choose to implement their own + identifier scheme for private or public use. For example, + the above *id* could be a serial number assigned by an application, such as ``acmecorp:v0000123``. @@ -274,18 +206,14 @@ What's Next? This example has shown a full example for a relatively simple case. VRS provides a framework that will enable much more complex variation. Please see :ref:`future-plans` for a discussion of variation classes -that are intened in the near future. +that are intended in the near future. 
The :ref:`implementations` section lists libraries and packages that implement VRS. VRS objects are `value objects -`__. An important +`__. An important consequence of this design choice is that data should be associated *with* VRS objects via their identifiers rather than embedded *within* -those objects. The appendix contains an example of :ref:`associating +those objects. The appendix contains an example of :ref:`associating annotations with variation `. - - - - diff --git a/docs/source/conventions/normalization.rst b/docs/source/conventions/normalization.rst index 46bd9ac3..c72c4949 100644 --- a/docs/source/conventions/normalization.rst +++ b/docs/source/conventions/normalization.rst @@ -10,12 +10,12 @@ of variation across systems. In the sequencing community, "normalization" refers to the process of converting a given sequence variant into a canonical form, typically -by left- or right-shuffling insertion/deletion variants. VRS +by left- or right-shuffling insertion/deletion variants. VRS normalization extends this concept to all classes of VRS Variation objects. Implementations MUST provide a normalize function that accepts *any* -Variation object and returns a normalized Variation. Guidelines for +Variation object and returns a normalized Variation. Guidelines for these functions are below. @@ -23,7 +23,7 @@ General Normalization Rules @@@@@@@@@@@@@@@@@@@@@@@@@@@ * Object types that do not have explicit VRS normalization rules below - are returned as-is. That is, all types of Variation MUST be + are returned as-is. That is, all types of Variation MUST be supported, even if such objects are unchanged. * VRS normalization functions are idempotent: Normalizing a previously-normalized object returns an equivalent object. @@ -148,9 +148,9 @@ the following normalization rules apply: #. If the Allele is an ambiguous insertion, determine if it is reference derived. - i. 
Determine the greatest factor `d` of the `seed length` such that `d` is less than or equal to the - length of the modified `reference sequence`, and there exists a subsequence of length `d` - derived from the modified `reference sequence` that can be circularly expanded to recreate + i. Determine the greatest factor `d` of the `seed length` such that `d` is less than or equal to the + length of the modified `reference sequence`, and there exists a subsequence of length `d` + derived from the modified `reference sequence` that can be circularly expanded to recreate the modified `alternate sequence`. #. If a valid factor `d` is found, the insertion is reference-derived. @@ -176,9 +176,9 @@ the following normalization rules apply: .. .. figure:: ../images/normalize.png -.. A demonstration of fully justifying an insertion allele. +.. A demonstration of fully justifying an insertion allele. -.. Reproduced from [2]_ +.. Reproduced from [2]_ **References** @@ -207,7 +207,7 @@ Adjacency Normalization when sequence on either side of the adjacency is homologous. This is addressed through expanding the region on both sides. Precise algorithm to be described. -When expressed on a double-stranded nucleic acid molecule, an adjacency can be represented in a forward +When expressed on a double-stranded nucleic acid molecule, an adjacency can be represented in a forward or reverse orientation. To ensure uniqueness of a computed identifier for these concepts, we require a convention for determining the preferred orientation of such adjacencies. The conventional orientation will be selected by meeting the following ordered criteria. @@ -215,4 +215,3 @@ will be selected by meeting the following ordered criteria. 1. The first of the adjoined sequences MUST have a forward orientation (location defined by `end`). 2. The adjoined sequence accessions are equal or in ascending lexicographical order. 3. The defined adjoined sequence coordinates are in ascending numerical order. 
- diff --git a/docs/source/conventions/required_data.rst b/docs/source/conventions/required_data.rst index 5d3bbfdf..ea3ae269 100644 --- a/docs/source/conventions/required_data.rst +++ b/docs/source/conventions/required_data.rst @@ -4,10 +4,10 @@ Required External Data !!!!!!!!!!!!!!!!!!!!!! All VRS implementations will require external data regarding -sequences and sequence metadata. The choices of data sources and -access methods are left to implementations. This section provides +sequences and sequence metadata. The choices of data sources and +access methods are left to implementations. This section provides guidance about how to implement required data and helps implementers -estimate effort. This section is descriptive only: it is not intended +estimate effort. This section is descriptive only: it is not intended to impose requirements on interface to, or sources of, external data. For clarity and completeness, this section also describes the contexts in which external data are used. @@ -32,14 +32,14 @@ Contexts * **Normalization** During :ref:`normalization`, implementations will need access to sequence length and sequence contexts. - + Data Services @@@@@@@@@@@@@ The following table summarizes data required in the above contexts: -.. list-table:: Data Service Desciptions +.. list-table:: Data Service Descriptions :header-rows: 1 :class: reece-wrap @@ -56,7 +56,7 @@ The following table summarizes data required in the above contexts: - normalization * - identifier translation - For a given sequence identifier and target namespace, return - all identifiers in the target namespace that are equivelent to + all identifiers in the target namespace that are equivalent to the given identifier. - Conversion to/from other formats @@ -88,47 +88,63 @@ The :ref:`impl-vrs-python` `DataProxy class `__ provides an example of this design pattern and sample replies. 
|vrs-python| implements the DataProxy interface using a local -|seqrepo| instance backend and using a |seqrepo_rs| backend. +|seqrepo| instance backend and using a |seqrepo_rs| backend. Examples ######## The following examples are taken from |notebooks|: +To create the SeqRepoDataProxy: + .. code:: ipython3 - from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy - seqrepo_rest_service_url = "http://localhost:5000/seqrepo" - dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url) - - def get_sequence(identifier, start=None, end=None): - """returns sequence for given identifier, optionally limited - to inter-residue interval""" - return dp.get_sequence(identifier, start, end) - def get_sequence_length(identifier): - """return length of given sequence identifier""" - return dp.get_metadata(identifier)["length"] - def translate_sequence_identifier(identifier, namespace): - """return for given identifier, return *list* of equivalent identifiers in given namespace""" - return dp.translate_sequence_identifier(identifier, namespace) + from ga4gh.vrs.dataproxy import create_dataproxy + seqrepo_rest_service_url = "seqrepo+https://services.genomicmedlab.org/seqrepo" + seqrepo_dataproxy = create_dataproxy(uri=seqrepo_rest_service_url) + +To get the RefGet accession from a public accession identifier: .. code:: ipython3 - get_sequence_length("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl") - 58617616 + seqrepo_dataproxy.derive_refget_accession("refseq:NM_002439.5") + 'SQ.Pw3Ch0x3XWD6ljsnIfmk_NERcZCI9sNM' + +To get sequence length, aliases, and other optional information for a given identifier: .. 
code:: ipython3 - start, end = 44908821-25, 44908822+25 - get_sequence("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", start, end) - 'CCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGTACCAGGCCGGGGC' + seqrepo_dataproxy.get_metadata("refseq:NM_000551.3") + {'added': '2016-08-24T05:03:11Z', + 'aliases': ['MD5:215137b1973c1a5afcf86be7d999574a', + 'NCBI:NM_000551.3', + 'refseq:NM_000551.3', + 'SEGUID:T12L0p2X5E8DbnL0+SwI4Wc1S6g', + 'SHA1:4f5d8bd29d97e44f036e72f4f92c08e167354ba8', + 'VMC:GS_v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_', + 'sha512t24u:v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_', + 'ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_'], + 'alphabet': 'ACGT', + 'length': 4560} + +To get the specified sequence or subsequence: .. code:: ipython3 - translate_sequence_identifier("GRCh38:19", "ga4gh") + identifier = "ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_" + seqrepo_dataproxy.get_sequence(identifier, start=0, end=51) + 'CCTCGCCTCCGTTACAACGGCCTACGGTGCTGGAGGATCCTTCTGCGCACG' + +To translate an identifier to a list of identifiers in the ga4gh namespace: + +.. code:: ipython3 + + seqrepo_dataproxy.translate_sequence_identifier("GRCh38:19", "ga4gh") ['ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'] +To translate an identifier to a list of identifiers in the GRCh38 namespace: + .. code:: ipython3 - translate_sequence_identifier("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", "GRCh38") + seqrepo_dataproxy.translate_sequence_identifier("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", "GRCh38") ['GRCh38:19', 'GRCh38:chr19'] diff --git a/docs/source/index.rst b/docs/source/index.rst index 363c0f6e..c616d6fa 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -3,7 +3,7 @@ GA4GH Variation Representation Specification The Variation Representation Specification (VRS, pronounced "verse") is a standard developed by the Global Alliance for Genomics and Health (GA4GH) to facilitate and -improve sharing of genetic information. The Specification consists of +improve sharing of genetic information. 
The Specification consists of a JSON Schema for representing many classes of genetic variation, conventions to maximize the utility of the schema, and a Python implementation that promotes adoption of the standard. @@ -13,7 +13,7 @@ implementation that promotes adoption of the standard. **The GA4GH Variation Representation Specification (VRS): a computational framework for variation representation and federated identification**. - Wagner AH, Babb L, Alterovitz G, Baudis M, Brush M, Cameron DL, ..., Hart RK. + Wagner AH, Babb L, Alterovitz G, Baudis M, Brush M, Cameron DL, ..., Hart RK. *Cell Genomics*. Volume 1 (2021). `doi:10.1016/j.xgen.2021.100027 `__ .. toctree:: diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst index fe1da100..1915eb16 100644 --- a/docs/source/introduction.rst +++ b/docs/source/introduction.rst @@ -22,7 +22,7 @@ Here we document the primary contributions of this specification for variation r OpenAPI, and GraphQL). The schema repository includes language-agnostic tests for ensuring schema compliance in downstream implementations. * **Conventions that promote reliable data sharing.** VRS recommends conventions regarding - the use of the schema and that facilitate data sharing. For example, VRS recommends + the use of the schema and that facilitate data sharing. For example, VRS recommends using fully justified allele normalization using an algorithm extending `NCBI's SPDI model `__. * **Globally unique computed identifiers.** This specification also recommends a specific algorithm @@ -32,7 +32,7 @@ Here we document the primary contributions of this specification for variation r * **A Python implementation.** We provide a Python package (`vrs-python `__) that demonstrates the above schema and algorithms, and supports translation of existing variant representation schemes into VRS for use in genomic data - sharing. It may be used as the basis for development in Python, but it is not required in order + sharing. 
It may be used as the basis for development in Python, but it is not required in order to use VRS. The machine readable schema definitions and example code are available online at the VRS diff --git a/docs/source/releases/1.1.rst b/docs/source/releases/1.1.rst index ccc5eb27..47757763 100644 --- a/docs/source/releases/1.1.rst +++ b/docs/source/releases/1.1.rst @@ -33,7 +33,7 @@ This patch version makes the following corrections and clarifications: New classes ########### - * ChromosomeLocation: A region of a chromosomed specified by species + * ChromosomeLocation: A region of a chromosome specified by species and name using cytogenetic naming conventions * CytobandInterval: A contiguous region specified by chromosomal bands features. * Haplotype: A set of zero or more Alleles. @@ -44,7 +44,7 @@ Other data model changes * Interval was renamed to SequenceInterval. Interval was an internal class that was never instantiated, so this change should not be - visiable to users. + visible to users. Documentation changes ##################### diff --git a/docs/source/releases/1.2.rst b/docs/source/releases/1.2.rst index 7b6e5e7a..2dc55ba3 100644 --- a/docs/source/releases/1.2.rst +++ b/docs/source/releases/1.2.rst @@ -33,8 +33,8 @@ Major Changes for certain technical operations * New :ref:`SequenceExpressions ` subclasses - replace SequenceState. Subtypes are: - + replace SequenceState. Subtypes are: + * :ref:`DerivedSequenceExpression`, which representations sequence notionally derived from a SequenceLocation * :ref:`RepeatedSequenceExpression`, which represents contiguous @@ -46,7 +46,7 @@ Major Changes copies of a molecule within a genome, and can be used to express concepts such as amplification and copy loss. * :ref:`Gene` enables reference to an external definition of a gene, - particularly for useas a subject of copy number expressions. + particularly for use as a subject of copy number expressions. 
* :ref:`DefiniteRange` and :ref:`IndefiniteRange` represent bounded and half-bounded ranges respectively. A new :ref:`Number` type wraps integers so that some attributes may assume values of any of @@ -57,5 +57,5 @@ Minor Changes ############# * Sequence strings are now formally defined by a :ref:`Sequence` - type, which is fundamentally also a string. This change aids + type, which is fundamentally also a string. This change aids documentation but has no technical impact. diff --git a/docs/source/releases/index.rst b/docs/source/releases/index.rst index 0f0af271..73c5bd57 100644 --- a/docs/source/releases/index.rst +++ b/docs/source/releases/index.rst @@ -1,7 +1,7 @@ Releases !!!!!!!! -.. note:: VRS follows `Semantic Versioning 2.0 `_. For a version +.. note:: VRS follows `Semantic Versioning 2.0 `_. For a version number MAJOR.MINOR.PATCH: * MAJOR version is incremented for incompatible API changes. @@ -10,7 +10,7 @@ Releases new types of variation or extend existing types. * PATCH version is incremented for bug fixes. For VRS, examples are clarifications of documentation and bug fixes on property - constraints. No changes to information models will occur in + constraints. No changes to information models will occur in PATCH releases. All planned work The `VRS Roadmap diff --git a/docs/source/schema.rst b/docs/source/schema.rst index d56ec9dc..64ec1279 100644 --- a/docs/source/schema.rst +++ b/docs/source/schema.rst @@ -20,7 +20,7 @@ Overview GA4GH Sequence strings (not shown). While all VRS objects are Value Objects, only some objects are intended to be identifiable (Variation, Location, and Sequence). Conceptual inheritance relationships between - classes is indicated by connecting lines. [`source + classes is indicated by connecting lines. [`source `__] @@ -35,7 +35,7 @@ The schema itself is written in YAML (|vrs_yaml|) and converted to JSON (|vrs_json|). Contributions to the schema MUST be written in the YAML document. - + .. 
|vrs_json| replace:: :download:`vrs.json <_static/vrs.json>` diff --git a/docs/source/style.rst b/docs/source/style.rst index f1b9421b..bad6562b 100644 --- a/docs/source/style.rst +++ b/docs/source/style.rst @@ -1,6 +1,6 @@ :orphan: -This page shows style conventions used in these docs. It will be +This page shows style conventions used in these docs. It will be built with other pages and you can view it (at docs/build/html/style.html). Unfortunately, because it's intentionally not included in the published docs, it generates the annoying warning @@ -37,7 +37,7 @@ For example:: To aid comprehension, the Text Styles section source looks like this:: .. _text-styles-target: - + Text Styles !!!!!!!!!!! @@ -79,4 +79,3 @@ A cheat sheet for making references, links, and literals in sphinx. * ````literal```` renders as ``literal`` * e.g., The :ref:`Allele` *type* attribute must be set to ``"Allele"``. - diff --git a/docs/source/terms_and_model.rst b/docs/source/terms_and_model.rst index 4509aec2..f2b643bc 100644 --- a/docs/source/terms_and_model.rst +++ b/docs/source/terms_and_model.rst @@ -51,26 +51,26 @@ Information Model Principles * **VRS objects are minimal** `value objects `_. Two objects are considered equal if and only if their respective attributes are - equal. As value objects, VRS objects are used as primitive types + equal. As value objects, VRS objects are used as primitive types and MUST NOT be used as containers for related data, such as primary database accessions, representations in particular formats, or links - to external data. Instead, related data should be associated with - VRS objects through identifiers. See :ref:`computed-identifiers`. + to external data. Instead, related data should be associated with + VRS objects through identifiers. See :ref:`computed-identifiers`. * **VRS uses polymorphism.** VRS uses polymorphism extensively in order to provide a coherent top-down structure for variation while - enabling precise models for variation data. 
For example, Allele is + enabling precise models for variation data. For example, Allele is a kind of Variation, SequenceLocation is a kind of Location, and - SequenceState is a kind of State. See :ref:`future-plans` for the - roadmap of VRS data classes and relationships. All VRS objects + SequenceState is a kind of State. See :ref:`future-plans` for the + roadmap of VRS data classes and relationships. All VRS objects contain a ``type`` attribute, which is used to discriminate polymorphic objects. * **Error handling is intentionally unspecified and delegated to implementation.** VRS provides foundational data types that - enable significant flexibility. Except where required by this + enable significant flexibility. Except where required by this specification, implementations may choose whether and how to - validate data. For example, implementations MAY choose to validate + validate data. For example, implementations MAY choose to validate that particular combinations of objects are compatible, but such validation is not required. @@ -79,4 +79,3 @@ Information Model Principles compound words.** Although the schema is currently JSON-based (which would typically use camelCase), VRS itself is intended to be neutral with respect to languages and database. -
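Note on the ``ga4gh_identifiers.rst`` changes above: the text describes the digest function as ``base64url( sha512( blob )[:24] )`` and the identifier as a ``ga4gh`` CURIE built from a type prefix and that digest. The following is a minimal sketch of that description using only the Python standard library. The function names ``sha512t24u`` and ``ga4gh_identifier`` are chosen here for illustration and are not taken from the diff; producing the serialized blob itself (see :ref:`computed-identifiers`) is out of scope for the sketch.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return the first 24 bytes (192 bits) of the SHA-512 digest of
    ``blob``, base64url encoded. 24 bytes encode to exactly 32
    characters, so no '=' padding appears in the result."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def ga4gh_identifier(type_prefix: str, blob: bytes) -> str:
    """Illustrative helper (name assumed, not from the spec text):
    join the ga4gh namespace, a type prefix such as 'VA', and the
    truncated digest of an already-serialized blob."""
    return f"ga4gh:{type_prefix}.{sha512t24u(blob)}"


if __name__ == "__main__":
    # The serialized Allele blob quoted in the updated documentation.
    blob = (
        b'{"location":"wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz",'
        b'"state":{"sequence":"T","type":"LiteralSequenceExpression"},'
        b'"type":"Allele"}'
    )
    print(ga4gh_identifier("VA", blob))
```

Because the base64url alphabet contains only letters, digits, ``-`` and ``_``, the resulting identifier can be placed in a URI without escaping, which matches the statement in the updated text. The updated documentation gives ``ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt`` as the identifier for this blob; running the sketch is a quick way to compare against that value.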