Bump to 0.36.1; Added more reference sequence documentation

barricklab · Nov 13, 2021 · 4113e06 · 4113e06
1 parent 2d21fb8
commit 4113e06
Show file tree

Hide file tree

Showing 4 changed files with 106 additions and 15 deletions.
diff --git a/ChangeLog b/ChangeLog
@@ -1,7 +1,11 @@
 Version 0.36.1
+  * Added an appendix to the manual on how reference sequence files are loaded/used.
   * When loading GenBank files that have a LOCUS line and source feature with different
     lengths, the LOCUS line length is now used to improve compatibility with sequences
     that have been edited in programs that do not update the source feature.
+  * Added a warning when CDS features with lengths that are not a multiple of 3 are loaded
+    with a suggestion to change these to pseudogenes. Changed a fatal error when annotating a
+    mutation in an incomplete final codon in one of these CDS features to a warning and generic coding mutation.
   * breseq will no longer throw a fatal error if read files that have the same name
     but are located at different paths are input.
   * Fixed gdtools MERGE not preserving header information from first GD file.

diff --git a/configure.ac b/configure.ac
@@ -19,7 +19,7 @@
 #                                               -*- Autoconf -*-
 # Process this file with autoconf to produce a configure script.
 AC_PREREQ([2.65])
-AC_INIT([breseq], [0.36.0], [[email protected]], [breseq], [http://barricklab.org/breseq])
+AC_INIT([breseq], [0.36.1], [[email protected]], [breseq], [http://barricklab.org/breseq])
 AC_CONFIG_AUX_DIR(aux_build)
 AC_CONFIG_MACRO_DIR([aux_build/m4])
 AC_CONFIG_HEADERS([aux_build/config.h])

diff --git a/src/c/breseq/reference_sequence.cpp b/src/c/breseq/reference_sequence.cpp
@@ -947,7 +947,7 @@ namespace breseq {
         cGeneFeature gf =  cGeneFeature(*(itg->get()));
 
         // Only check CDS features that do not have indeterminate ends
-        if ((gf.type == "CDS") && (!gf.m_start_is_indeterminate) && (!gf.m_end_is_indeterminate) ) {
+        if ((gf.type == "CDS") && (!gf.pseudogene) && (!gf.m_start_is_indeterminate) && (!gf.m_end_is_indeterminate) ) {
           string nt_seq = gf.get_nucleotide_sequence(as);
           if (nt_seq.size() % 3 != 0) {
             invalid_CDS_names.push_back(gf.get_locus_tag() + " (" + gf.name + ")");
@@ -957,7 +957,7 @@ namespace breseq {
     }
 
     if (invalid_CDS_names.size()>0) {
-      WARN("CDS feature(s) found with nucleotide length that are not a multiple of 3:\n" + join(invalid_CDS_names, ", ") + "\n\nTranslations of mutations in these genes may be incorrect.\nIt is recommended that you fix these feature annotations in your reference file!");
+      WARN("CDS feature(s) found with nucleotide length(s) that are not a multiple of 3:\n" + join(invalid_CDS_names, ", ") + "\n\nTranslations of mutations in these genes may be incorrect.\nIt is recommended that you fix these feature annotations in your reference file!\nAnother solution is to mark them as pseudogenes:\n  GenBank: add as '/pseudo' as a new line within the CDS feature\n  GFF3: add 'Pseudo=true' to the semicolon-separated list at the end of the CDS line.");
     }
   }
 

diff --git a/src/doc/refseq_format.rst b/src/doc/refseq_format.rst
@@ -1,25 +1,112 @@
 Reference Sequence Formats
 =============================
 
-This appendix explains the details of how |breseq| handles different reference sequence formats. Most importantly, this includes how different types of feature annotations are used.
+This appendix explains the details of how |breseq| handles different reference sequence formats. Most importantly, this includes how different types of feature annotations are used to improve mutation predictions.
 
-Illegal Characters
---------------------
+Each reference sequence file (the ``-r`` option to |breseq| and many |gdtools| subcommands) can contain **sequence** information (the nucleotide sequences of chromosomes or plasmids) and/or **annotations** (the locations and identities of features such as genes on those DNA sequences).
 
-For all sequence formats:
+Three types of input files are accepted for reference sequences:
 
-#. In nucleotide sequences, all characters are converted to uppercase and all non [ATCG] characters are converted to [N].
-#. In gene names and locus tags, the characters [,;/|] are replaced with [_].
-#. In gene descriptions, the character [|] is replaced with [;].
+* `GenBank Format <https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html>`_ (sequences and/or annotations)
+* `GFF3 Format <http://gmod.org/wiki/GFF3>`_ (sequences and/or annotations)
+* `FASTA Format <https://www.ncbi.nlm.nih.gov/genbank/fastaformat/>`_ (sequences only)
 
+Each loaded sequence is assigned a ``SEQ_ID`` as explained below. Sequences and their annotations can be input in different files as long as the ``SEQ_ID`` matches between files.
 
-Feature Annotations
-----------------------------
+.. note::
+  During a run, |breseq| merges and converts all input reference sequences into one annotated reference file that is output as ``data/reference.gff3``. If you are having trouble interpreting how |breseq| is loading your reference files, you should examine this file.
+
+Sequences
+------------------------
+The header of each reference sequence and sometimes a special feature corresponding to the entire sequence are loaded to determine the **sequence id** (``SEQ_ID``). The **length** of the sequence is also provided in the header for some formats. Any length provided here will be checked against the actual nucleotide sequence that is loaded. Finally, the **topology** of the sequence can be set to linear (default) or circular as described below in certain formats.
+
+FASTA
+^^^^^^^^
+The ``SEQ_ID`` is assigned as the first word on the sequence description line (i.e. all characters before encountering whitespace). Any later descriptive information on this line is ignored. Example:
+
+``>SEQ_ID DESCRIPTON``
+
+FASTA format does not support setting the sequence to have circular topology.
+
+GenBank
+^^^^^^^^
+
+The ``SEQ_ID`` is assigned with this order of preference from the ``LOCUS`` > ``ACCESSION`` > ``VERSION`` lines. This behavior can be overriden with the ``--genbank-field-for-seq-id`` command-line option, which can have the values ``AUTOMATIC``, ``LOCUS``, ``VERSION``, ``ACCESSION``
+
+The length provided in the ``LOCUS`` line  is used. If the ``source`` feature annotation has a different length, then a warning is shown. The ``LOCUS`` line will contain either ``LINEAR`` or ``CIRCULAR`` which sets the sequence topology.
 
-|breseq| is able to more accurately predict the locations of transposon insertions if these elements are annotated in the reference genome. They must have a feature type of ``repeat_region`` or ``mobile_element`` to be recognized.
+GFF3
+^^^^^
 
+The line that begins with ``##sequence-region`` has this whitespace delimited format:
 
-Gene Annotations
+``##sequence-region SEQ_ID START END``
+
+The ``SEQ_ID`` is taken from the first item. Then, the length is determined as ``END - START + 1``.
+
+Sequences with a circular topology have the attribute ``Is_circular=true`` for the ``region`` feature that corresponds to the entire sequence .
+
+Annotations
 ----------------------------
 
-|breseq| recognizes the following types of features: ``CDS``, ``rRNA``, ``tRNA``, ``ncRNA``, and ``RNA``. Features marked only with a ``gene`` field are not used on their own because it cannot be determined whether they encode a protein or are noncoding. Any information they For ``CDS``
+GenBank and GFF3 format support providing a list of feature annotations, which are sequence locations having start, end, and strand attributes that together define the bases constituting the feature.
+
+Each feature may be composed of a list of one or more nucleotide segments which may be discontiguous (for example, in the case of an ORF defined by multiple exons on a spliced RNA). In some cases, the start or end position of a feature may be indeterminate (ambiguous) because the sequence fragment is truncated (for example, in a *de novo* assembly or draft genome sequence).
+
+Both GenBank and GFF3 files define a type for each sequence feature and then have various information stored as key/value pairs.
+
+The types that |breseq| recognizes are:
+
+* ``CDS`` (protein-coding sequence)
+* ``fCDS`` (fragmented CDS, for pseudogenes)
+* ``rRNA`` (ribosomal RNA)
+* ``tRNA`` (transfer RNA)
+* ``ncRNA`` (noncoding RNA)
+* ``RNA`` (generic RNA)
+* ``mobile_element`` (transposon or other mobile DNA element)
+* ``repeat_region`` (transposon or other mobile DNA element)
+
+Features marked with a type that is only ``gene`` are not used on their own because it cannot be determined whether they encode a protein or are noncoding. If another identical feature of the other type exists, auxiliary information is loaded from the corresponding ```gene`` feature.
+
+|breseq| will annotate the effects of mutations differently in features that are marked as pseudogenes rather than normal coding sequence (CDS) features. Pseudogenes can be marked as described below in each format. If a CDS is encountered that does not have a length that is a multiple of three, |breseq| fill print a warning that suggests adding the pseudo tag to that feature.
+
+Internally, |breseq| tries to load three pieces of information describing each feature: ``name``, ``accession`` (like a unique ``locus_tag``), and ``description``.
+
+|breseq| is able to more accurately predict the locations of **transposon insertions** if these elements are annotated in the reference genome. They must have a feature type of ``repeat_region`` or ``mobile_element`` to be recognized. The ends of these features should correspond to the entire unit that is inserted when the DNA "moves" (e.g., encompassing the inverted repeats on the end of an IS element and everything between them, not just the transposase gene). If there are multiple copies of an element in the genome, then all of them should have the exact same name (correct: ``IS150`` and ``IS150``; incorrect: ``IS150a`` and ``IS150b``). This is important for letting |breseq| match up junction evidence from different copies.
+
+GenBank
+^^^^^^^^
+
+Genbank files can name features using many different tags. |breseq| uses this order of preference in deciding on the main name to use for a gene:
+
+The ``name`` for a feature is determined by |breseq| by checking in this order for ``/name=``, ``/locus_tag=``, ``/label=``, and then ``/note=`` tags.
+
+The ``accession`` is loaded from the ``/locus_tag=`` tag. (It may end up being the same as the ``name``.)
+
+The ``product`` for a feature is assigned from the ``/product=`` tag if it exists, and then from the ``/note=`` tag as a backup.
+
+Complex positions and indeterminate start/end positions are described in the line that gives the location of each feature according to the Genbank format specification.
+
+Pseudogenes are CDS features marked by adding a line that consistes solely of the ``/pseudo`` tag.
+
+GFF3
+^^^^^
+
+The ``name`` for a feature is determined by |breseq| by checking in this order for ``Name=``, ``gene=``, ``accession=`` attributes.
+
+The ``accession`` is loaded in order of preference from the first attribute that exists from ``accession=``, ``locus_tag=``, ``ID=`` or ``Alias=``.
+
+The ``product`` for a feature is assigned from the ``product=`` attribute if it exists, and then from the ``note=`` attribute as a backup.
+
+If multiple feature lines have identical accessions and types, then the locations from each one are concatenated together in one feature. This is how you represent a programmed frameshift or exons in a spliced gene, for example. Indeterminate (ambiguous) start/end coordinates for a segment are specified by adding an ``indeterminate_coordinate=start`` or ``indeterminate_coordinate=end`` as an attribute to the semicolon-delimited list on the line for a location.
+
+Pseudogenes are marked by adding ``Pseudo=true`` to the semicolon-delimited list of attributes at the end of the feature line line. Additionally, pseudogenes are reassigned a different feature type of ``fCDS``.
+
+Illegal Characters
+--------------------
+
+For all sequence formats:
+
+#. In nucleotide sequences, all characters are converted to uppercase and all non [``ATCG``] characters are converted to [``N``].
+#. In gene names and locus tags, the characters [``,;/\|``] are replaced with [``_``].
+#. In gene descriptions, the character [``|``] is replaced with [``;``].