SEP 033 -- Concrete descriptions of non-canonical DNA, RNA, and proteins

SEP	033
Title	Concrete descriptions of biopolymers
Authors	Jonathan Karr
Editor	Vishwesh Kulkarni
Type	Data Model
Status	Draft
Created	20-Jun-2019
Last modified	15-Jul-2019
Issue	#77

Abstract

To facilitate the understanding, reuse, and composition of genetic designs, SBOL needs to concretely capture the structure of each part. For example, SBOL should be able to capture the structures of proteins produced from expanded, non-canonical genetic codes. This requires the ability to represent non-canonical nucleic and amino acids that are synthesized via epigenetic, post-transcriptional, and post-translational modification; non-canonical amino acids that are incorporated via modified translational systems; and other covalent bonds such as crosslinks in DNA and disulfide bonds. Currently, SBOL does not have a clear mechanism to capture this information.

This SEP proposes to add BpForms as an optional recommended sequence encoding for capturing such chemical information. Specifically, this SEP proposes to add BpForms to the recommended sequence encodings in Table 1 of the SBOL specification.

1. Rationale
2. Specification
- 2.1 2.1 BpForms syntax
- 2.2 BpForms alphabets
- 2.3 Embedding BpForms into SBOL
3. Examples
- 3.1 Modified RNA: Bacillus subtilis tRNA^ILE 69
- 3.2 Modified Protein: Lipoate ligated Bacillus subtilis PdhC
4. Backwards Compatibility
5. Interpreting BpForms: Verifying BpForms and calculating their properties
6. Contributing to BpForms
7. Discussion
- 7.1 Verbosity of the BpForms syntax
- 7.2 Lack of support for the PSI-MI ontologies of monomeric forms of proteins
- 7.3 Integrating the BpForms software into libSBOL
- 7.4 Concretely representing macromolecular complexes
References
Copyright

1. Rationale

To facilitate the understanding, reuse, and composition of genetic designs, SBOL needs to concretely capture the structure of each part. This must include non-canonical nucleic and amino acids that are synthesized via epigenetic, post-transcriptional, and post-translational modification, as well as non-canonical amino acids incorporated via modified translational systems. SBOL should also capture other covalent bonds such as crosslinks in DNA and disulfide bonds.

This information is needed for several reasons:

This information can be used to identify similarities and differences between parts. For example, this information can help disambiguate between the multiple phosphorylated forms of ERK.
This information can help identify potential interactions between synthetic parts. For example, interactions between multiple phosphorylated forms of ERK and MEK.
This information can help identify the dependencies of each part that must be included when a part is introduced into another system. For example, PdhC of pyruvate dehydogenase requires a lipoate ligase. This dependence information is conveyed by capturing that PhdC includes a lipoylysine.

2. Specification

This SEP proposes to use the BpForms string format to concretely capture the primary structure of each DNA, RNA, and protein. BpForms is an enhancement over the IUPAC and IUBMB encodings which can describe non-canonical bases (both inline and via alphabets) and crosslinks between non-adjacent monomeric forms. BpForms includes three consensus alphabets of hundreds of non-canonical monomeric forms of DNA, RNA, and proteins. Detailed information about the BpForms syntax and alphabets is available at https://www.bpforms.org. This includes a formal definition of the BpForms grammar.

Specifically, this SEP proposes to add BpForms as an optional recommended sequence encoding to Table 1 of the SBOL specification. This would help SBOL support users that require more chemical information and precision.

Although BpForms can capture any sequence that can be encoded into IUPAC/IUBMB, at this time, we do not propose to replace the IUPAC/IUBMB encodings with BpForms because BpForms is, at least not yet, a standard.

2.1 `BpForms` syntax

BpForms uses an IUPAC/IUBMB-like syntax and alphabets to succinctly describe DNA, RNA, and proteins with non-canonical bases. See the BpForms documentation and grammar for detailed information and examples.

Monomeric forms with single-letter codes are indicated by their codes, e.g.
```
ACTGGA
```
Monomeric forms with multi-letter codes are indicated by their codes enclosed in curly brackets, e.g.
```
ACT{m2G}GA
```

Monomeric forms which are not already in the alphabets can be described "inline" in square brackets. These inline monomeric forms can also be used to capture uncertainty about the mass, charge, and position of non-canonical monomeric forms.

ACT[id: "AA0305"
  | name: "N5-methyl-L-arginine"
  | structure: "O=C[C@H](CCCN(C(=[NH2])N)C)[NH3+]"
  | backbone-bond-atom: C2
  | backbone-displaced-atom: H2
  | right-bond-atom: C2
  | left-bond-atom: N15-1
  | left-displaced-atom: H15+1
  | left-displaced-atom: H15+1
  ]A

Crosslinks can be described using the crosslink attribute

CAC | crosslink: [
  left-bond-atom: 1S1 |
  left-displaced-atom: 1H1 |
  right-bond-atom: 1S1 |
  right-displaced-atom: 1H1
  ]

2.1.1 Referencing positions within `BpForms`-encoded sequences

BpForms uses the same convention as IUPAC/IUBMB to assign unique location numbers to monomeric forms. The monomeric forms are numbered sequentially, starting from 1 at the left side. Following the grammar, the location numbers increase with each curly open bracket ({), square open bracket ([), or single character with represents a monomeric form.

BpForms uses the same convention as SMILES to assign unique location numbers to atoms with monomeric forms. The atoms are numbered sequentially, starting from 1 at the left side of the SMILES encoding, and increase with each uppercase character which represents an atom.

2.2 `BpForms` alphabets

BpForms includes three consensus alphabets of hundreds of non-canonical monomeric forms. These alphabets were built by quality controling and merging the best existing DNA, RNA, and protein alphabets. See the BpForms website for detailed information about the alphabets and the monomeric forms in each alphabet.

DNA nucleobases: The four canonical DNA nucleobases, plus the non-canonical DNA nucleobases in DNAmod
RNA nucleosides: The four canonical RNA nucleosides, plus the non-canonical RNA nucleosides in MODOMICS and the RNA Modification Database
Protein residues: The 20 canonical protein residues, plus the non-canonical protein residues in the PDB Chemical Component Dictionary and RESID

2.3 Embedding `BpForms` into SBOL

BpForms strings can be embedded within the elements attribute of Sequence objects. The following URIs should be used to indicate the encodings for the sequences of DNA, RNA, and protein parts.

DNA: http://edamontology.org/format_3909#dna
RNA: http://edamontology.org/format_3909#rna
Protein: http://edamontology.org/format_3909#protein

3. Examples

3.1 Modified RNA: Bacillus subtilis tRNA^ILE 69

The following example illustrates how to use BpForms to describe the modifications in Bacillus subtilis tRNA^ILE 69 (SynBioHub: BO_28687). The complete SBOL file is available here.

...
  <sbol:Sequence>
    <sbol:elements>
      GGGCCUGUAGCUCAGC{8U}GG{8U}{8U}AGAGCGCACGCCUGAU{62A}AGCGUGAG{7G}UCGAUGG{5U}{9U}CGAGUCCAUUCAGGCCCACCA
    </sbol:elements>
    <sbol:encoding rdf:resource="http://edamontology.org/format_3909#rna"/>
  </sbol:Sequence>
...

3.1 Modified Protein: Lipoate ligated Bacillus subtilis PdhC

The following example illustrates how to use BpForms to describe the lipoate ligation of the acetyltransferase component PdhC of B. subtilis pyruvate dehydrogenase complex (SynBioHub: BO_32431). The complete SBOL file is available here.

...
  <sbol:Sequence>
    <sbol:elements>
      MAFEFKLPDIGEGIHEGEIVKWFVKPNDEVDEDDVLAEVQND{AA0118}AVVEIPSPVKGKVLELKVEEGTVATVGQTIITFDAPGYEDLQFKGSDE
      SDDAKTEAQVQSTAEAGQDVAKEEQAQEPAKATGAGQQDQAEVDPNKRVIAMPSVRKYAREKGVDIRKVTGSGNNGRVVKEDIDSFVNGGAQEAAPQE
      TAAPQETAAKPAAAPAPEGEFPETREKMSGIRKAIAKAMVNSKHTAPHVTLMDEVDVTNLVAHRKQFKQVAADQGIKLTYLPYVVKALTSALKKFPVL
      NTSIDDKTDEVIQKHYFNIGIAADTEKGLLVPVVKNADRKSVFEISDEINGLATKAREGKLAPAEMKGASCTITNIGSAGGQWFTPVINHPEVAILGI
      GRIAEKAIVRDGEIVAAPVLALSLSFDHRMIDGATAQNALNHIKRLLNDPQLILMEA
    </sbol:elements>
    <sbol:encoding rdf:resource="http://edamontology.org/format_3909#protein"/>
  </sbol:Sequence>
...

4. Backwards Compatibility

Backward can be achieved in multiple ways:

Parts can have multiple instances of Sequence, one with BpForms and a second with the IUPAC or IUBMB encoding
The BpForms software can generate IUPAC and IUBMB strings from BpForms-encoded strings. This uses information about the parent of each monomeric form (e.g., the parent of selenocysteine is cysteine).
BpForms is also backward compatible with the MODOMICS notation for modified RNA and ProForma notation for modified proteins.

5. Interpreting `BpForms`: Verifying `BpForms` and calculating their properties

There are several tools for verifying BpForms and calculating properties such as their structure, formula, molecular weights, and charges. See the BpForms website for more information.

Web validation form
JSON REST API
Command line program
Python API

6. Contributing to `BpForms`

Contributions to the alphabets and software are welcome via Git pull requests.

7. Discussion

7.1 Verbosity of the `BpForms` syntax

BpForms intentionally uses an IUPAC/IUBMB-like syntax, which is verbose for several reasons

This provides maximal backward compatibility with the IUPAC and IUBMB encodings.
This enables BpForms to directly represent the structures of molecules distinct from the reactions which generate them. This avoids circular definitions of parts, which gives the opportunity to verify parts definitions against biosynthesis reactions.
This enables BpForms to represent any arbitrary part, including parts which are not based on naturally occurring genes.
This is compatible with the FASTA file format and programs for reading and writing FASTA files.
This format is also conducive to transcriptomics, proteomics, and systems biology research. In particular, this format is consistent with the MODOMICS syntax for modified RNA and the ProForma syntax for modified proteins. We anticipate that this will encourage broader adoption and compatibility across these fields.

7.2 Lack of support for the PSI-MI ontologies of monomeric forms of proteins

PSI-MI is the "standard" ontology for monomeric forms of proteins. However, because PSI-MI lacks concrete definitions of monomeric forms, we cannot integrate PSI-MI with our alphabet, the PDB-CCD, or RESID. To use PSI-MI, we would need to annotate the structure of each term and then link identical structures between the BpForms alphabet and PSI-MI.

7.3 Integrating the `BpForms` software into libSBOL

To maximize the utility of BpForms, libSBOL should be expanded to verify BpForms-encoded sequences and address positions within BpForms-encoded sequences. Verification of BpForms-encoded sequences could be implemented using the BpForms grammar. The grammar could also be used to parse the BpForms into a list, which could easily be used to address positions. Alternatively, positions within BpForms could be addressed with a simple function, since the positions increase monotonically from left to right. This would not require using the BpForms grammer.

Going forward, we recommend that the community discuss the desired level of integration with libSBOL because libSBOL currently only has minimal capabilities to interpret SMILES, IUPAC, and IUBMB-encoded sequences.

7.4 Concretely representing macromolecular complexes

To concretely describe every part, we plan to develop another format for concretely describing protein complexes, including the stoichiometry of each subunit and crosslinks between subunits.

7.5 Integrating `BpForms` into the RDF

Going forward, we think it would be useful to integrate BpForms more directly into SBOL/libSBOL. This could potentially be achieved by expanding the RDF data model to capture non-canonical covalent bonding, or potentiialy by incorporating the BpForms software into libSBOL. However, expanding the RDF data model would lead to verbose files. We are not proposing this at this time partly to follow the existing separation between SBOL and the currently recommended encodings (SMILES, IUPAC, IUBMB). Going forward, we recommend that the community discuss this issue after users have had time to develop use cases on top of BpForms and reflect on their needs.

References

[1]: Lang PF, Chebaro Y & Jonathan R. Karr. BpForms: a toolkit for concretely describing modified DNA, RNA, and proteins. arXiv:1903.10042. https://www.bpforms.org.

Copyright

To the extent possible under law, SBOL developers has waived all copyright and related or neighboring rights to SEP XXX. This work is published from: United States.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sep_033.md

sep_033.md

SEP 033 -- Concrete descriptions of non-canonical DNA, RNA, and proteins

Abstract

Table of Contents

1. Rationale

2. Specification

2.1 `BpForms` syntax

2.1.1 Referencing positions within `BpForms`-encoded sequences

2.2 `BpForms` alphabets

2.3 Embedding `BpForms` into SBOL

3. Examples

3.1 Modified RNA: Bacillus subtilis tRNA^ILE 69

3.1 Modified Protein: Lipoate ligated Bacillus subtilis PdhC

4. Backwards Compatibility

5. Interpreting `BpForms`: Verifying `BpForms` and calculating their properties

6. Contributing to `BpForms`

7. Discussion

7.1 Verbosity of the `BpForms` syntax

7.2 Lack of support for the PSI-MI ontologies of monomeric forms of proteins

7.3 Integrating the `BpForms` software into libSBOL

7.4 Concretely representing macromolecular complexes

7.5 Integrating `BpForms` into the RDF

References

Copyright

Files

sep_033.md

Latest commit

History

sep_033.md

File metadata and controls

SEP 033 -- Concrete descriptions of non-canonical DNA, RNA, and proteins

Abstract

Table of Contents

1. Rationale

2. Specification

2.1 BpForms syntax

2.1.1 Referencing positions within BpForms-encoded sequences

2.2 BpForms alphabets

2.3 Embedding BpForms into SBOL

3. Examples

3.1 Modified RNA: Bacillus subtilis tRNAILE 69

3.1 Modified Protein: Lipoate ligated Bacillus subtilis PdhC

4. Backwards Compatibility

5. Interpreting BpForms: Verifying BpForms and calculating their properties

6. Contributing to BpForms

7. Discussion

7.1 Verbosity of the BpForms syntax

7.2 Lack of support for the PSI-MI ontologies of monomeric forms of proteins

7.3 Integrating the BpForms software into libSBOL

7.4 Concretely representing macromolecular complexes

7.5 Integrating BpForms into the RDF

References

Copyright

2.1 `BpForms` syntax

2.1.1 Referencing positions within `BpForms`-encoded sequences

2.2 `BpForms` alphabets

2.3 Embedding `BpForms` into SBOL

3.1 Modified RNA: Bacillus subtilis tRNA^ILE 69

5. Interpreting `BpForms`: Verifying `BpForms` and calculating their properties

6. Contributing to `BpForms`

7.1 Verbosity of the `BpForms` syntax

7.3 Integrating the `BpForms` software into libSBOL

7.5 Integrating `BpForms` into the RDF