SEP | 033 |
---|---|
Title | Concrete descriptions of biopolymers |
Authors | Jonathan Karr |
Editor | Vishwesh Kulkarni |
Type | Data Model |
Status | Draft |
Created | 20-Jun-2019 |
Last modified | 15-Jul-2019 |
Issue | #77 |
To facilitate the understanding, reuse, and composition of genetic designs, SBOL needs to concretely capture the structure of each part. For example, SBOL should be able to capture the structures of proteins produced from expanded, non-canonical genetic codes. This requires the ability to represent non-canonical nucleic and amino acids that are synthesized via epigenetic, post-transcriptional, and post-translational modification; non-canonical amino acids that are incorporated via modified translational systems; and other covalent bonds such as crosslinks in DNA and disulfide bonds. Currently, SBOL does not have a clear mechanism to capture this information.
This SEP proposes to add BpForms
as an optional recommended sequence encoding for capturing such chemical information. Specifically, this SEP proposes to add BpForms
to the recommended sequence encodings in Table 1 of the SBOL specification.
- 1. Rationale
- 2. Specification
- 2.1 2.1
BpForms
syntax - 2.2
BpForms
alphabets - 2.3 Embedding
BpForms
into SBOL
- 2.1 2.1
- 3. Examples
- 3.1 Modified RNA: Bacillus subtilis tRNAILE 69
- 3.2 Modified Protein: Lipoate ligated Bacillus subtilis PdhC
- 4. Backwards Compatibility
- 5. Interpreting
BpForms
: VerifyingBpForms
and calculating their properties - 6. Contributing to
BpForms
- 7. Discussion
- 7.1 Verbosity of the
BpForms
syntax - 7.2 Lack of support for the PSI-MI ontologies of monomeric forms of proteins
- 7.3 Integrating the
BpForms
software into libSBOL - 7.4 Concretely representing macromolecular complexes
- 7.1 Verbosity of the
- References
- Copyright
To facilitate the understanding, reuse, and composition of genetic designs, SBOL needs to concretely capture the structure of each part. This must include non-canonical nucleic and amino acids that are synthesized via epigenetic, post-transcriptional, and post-translational modification, as well as non-canonical amino acids incorporated via modified translational systems. SBOL should also capture other covalent bonds such as crosslinks in DNA and disulfide bonds.
This information is needed for several reasons:
- This information can be used to identify similarities and differences between parts. For example, this information can help disambiguate between the multiple phosphorylated forms of ERK.
- This information can help identify potential interactions between synthetic parts. For example, interactions between multiple phosphorylated forms of ERK and MEK.
- This information can help identify the dependencies of each part that must be included when a part is introduced into another system. For example, PdhC of pyruvate dehydogenase requires a lipoate ligase. This dependence information is conveyed by capturing that PhdC includes a lipoylysine.
This SEP proposes to use the BpForms
string format to concretely capture the primary structure of each DNA, RNA, and protein. BpForms
is an enhancement over the IUPAC and IUBMB encodings which can describe non-canonical bases (both inline and via alphabets) and crosslinks between non-adjacent monomeric forms. BpForms
includes three consensus alphabets of hundreds of non-canonical monomeric forms of DNA, RNA, and proteins. Detailed information about the BpForms
syntax and alphabets is available at https://www.bpforms.org. This includes a formal definition of the BpForms
grammar.
Specifically, this SEP proposes to add BpForms
as an optional recommended sequence encoding to Table 1 of the SBOL specification. This would help SBOL support users that require more chemical information and precision.
Although BpForms
can capture any sequence that can be encoded into IUPAC/IUBMB, at this time, we do not propose to replace the IUPAC/IUBMB encodings with BpForms
because BpForms
is, at least not yet, a standard.
BpForms
uses an IUPAC/IUBMB-like syntax and alphabets to succinctly describe DNA, RNA, and proteins with non-canonical bases. See the BpForms
documentation and grammar for detailed information and examples.
- Monomeric forms with single-letter codes are indicated by their codes, e.g.
ACTGGA
- Monomeric forms with multi-letter codes are indicated by their codes enclosed in curly brackets, e.g.
ACT{m2G}GA
- Monomeric forms which are not already in the alphabets can be described "inline" in square brackets. These inline monomeric forms can also be used to capture uncertainty about the mass, charge, and position of non-canonical monomeric forms.
ACT[id: "AA0305" | name: "N5-methyl-L-arginine" | structure: "O=C[C@H](CCCN(C(=[NH2])N)C)[NH3+]" | backbone-bond-atom: C2 | backbone-displaced-atom: H2 | right-bond-atom: C2 | left-bond-atom: N15-1 | left-displaced-atom: H15+1 | left-displaced-atom: H15+1 ]A
- Crosslinks can be described using the
crosslink
attributeCAC | crosslink: [ left-bond-atom: 1S1 | left-displaced-atom: 1H1 | right-bond-atom: 1S1 | right-displaced-atom: 1H1 ]
BpForms
uses the same convention as IUPAC/IUBMB to assign unique location numbers to monomeric forms. The monomeric forms are numbered sequentially, starting from 1 at the left side. Following the grammar, the location numbers increase with each curly open bracket ({
), square open bracket ([
), or single character with represents a monomeric form.
BpForms
uses the same convention as SMILES to assign unique location numbers to atoms with monomeric forms. The atoms are numbered sequentially, starting from 1 at the left side of the SMILES encoding, and increase with each uppercase character which represents an atom.
BpForms
includes three consensus alphabets of hundreds of non-canonical monomeric forms. These alphabets were built by quality controling and merging the best existing DNA, RNA, and protein alphabets. See the BpForms
website for detailed information about the alphabets and the monomeric forms in each alphabet.
- DNA nucleobases: The four canonical DNA nucleobases, plus the non-canonical DNA nucleobases in DNAmod
- RNA nucleosides: The four canonical RNA nucleosides, plus the non-canonical RNA nucleosides in MODOMICS and the RNA Modification Database
- Protein residues: The 20 canonical protein residues, plus the non-canonical protein residues in the PDB Chemical Component Dictionary and RESID
BpForms
strings can be embedded within the elements
attribute of Sequence
objects. The following URIs should be used to indicate the encodings for the sequences of DNA, RNA, and protein parts.
- DNA: http://edamontology.org/format_3909#dna
- RNA: http://edamontology.org/format_3909#rna
- Protein: http://edamontology.org/format_3909#protein
The following example illustrates how to use BpForms
to describe the modifications in Bacillus subtilis tRNAILE 69 (SynBioHub: BO_28687). The complete SBOL file is available here.
...
<sbol:Sequence>
<sbol:elements>
GGGCCUGUAGCUCAGC{8U}GG{8U}{8U}AGAGCGCACGCCUGAU{62A}AGCGUGAG{7G}UCGAUGG{5U}{9U}CGAGUCCAUUCAGGCCCACCA
</sbol:elements>
<sbol:encoding rdf:resource="http://edamontology.org/format_3909#rna"/>
</sbol:Sequence>
...
The following example illustrates how to use BpForms
to describe the lipoate ligation of the acetyltransferase component PdhC of B. subtilis pyruvate dehydrogenase complex (SynBioHub: BO_32431). The complete SBOL file is available here.
...
<sbol:Sequence>
<sbol:elements>
MAFEFKLPDIGEGIHEGEIVKWFVKPNDEVDEDDVLAEVQND{AA0118}AVVEIPSPVKGKVLELKVEEGTVATVGQTIITFDAPGYEDLQFKGSDE
SDDAKTEAQVQSTAEAGQDVAKEEQAQEPAKATGAGQQDQAEVDPNKRVIAMPSVRKYAREKGVDIRKVTGSGNNGRVVKEDIDSFVNGGAQEAAPQE
TAAPQETAAKPAAAPAPEGEFPETREKMSGIRKAIAKAMVNSKHTAPHVTLMDEVDVTNLVAHRKQFKQVAADQGIKLTYLPYVVKALTSALKKFPVL
NTSIDDKTDEVIQKHYFNIGIAADTEKGLLVPVVKNADRKSVFEISDEINGLATKAREGKLAPAEMKGASCTITNIGSAGGQWFTPVINHPEVAILGI
GRIAEKAIVRDGEIVAAPVLALSLSFDHRMIDGATAQNALNHIKRLLNDPQLILMEA
</sbol:elements>
<sbol:encoding rdf:resource="http://edamontology.org/format_3909#protein"/>
</sbol:Sequence>
...
Backward can be achieved in multiple ways:
- Parts can have multiple instances of
Sequence,
one withBpForms
and a second with the IUPAC or IUBMB encoding - The
BpForms
software can generate IUPAC and IUBMB strings fromBpForms
-encoded strings. This uses information about the parent of each monomeric form (e.g., the parent of selenocysteine is cysteine). BpForms
is also backward compatible with the MODOMICS notation for modified RNA and ProForma notation for modified proteins.
There are several tools for verifying BpForms
and calculating properties such as their structure, formula, molecular weights, and charges. See the BpForms
website for more information.
- Web validation form
- JSON REST API
- Command line program
- Python API
Contributions to the alphabets and software are welcome via Git pull requests.
BpForms
intentionally uses an IUPAC/IUBMB-like syntax, which is verbose for several reasons
- This provides maximal backward compatibility with the IUPAC and IUBMB encodings.
- This enables
BpForms
to directly represent the structures of molecules distinct from the reactions which generate them. This avoids circular definitions of parts, which gives the opportunity to verify parts definitions against biosynthesis reactions. - This enables
BpForms
to represent any arbitrary part, including parts which are not based on naturally occurring genes. - This is compatible with the FASTA file format and programs for reading and writing FASTA files.
- This format is also conducive to transcriptomics, proteomics, and systems biology research. In particular, this format is consistent with the MODOMICS syntax for modified RNA and the ProForma syntax for modified proteins. We anticipate that this will encourage broader adoption and compatibility across these fields.
PSI-MI is the "standard" ontology for monomeric forms of proteins. However, because PSI-MI lacks concrete definitions of monomeric forms, we cannot integrate PSI-MI with our alphabet, the PDB-CCD, or RESID. To use PSI-MI, we would need to annotate the structure of each term and then link identical structures between the BpForms
alphabet and PSI-MI.
To maximize the utility of BpForms,
libSBOL should be expanded to verify BpForms
-encoded sequences and address positions within BpForms
-encoded sequences. Verification of BpForms
-encoded sequences could be implemented using the BpForms grammar. The grammar could also be used to parse the BpForms
into a list, which could easily be used to address positions. Alternatively, positions within BpForms
could be addressed with a simple function, since the positions increase monotonically from left to right. This would not require using the BpForms
grammer.
Going forward, we recommend that the community discuss the desired level of integration with libSBOL because libSBOL currently only has minimal capabilities to interpret SMILES, IUPAC, and IUBMB-encoded sequences.
To concretely describe every part, we plan to develop another format for concretely describing protein complexes, including the stoichiometry of each subunit and crosslinks between subunits.
Going forward, we think it would be useful to integrate BpForms
more directly into SBOL/libSBOL. This could potentially be achieved by expanding the RDF data model to capture non-canonical covalent bonding, or potentiialy by incorporating the BpForms
software into libSBOL. However, expanding the RDF data model would lead to verbose files. We are not proposing this at this time partly to follow the existing separation between SBOL and the currently recommended encodings (SMILES, IUPAC, IUBMB). Going forward, we recommend that the community discuss this issue after users have had time to develop use cases on top of BpForms
and reflect on their needs.
[1]: Lang PF, Chebaro Y & Jonathan R. Karr. BpForms: a toolkit for concretely describing modified DNA, RNA, and proteins. arXiv:1903.10042. https://www.bpforms.org.
To the extent possible under law,
SBOL developers
has waived all copyright and related or neighboring rights to
SEP XXX.
This work is published from:
United States.