support post-translational modifications in FASTA #37

pkienzle · 2020-12-18T16:20:29Z

I would like to use the NIST website (https://www.ncnr.nist.gov/resources/activation/) to calculate the SLD of a protein containing phosphoserine. I tried entering the amino acid sequence of the protein on the website and used “J” for phosphoserine; however, it didn’t recognize “J” as phosphoserine, because I didn’t see any phosphorus in the chemical composition of the sample.
So I was wondering if there is another way to include phosphoserine on the website.

According to Wikipedia, J is used in FASTA to represent either L or I,[1] so I average them 50:50.[2]

I see that there are a number of post-translational modifications that may occur,[3] but I don't know which formats can represent them. I can imagine extending FASTA with an optional lower case translation code after each sequence element. For example, phosphoserine could be Sp rather than S. This would be easy enough to parse, but I would rather not invent a new format if one already exists.
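A minimal sketch of how such an extended sequence could be tokenized, assuming each residue is one upper-case letter (or `-`) optionally followed by lower-case modification letters. The `tokenize` helper and the regular expression are hypothetical illustrations, not part of periodictable; only the `Sp` example comes from the proposal above.

```python
import re

# One residue code (upper-case letter or gap) followed by an optional
# lower-case modification suffix. This grammar is an assumption based on
# the "Sp" example; it is not an existing format.
TOKEN = re.compile(r"([A-Z\-])([a-z]*)")

def tokenize(sequence):
    """Split an extended FASTA string into (residue, modification) pairs."""
    # Same preprocessing as the existing parser: stop at '*', ignore spaces.
    sequence = sequence.split("*", 1)[0].replace(" ", "")
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {sequence[pos]!r} at position {pos}"
            )
        tokens.append((match.group(1), match.group(2)))
        pos = match.end()
    return tokens

print(tokenize("ASpG"))
```

Each pair could then be looked up in a residue table keyed on both the residue and its modification, which keeps plain FASTA input valid since an empty suffix means an unmodified residue.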

Once the format is defined, and the parser[4] updated, the residue table[5] will need to be extended with new codes, volumes, chemical formulae (including labile hydrogen and charge), and name.

[1] FASTA: https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation
[2] periodictable fasta 'J': https://github.com/pkienzle/periodictable/blob/master/periodictable/fasta.py#L351
[3] PTMs by residue: https://en.wikipedia.org/wiki/Posttranslational_modification#Common_PTMs_by_residue
[4] FASTA parser:

periodictable/periodictable/fasta.py

Lines 198 to 208 in 4fb8068

    
    def __init__(self, name, sequence, type='aa'):
        codes = CODE_TABLES[type]
        sequence = sequence.split('*', 1)[0]  # stop at first '*'
        sequence = sequence.replace(' ', '')  # ignore spaces
        parts = tuple(codes[c] for c in sequence)
        cell_volume = sum(p.cell_volume for p in parts)
        charge = sum(p.charge for p in parts)
        structure = []
        for p in parts:
            structure.extend(list(p.labile_formula.structure))
        formula = parse_formula(structure).hill

[5] residue table:

periodictable/periodictable/fasta.py

Lines 320 to 354 in 4fb8068

    
    AMINO_ACID_CODES = dict((
        #code, volume, formula,        name
        _("A",  91.5, "C3H4H[1]NO",    "alanine"),
        #B: D or N
        _("C", 105.6, "C3H3H[1]NOS",   "cysteine"),
        _("D", 124.5, "C4H3H[1]NO3-",  "aspartic acid"),
        _("E", 155.1, "C5H5H[1]NO3-",  "glutamic acid"),
        _("F", 203.4, "C9H8H[1]NO",    "phenylalanine"),
        _("G",  66.4, "C2H2H[1]NO",    "glycine"),
        _("H", 167.3, "C6H5H[1]3N3O+", "histidine"),
        _("I", 168.8, "C6H10H[1]NO",   "isoleucine"),
        #J: L or I
        _("K", 171.3, "C6H9H[1]4N2O+", "lysine"),
        _("L", 168.8, "C6H10H[1]NO",   "leucine"),
        _("M", 170.8, "C5H8H[1]NOS",   "methionine"),
        _("N", 135.2, "C4H3H[1]3N2O2", "asparagine"),
        #O: _("O", ???.?, "C12H21N3O3", "pyrrolysine") -- update X below
        _("P", 129.3, "C5H7NO",        "proline"),
        _("Q", 161.1, "C5H5H[1]3N2O2", "glutamine"),
        _("R", 202.1, "C6H7H[1]6N4O+", "arginine"),
        _("S",  99.1, "C3H3H[1]2NO2",  "serine"),
        _("T", 122.1, "C4H5H[1]2NO2",  "threonine"),
        #U: selenocysteine -- update X below
        _("V", 141.7, "C5H8H[1]NO",    "valine"),
        _("W", 237.6, "C11H8H[1]2N2O", "tryptophan"),
        #X: any
        _("Y", 203.6, "C9H7H[1]2NO2",  "tyrosine"),
        #Z: E or Q
        #-: gap
        ))
    _set_amino_acid_average('B', 'DN')
    _set_amino_acid_average('J', 'LI')
    _set_amino_acid_average('Z', 'EQ')
    _set_amino_acid_average('X', 'ACDEFGHIKLMNPQRSTVWY', name='any')
    _set_amino_acid_average('-', '', name='gap')
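To illustrate the 50:50 averaging for the ambiguous codes, here is a rough sketch that averages only the residue volumes from the table above. The `average_volume` helper is a hypothetical stand-in, not the library's `_set_amino_acid_average`, and how the library averages formulas and charges is not shown here.

```python
# Volumes copied from the residue table above (only the residues needed
# for the ambiguous codes B, J and Z).
VOLUMES = {
    "D": 124.5, "N": 135.2,
    "L": 168.8, "I": 168.8,
    "E": 155.1, "Q": 161.1,
}

def average_volume(codes):
    """Equal-weight mean of the residue volumes for the given codes."""
    return sum(VOLUMES[c] for c in codes) / len(codes)

for ambiguous, members in (("B", "DN"), ("J", "LI"), ("Z", "EQ")):
    print(ambiguous, round(average_volume(members), 2))
```

Since L and I share the same volume and formula, a residue entered as J is indistinguishable from leucine or isoleucine, which is why no phosphorus appears in the computed composition.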

pkienzle · 2020-12-18T16:37:16Z

Meanwhile, you can do this in stages. Enter the FASTA sequence and press calculate, then type in

nHPO3 + sample formula @ density

where n is the number of phosphorylated serines, and the sample formula and density are those printed by the first calculation. The density will be wrong, but probably within uncertainty since (a) the number of phosphoserine (SEP) residues will be small relative to the total sequence and (b) the computed density is already a poor approximation, given that it assumes perfectly packed residue volumes regardless of protein conformation.
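The second-stage input can be assembled mechanically. The sketch below only builds the string in the form given above; `staged_input` is a hypothetical helper, and the formula and density in the usage line are placeholders rather than real calculator output.

```python
def staged_input(n_phospho, sample_formula, density):
    """Build the second-stage input: n HPO3 groups plus the first result."""
    if n_phospho < 1:
        raise ValueError("need at least one phosphorylated residue")
    return f"{n_phospho}HPO3 + {sample_formula} @ {density}"

# Placeholder values: paste the formula and density printed by the
# first calculation instead.
print(staged_input(2, "C10H16N4O5", 1.4))
```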

pkienzle · 2020-12-18T21:04:49Z

A short-term fix would be to allow FASTA sequences in mixtures, so that [email protected] + aa:S would be one phosphoserine.

[1] H2O3P density: http://www.chemspider.com/Chemical-Structure.2341689.html?rid=352b4aa5-d266-4a1f-87c4-98f363fe67b8&page_num=0
