support post-translational modifications in FASTA #37

Open
pkienzle opened this issue Dec 18, 2020 · 2 comments

@pkienzle
Owner

I would like to use the NIST website (https://www.ncnr.nist.gov/resources/activation/) to calculate the SLD of a protein containing phosphoserine. I entered the protein's amino acid sequence on the website, using “J” for phosphoserine, but the calculator did not recognize “J” as phosphoserine: no phosphorus appeared in the chemical composition of the sample.
Is there another way to include phosphoserine on the website?

According to Wikipedia, J is used in FASTA to represent either L or I,[1] so I average them 50:50.[2]

I see that a number of post-translational modifications may occur,[3] but I don't know which formats can represent them. One option is to extend FASTA with an optional lowercase modification code after each sequence element; for example, phosphoserine could be written Sp rather than S. This would be easy enough to parse, but I would rather not invent a new format if one already exists.
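To make the proposed extension concrete, here is a minimal sketch of a tokenizer for it. It assumes one uppercase residue code (or `-` for a gap) optionally followed by lowercase modification letters; the `tokenize` name and the regular expression are hypothetical and not part of periodictable.

```python
import re

# Hypothetical extended syntax: an uppercase residue code (or '-' for a gap)
# optionally followed by lowercase modification letters, e.g. "Sp".
TOKEN = re.compile(r"([A-Z\-])([a-z]*)")

def tokenize(sequence):
    """Split an extended FASTA string into (residue, modification) pairs."""
    sequence = sequence.split('*', 1)[0]  # stop at first '*', like the current parser
    sequence = sequence.replace(' ', '')  # ignore spaces, like the current parser
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = TOKEN.match(sequence, pos)
        if match is None:
            # For example, a modification code with no residue before it.
            raise ValueError(
                f"unexpected character {sequence[pos]!r} at position {pos}")
        tokens.append((match.group(1), match.group(2)))
        pos = match.end()
    return tokens

print(tokenize("GSpA"))  # [('G', ''), ('S', 'p'), ('A', '')]
```

Each pair could then be looked up in a residue table keyed on both the residue and its modification, so unmodified sequences would parse exactly as before.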

Once the format is defined and the parser[4] updated, the residue table[5] will need to be extended with new codes, volumes, chemical formulae (including labile hydrogen and charge), and names.

[1] FASTA: https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation
[2] periodictable fasta 'J': https://github.com/pkienzle/periodictable/blob/master/periodictable/fasta.py#L351
[3] PTMs by residue: https://en.wikipedia.org/wiki/Posttranslational_modification#Common_PTMs_by_residue
[4] FASTA parser:

def __init__(self, name, sequence, type='aa'):
    codes = CODE_TABLES[type]
    sequence = sequence.split('*', 1)[0]  # stop at first '*'
    sequence = sequence.replace(' ', '')  # ignore spaces
    parts = tuple(codes[c] for c in sequence)
    cell_volume = sum(p.cell_volume for p in parts)
    charge = sum(p.charge for p in parts)
    structure = []
    for p in parts:
        structure.extend(list(p.labile_formula.structure))
    formula = parse_formula(structure).hill

[5] residue table:
AMINO_ACID_CODES = dict((
    #code, volume, formula, name
    _("A", 91.5, "C3H4H[1]NO", "alanine"),
    #B: D or N
    _("C", 105.6, "C3H3H[1]NOS", "cysteine"),
    _("D", 124.5, "C4H3H[1]NO3-", "aspartic acid"),
    _("E", 155.1, "C5H5H[1]NO3-", "glutamic acid"),
    _("F", 203.4, "C9H8H[1]NO", "phenylalanine"),
    _("G", 66.4, "C2H2H[1]NO", "glycine"),
    _("H", 167.3, "C6H5H[1]3N3O+", "histidine"),
    _("I", 168.8, "C6H10H[1]NO", "isoleucine"),
    #J: L or I
    _("K", 171.3, "C6H9H[1]4N2O+", "lysine"),
    _("L", 168.8, "C6H10H[1]NO", "leucine"),
    _("M", 170.8, "C5H8H[1]NOS", "methionine"),
    _("N", 135.2, "C4H3H[1]3N2O2", "asparagine"),
    #O: _("O", ???.?, "C12H21N3O3", "pyrrolysine") -- update X below
    _("P", 129.3, "C5H7NO", "proline"),
    _("Q", 161.1, "C5H5H[1]3N2O2", "glutamine"),
    _("R", 202.1, "C6H7H[1]6N4O+", "arginine"),
    _("S", 99.1, "C3H3H[1]2NO2", "serine"),
    _("T", 122.1, "C4H5H[1]2NO2", "threonine"),
    #U: selenocysteine -- update X below
    _("V", 141.7, "C5H8H[1]NO", "valine"),
    _("W", 237.6, "C11H8H[1]2N2O", "tryptophan"),
    #X: any
    _("Y", 203.6, "C9H7H[1]2NO2", "tyrosine"),
    #Z: E or Q
    #-: gap
))
_set_amino_acid_average('B', 'DN')
_set_amino_acid_average('J', 'LI')
_set_amino_acid_average('Z', 'EQ')
_set_amino_acid_average('X', 'ACDEFGHIKLMNPQRSTVWY', name='any')
_set_amino_acid_average('-', '', name='gap')
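
As an illustration of why "J" yields no phosphorus, the sketch below applies an equal-weight average to the volumes of the residues an ambiguity code stands for. It is only a reading of "average them 50:50", not the body of `_set_amino_acid_average`; the `average_volume` helper is hypothetical, and the volumes are copied from the table above.

```python
# Volumes copied from the residue table above.
VOLUMES = {"D": 124.5, "N": 135.2, "L": 168.8, "I": 168.8}

def average_volume(codes, volumes=VOLUMES):
    """Equal-weight mean volume over the residues named in `codes`.

    Assumes `codes` is non-empty; the gap code ('-' with no residues)
    would need separate handling.
    """
    return sum(volumes[c] for c in codes) / len(codes)

print(average_volume("LI"))            # J: L and I share the same volume
print(round(average_volume("DN"), 2))  # B: midway between D and N
```

Since leucine and isoleucine have the same formula and volume in the table, "J" is indistinguishable from either of them and contains no phosphorus.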

@pkienzle
Owner Author

pkienzle commented Dec 18, 2020

In the meantime, you can do this in stages: enter the FASTA sequence, press Calculate, then type in

nHPO3 + sample formula @ density

where n is the number of phosphorylated serines, and the sample formula and density are those printed by the first calculation. The density will be wrong, but probably within uncertainty, since (a) the number of SEP residues will be small relative to the total sequence and (b) the computed density is already a poor approximation, given that it assumes perfectly packed residue volumes regardless of protein conformation.

@pkienzle
Owner Author

A short-term fix would be to allow FASTA sequences in mixtures, so that [email protected] + aa:S would be one phosphoserine.

[1] H2O3P density: http://www.chemspider.com/Chemical-Structure.2341689.html?rid=352b4aa5-d266-4a1f-87c4-98f363fe67b8&page_num=0
