PTM Stoichiometry #797

pcruzparri · 2024-08-27T19:29:32Z

Creating an mzLib method to calculate the stoichiometry (or site occupancy) of PTMs using the intensity of each quantified peak. The current inputs are the file paths of the protein database(s) (.xml) and of the AllQuantifiedPeaks.tsv file. The output, occupancyDict, is currently a dictionary of nested dictionaries with the following structure:

{{string PROTEIN1, {{int MAA1, {{string MODNAME1, double INTENSITY}, 
                                {string MODNAME2, double INTENSITY},
                                ...,
                                {string "Total", double INTENSITY}}} 
                    {int MAA2, {...}}, 
                   ...}},
 {string PROTEIN2, {...}},
 ...}

where PROTEINX is the protein accession, MAAX is the modified amino acid at protein position X, and MODNAME1 is the full label of the modification. For each MAAX, there is a "Total" key (instead of a modification name) that holds the total intensity of that amino acid measured in the quantified peaks file, including modified and unmodified peptides with that specific residue.

The general approach is to first collect all of the modification intensities and record them in occupancyDict. At the same time, proteinSeqRangesSeen (a dictionary keyed by protein accession) stores a list of (STARTINDEX, ENDINDEX, INTENSITY) tuples, which keeps track of the index ranges seen for each protein. Once all of the mods have been parsed, every amino acid falling into any of those ranges has its "Total" intensity increased by that range's intensity.
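A minimal sketch of the two-pass idea described above, not the code in this PR. The `QuantifiedPeptide` record, its fields, and the `Build` method are hypothetical names introduced only for illustration, and positions are assumed to be one-based protein indices.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: none of these names come from mzLib.
public static class OccupancySketch
{
    // One quantified peptide: protein accession, inclusive one-based protein range,
    // peak intensity, and localized mods keyed by protein position (assumed inputs).
    public record QuantifiedPeptide(string Accession, int Start, int End, double Intensity,
        Dictionary<int, string> ModsByProteinPosition);

    public static Dictionary<string, Dictionary<int, Dictionary<string, double>>> Build(
        IEnumerable<QuantifiedPeptide> peptides)
    {
        var occupancy = new Dictionary<string, Dictionary<int, Dictionary<string, double>>>();
        var rangesSeen = new Dictionary<string, List<(int Start, int End, double Intensity)>>();

        // Pass 1: record modification intensities and remember every range seen per protein.
        foreach (var p in peptides)
        {
            if (!rangesSeen.TryGetValue(p.Accession, out var ranges))
                rangesSeen[p.Accession] = ranges = new List<(int Start, int End, double Intensity)>();
            ranges.Add((p.Start, p.End, p.Intensity));

            foreach (var (position, modName) in p.ModsByProteinPosition)
            {
                if (!occupancy.TryGetValue(p.Accession, out var byPosition))
                    occupancy[p.Accession] = byPosition = new Dictionary<int, Dictionary<string, double>>();
                if (!byPosition.TryGetValue(position, out var byMod))
                    byPosition[position] = byMod = new Dictionary<string, double> { ["Total"] = 0 };
                byMod[modName] = byMod.GetValueOrDefault(modName) + p.Intensity;
            }
        }

        // Pass 2: add each range's intensity to "Total" for every recorded position it covers.
        foreach (var (accession, byPosition) in occupancy)
            foreach (var (position, byMod) in byPosition)
                foreach (var (start, end, intensity) in rangesSeen[accession])
                    if (position >= start && position <= end)
                        byMod["Total"] += intensity;

        return occupancy;
    }
}
```

With this layout, the occupancy of a mod at a site would be its intensity divided by the "Total" entry for the same position.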

From our discussion, I've added below some of the items I'd like to get some opinions about. I made them a task list primarily for me to keep track of what I've figured out.

Where should this code live in mzLib? The most reasonable suggestions so far are in FlashLFQResults and Readers/QuantificationResults.
To interface this nicely with MetaMorpheus, what should the inputs be? My goal now is to look into how/where this will be integrated into MM, but any suggestions on places to look to figure this out are appreciated.
I have some ideas on making the code more efficient/succinct, especially foreseeing a lot more information about the peaks being readily available in MM (like the exact protein index for a peptide/peak). Any new ideas are welcomed.

Thanks in advance!

…ve amino acid positions depending on the length for the modification string and its index. Current approach fixes that.

Alexander-Sol · 2024-08-28T19:02:37Z

mzLib/Test/FileReadingTests/TestPsmFromTsv.cs

+
+                        // get the localized modifications from the peptide full sequence and add any amino acid/modification combination not
+                        // seen yet to the occupancy dictionary
+                        foreach (KeyValuePair<int, List<string>> aaWithModList in peptideMods)


In situations like this, you can use "var aaWithModList" instead of spelling out the actual type.
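For illustration, a small sketch of the same loop written with `var`. The `CountMods` method is a hypothetical helper, not part of the PR; only the `peptideMods` shape (`Dictionary<int, List<string>>`) is taken from the snippet above.

```csharp
using System.Collections.Generic;

// Hypothetical helper used only to show the loop style.
public static class VarLoopSketch
{
    public static int CountMods(Dictionary<int, List<string>> peptideMods)
    {
        int count = 0;
        // The compiler infers KeyValuePair<int, List<string>> for aaWithModList.
        foreach (var aaWithModList in peptideMods)
        {
            count += aaWithModList.Value.Count;
        }
        return count;
    }
}
```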

nbollis

I think readers/Quant... is the best place for it. That way it can be used to find occupancy of the results from another software should that be desired.

In order to optimize the inputs and outputs of the function, you should break your test method into two: one test method that reads in all the data you need, and another method (not a test method) that gets called to calculate the occupancy. This will help you better understand what the method needs, and will help us make recommendations.

…o ptm_stoich Updating with remote

…LibUtil method for calculating a generalized occupancy. The flashLFQ calculation will call that and use intensity values for quantification.

…o ptm_stoich

…in MzLibUtil.PositionFrequencyAnalysis. ParseModifications and RemoveSpecialCharacters methods from Omics were moved to MzLibUtil. FlashLFQResults now implements a CalculatePTMOccupancy method that populates its ModInfo property. FlashLFQEngine calls the FlashLFQResults Method after the peptide and protein quantification. Still need to finish testing the FlashLFQResults and FlashLFQEngine outputs.

…arseModificatons in the Omics folder to be consistent with previous testing.

pcruzparri · 2024-10-11T16:55:23Z

Requesting a second round of reviews! The second-to-last commit message describes most of the changes in a little more detail. The pending work is to create a small enough subset of the raw data for a test similar to TestFlashLFQoutputRealData(). More rigorous testing can be done with some of the identifications in the vignette data, since some base sequences have enough variation in fullSequence mods and positions to give better case coverage.

I'd be happy to hear about 1) code optimization, 2) the currently written tests, and 3) clarifications on code commenting. In a conversation, Nic suggested using objects for my main PTM calculation code rather than the 5-level deep dictionary; thoughts on that would be useful as well. Of course, anything else is useful. TIA!

…ementations of the occupancy code due to issues with the PercolatorStyleIds (issue: peptide object did not have a base sequence) and MatchBetweenRuns (issue: peptide marked for quantification not stored with a Peptide object) tests. Noticed some of the averaging tests were failing (issue: cleanup problem due to new directory names in TestOutputToCustomDirectoryAndNameMzML()), so I patched that, too.

codecov · 2024-10-11T22:23:36Z

Codecov Report

Attention: Patch coverage is 92.05500% with 104 lines in your changes missing coverage. Please review.

Project coverage is 76.51%. Comparing base (983c3b0) to head (b146768).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...zLib/Transcriptomics/Digestion/OligoWithSetMods.cs	87.78%	20 Missing and 7 partials ⚠️
mzLib/UsefulProteomicsDatabases/ProteinDbWriter.cs	88.23%	9 Missing and 3 partials ⚠️
mzLib/Transcriptomics/NucleicAcid.cs	92.08%	6 Missing and 5 partials ⚠️
mzLib/MzLibUtil/PositionFrequencyAnalysis.cs	89.79%	7 Missing and 3 partials ⚠️
...ProteomicsDatabases/Transcriptomics/RnaDbLoader.cs	93.71%	3 Missing and 7 partials ⚠️
mzLib/UsefulProteomicsDatabases/ProteinXmlEntry.cs	79.54%	6 Missing and 3 partials ⚠️
mzLib/MzLibUtil/ClassExtensions.cs	79.48%	6 Missing and 2 partials ⚠️
mzLib/Transcriptomics/ClassExtensions.cs	93.84%	1 Missing and 3 partials ⚠️
...zLib/Transcriptomics/Digestion/NucleolyticOligo.cs	96.39%	1 Missing and 3 partials ⚠️
mzLib/FlashLFQ/Peptide.cs	57.14%	2 Missing and 1 partial ⚠️
... and 3 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #797      +/-   ##
==========================================
+ Coverage   75.52%   76.51%   +0.99%     
==========================================
  Files         202      212      +10     
  Lines       30945    32091    +1146     
  Branches     3129     3304     +175     
==========================================
+ Hits        23371    24556    +1185     
+ Misses       7040     6969      -71     
- Partials      534      566      +32

Files with missing lines	Coverage Δ
mzLib/Chemistry/ClassExtensions.cs	`100.00% <100.00%> (ø)`
mzLib/FlashLFQ/FlashLFQResults.cs	`92.02% <100.00%> (+0.17%)`	⬆️
mzLib/FlashLFQ/FlashLfqEngine.cs	`87.62% <100.00%> (+0.01%)`	⬆️
mzLib/MzLibUtil/MzLibException.cs	`100.00% <100.00%> (ø)`
.../Fragmentation/Oligo/DissociationTypeCollection.cs	`100.00% <100.00%> (+100.00%)`	⬆️
...ragmentation/Oligo/TerminusSpecificProductTypes.cs	`100.00% <100.00%> (ø)`
mzLib/Omics/IBioPolymerWithSetMods.cs	`95.23% <ø> (ø)`
mzLib/Omics/SpectrumMatch/SpectrumMatchFromTsv.cs	`97.05% <100.00%> (-0.29%)`	⬇️
...ib/Transcriptomics/Digestion/RnaDigestionParams.cs	`100.00% <100.00%> (ø)`
mzLib/Transcriptomics/Digestion/Rnase.cs	`100.00% <100.00%> (ø)`
... and 15 more

... and 4 files with indirect coverage changes

…imports of TestPsmFromTsv. Added modInfo test for FlashLFQResults.

* Added in base classes * Implemented all tests * Made initial tests pass * Removed unnecessary namespaces * Expanded test coverage * Responded to Alex Comments * Add RNA support: loading, parsing, and decoy generation Introduced support for handling RNA data within the UsefulProteomicsDatabases project. Key changes include: - Added `Transcriptomics\TestData` folder to `Test.csproj`. - Changed access modifiers in `ProteinDbLoader.cs` to internal. - Added `using` directives for `Transcriptomics` in `ProteinXmlEntry.cs`. - Introduced methods `ParseRnaEndElement` and `ParseRnaEntryEndElement` in `ProteinXmlEntry.cs`. - Modified `ParseAnnotatedMods` to check for RNA modifications. - Added project reference to `Transcriptomics.csproj` in `UsefulProteomicsDatabases.csproj`. - Added `ClassExtensions.cs` with `CreateNew` method for nucleic acids. - Added `RnaDbLoader.cs` for RNA database loading. - Added `RnaDecoyGenerator.cs` for generating decoy RNA sequences. * Add new properties and caching to oligo digestion Updated `using` directives in `TestDigestion.cs` and `OligoWithSetMods.cs` to include necessary namespaces. Added assertions in `TestDigestion.cs` for `SequenceWithChemicalFormulas` and `FullSequenceWithMassShift`. Changed `namespace` in `OligoWithSetMods.cs` to `Transcriptomics.Digestion`. Implemented and cached `SequenceWithChemicalFormulas` property in `OligoWithSetMods.cs`. * Add RNA sequence and database handling and related test cases - Added new files `ModomicsUnmodifiedTrimmed.fasta` and `ModomicsUnmodifiedTrimmed.fasta.gz` to `Test.csproj` with `CopyToOutputDirectory` set to `PreserveNewest`. - Removed the `Transcriptomics\TestData` folder from `Test.csproj`. - Introduced `Transcribe` method in `ClassExtensions.cs` for DNA to RNA transcription. - Added summary comment to `NucleolyticOligo` class in `NucleolyticOligo.cs`. - Added `ApplyRegex` method in `FastaHeaderFieldRegex.cs`. 
- Introduced `ProteinDbWriter` class in `ProteinDbWriter.cs` for writing protein and nucleic acid databases. - Modified `GetModsForThisProtein` to `GetModsForThisBioPolymer` in `ProteinDbWriter.cs`. - Added `RnaDbLoader` class in `RnaDbLoader.cs` for RNA FASTA header detection and sequence loading. - Updated user dictionary in `mzLib.sln.DotSettings` with new terms. - Added test cases in `TestDbLoader.cs` for RNA database loading and header detection. - Introduced `TestDecoyGeneration` class in `TestDecoyGenerator.cs` for RNA decoy generation tests. - Added RNA sequence file `ModomicsUnmodifiedTrimmed.fasta` and its compressed version. * Refactor and enhance RNA and oligo handling in tests - Added `using` directives for `Transcriptomics.Digestion` and `UsefulProteomicsDatabases.Transcriptomics` in `TestDecoyGenerator.cs`. - Introduced `TestCreateNew` in `TestDecoyGenerator.cs` to verify RNA and oligo creation. - Added `using` directive for `MzLibUtil` in `TestDigestion.cs`. - Added a test in `TestDigestion.cs` for exception handling with invalid sequences. - Added `using` directives for `Omics` and related namespaces in `TestFragmentation.cs`. - Modified `TestFragmentation_Modified` in `TestFragmentation.cs` to use `OligoWithSetMods` directly and added assertions. - Updated `ClassExtensions.cs` to allow setting `isDecoy` in new `RNA` objects. - Refactored `OligoWithSetMods.cs` to return a dictionary from `GetModsAfterDeserialization`. - Updated `OligoWithSetMods.cs` to initialize `_allModsOneIsNterminus` using the returned dictionary. * Broke out TerminusSpecificProductTypes class and removed unnecessary namespaces * Update ProteinXmlEntry.cs * Added gene name to RNA constructore * Added gene name to RNA constructore * Refactor and enhance exception handling and tests Refactored constructors, improved exception handling, and added comprehensive tests across multiple files. Key changes include: - `MzLibException.cs`: Updated constructor to include `innerException`. 
- `TestDecoyGenerator.cs`: Added assertions for `CreateNew` method. - `TestDigestion.cs`: Added assertions and new test for RNA digestion exception. - Refactored modification lists and added various tests for modifications. - `TestNucleicAcid.cs`: Refactored methods, adjusted precision, and updated terminus assignments. - `NucleolyticOligo.cs`: Changed parameter types, updated comments, and improved variable names. - `OligoWithSetMods.cs`: Enhanced exception messages and updated modification location checks. - `NucleicAcid.cs`: Added `using` directive, changed exception type, and refactored methods. - `mzLib.sln.DotSettings`: Updated user dictionary entries. * Add test data files and methods for RNA sequence handling Added new test data files (`20mer1.fasta`, `20mer1.fasta.gz`, `20mer1.xml`, `20mer1.xml.gz`) to the `Transcriptomics\TestData` directory in the `Test.csproj` file, ensuring they are copied to the output directory. Introduced `TestDbReadingDifferentExtensions` in `TestDbLoader.cs` to verify RNA database reading from various formats. Added `TestDigestionMaxIsoforms` in `TestDigestion.cs` to test RNA sequence digestion with max isoforms. Updated `WriteNucleicAcidXmlDatabase` in `ProteinDbWriter.cs` with remarks for future implementation. Added a TODO in `RnaDecoyGenerator.cs` regarding palindromic sequences' impact on fragment ions. Included new RNA sequence data in test files for validation. * Added test coverage to the localize method within BioPolymerWithSetMods --------- Co-authored-by: Nic Bollis <[email protected]>

… dictionaries using some data objects instead for code readability. Updated all of the previous tests (MzLibUtil and FlashLFQ) to accommodate the refactoring.

… ptm_stoich

trishorts · 2024-11-01T15:10:58Z

mzLib/MassSpectrometry/Enums/DissociationType.cs

@@ -109,6 +109,11 @@ public enum DissociationType
        /// </summary>
        LowCID,

+        /// <summary>


If this is not needed in this PR, let's put it in another one and make sure that such a mechanism has all the fragmentation types associated with it.

trishorts · 2024-11-01T15:17:43Z

mzLib/MzLibUtil/ClassExtensions.cs

+        /// <param name="IncludeNTerminus"> If true, the index of modifications at the N-terminus will be 0 (zero-based indexing). Otherwise, it is the index of the first amino acid (one-based indexing).</param>
+        /// <param name="IncludeCTerminus"> If true, the index of modifications at the C-terminus will be one more than the index of the last amino acid. Otherwise, it is the index of the last amino acid.</param>
+        /// <returns> Dictionary with the key being the amino acid position of the mod and the value being the string representing the mod</returns>
+        public static Dictionary<int, List<string>> ParseModifications(this string fullSeq, bool IncludeNTerminus=false, bool IncludeCTerminus=false)


For one full sequence, there can be only one mod at each position, so there is no need for the value to be a list.
fullSeq => fullSequence
IncludeNTerminus => modOnNTerminus
IncludeCTerminus => modOnCTerminus
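A rough sketch of what a single-mod-per-position return type could look like, assuming well-formed brackets. This is not the PR's `ParseModifications`: the class name and the bracket-depth scan are illustrative, the N-terminus is simply reported as position 0, and no special handling of C-terminal notation is attempted.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; not the implementation in MzLibUtil.
public static class SingleModPerPositionSketch
{
    // Key: number of residues seen before the mod (0 = N-terminus). Value: the mod text.
    public static Dictionary<int, string> ParseModifications(string fullSequence)
    {
        var mods = new Dictionary<int, string>();
        int residuesSeen = 0;
        int depth = 0;      // bracket nesting depth, so "[Metal:Fe[III] on S]" stays one mod
        int modStart = -1;

        for (int i = 0; i < fullSequence.Length; i++)
        {
            char c = fullSequence[i];
            if (c == '[')
            {
                if (depth == 0) modStart = i + 1;
                depth++;
            }
            else if (c == ']' && depth > 0)
            {
                depth--;
                if (depth == 0)
                    mods[residuesSeen] = fullSequence.Substring(modStart, i - modStart);
            }
            else if (depth == 0)
            {
                residuesSeen++; // any character outside brackets is counted as a residue
            }
        }
        return mods;
    }
}
```

Because the value is a plain string, a second mod reported at the same position would overwrite the first, which matches the one-mod-per-position assumption in the comment above.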

trishorts · 2024-11-01T15:19:15Z

mzLib/MzLibUtil/ClassExtensions.cs

+
+            // If there is a missed cleavage, then there will be a label on K and a Label on X modification.
+            // It'll be like [label]|[label] which complicates the positional stuff a little bit. Therefore, 
+            // RemoveSpecialCharacters will remove the "|", to ease things later on. 


I think we need to handle ambiguity outside of this method. There is no way to ensure that the base sequence in an ambiguous pair is the same, which will complicate matters. For example:
PEPT[phospho]IDE | PET[phospho]IDE
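One way the caller could check this before parsing, sketched under the assumption that `|` only separates alternative full sequences and never appears inside a mod name. `BaseSequence` and `SharesBaseSequence` are hypothetical helpers, not mzLib methods.

```csharp
using System.Linq;
using System.Text;

// Hypothetical sketch of handling ambiguity outside the parsing method.
public static class AmbiguitySketch
{
    // Drops everything inside square brackets, tracking nesting depth.
    public static string BaseSequence(string fullSequence)
    {
        var sb = new StringBuilder();
        int depth = 0;
        foreach (char c in fullSequence)
        {
            if (c == '[') depth++;
            else if (c == ']') depth--;
            else if (depth == 0) sb.Append(c);
        }
        return sb.ToString();
    }

    // True only if every '|'-separated alternative has the same base sequence.
    public static bool SharesBaseSequence(string ambiguousFullSequence)
    {
        return ambiguousFullSequence
            .Split('|')
            .Select(s => BaseSequence(s.Trim()))
            .Distinct()
            .Count() == 1;
    }
}
```

A caller could skip, or treat separately, any entry for which `SharesBaseSequence` returns false.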

trishorts · 2024-11-01T15:24:38Z

mzLib/MzLibUtil/ClassExtensions.cs

+        /// <returns></returns>
+        public static void RemoveSpecialCharacters(ref string fullSeq, string replacement = @"", string specialCharacter = @"\|")
+        {
+            // next regex is used in the event that multiple modifications are on a missed cleavage Lysine (K)


Mods with transition metals will be problematic. Suggest:
sequence = sequence.Replace("[I]", "");
sequence = sequence.Replace("[II]", "");
sequence = sequence.Replace("[III]", "");
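A small sketch of the suggested replacements wrapped in a helper, to show their effect on a nested-bracket mod. The method name and the example mod string are made up for illustration.

```csharp
// Hypothetical helper wrapping the three suggested Replace calls.
public static class OxidationStateSketch
{
    public static string StripOxidationStates(string sequence)
    {
        // Each token includes both brackets, so "[I]" cannot match inside "[II]" or "[III]".
        foreach (var token in new[] { "[I]", "[II]", "[III]" })
        {
            sequence = sequence.Replace(token, "");
        }
        return sequence;
    }
}
```

For example, `StripOxidationStates("DE[Metal:Fe[III] on E]K")` leaves a single bracket pair around the mod, so a simple bracket regex sees one mod instead of a truncated one.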

trishorts · 2024-11-01T15:28:31Z

mzLib/MzLibUtil/ClassExtensions.cs

+        {
+            // use a regex to get all modifications
+            string pattern = @"\[(.+?)\]";
+            Regex regex = new(pattern);


We need to make sure that this method never treats the text between two mods as a mod itself. For example, in P[hydroxylation]EPT[phospho]IDE, the stretch ]EPT[ must not be picked up as a modification.
I'm not sure that ]EPT[ will be ignored by your regex.
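A quick way to check this concern is to run the pattern from the snippet on the example. The wrapper class below is hypothetical; only the pattern `\[(.+?)\]` comes from the PR.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex shown in the reviewed code.
public static class ModRegexSketch
{
    public static List<string> MatchMods(string fullSequence)
    {
        // Lazy match: from each '[' to the nearest following ']'.
        return Regex.Matches(fullSequence, @"\[(.+?)\]")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

On P[hydroxylation]EPT[phospho]IDE this yields only `hydroxylation` and `phospho`, because a match must start at `[`, so `]EPT[` is skipped. The lazy match does cut a mod short when the mod text itself contains brackets, e.g. `S[Metal:Fe[III] on S]K` yields `Metal:Fe[III`.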

trishorts · 2024-11-01T15:36:09Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+namespace MzLibUtil
+{
+    // Should this have all of the parent data (i.e. protein group, protein, peptide, peptide position)? Unnecessary for now, but probably useful later.
+    public class UtilModification


UtilModification => LocalizedModificationFromTsv
modName => IdWithMotif
position => PeptidePositionZeroIsNterminus

trishorts · 2024-11-01T15:36:33Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+    {
+        public string FullSequence { get; set; }
+        public string BaseSequence { get; set; }
+        public UtilProtein ParentProtein { get; set; }


maybe this should be ProteinGroup?

trishorts · 2024-11-01T15:38:04Z

mzLib/MzLibUtil/PositionFrequencyAnalysis.cs

+        }
+    }
+
+    public class UtilProtein


Consider using the FlashLFQ ProteinGroup here.

trishorts · 2024-11-01T15:42:13Z

it's possible that "Identification" should be the currency of this realm as it is what is passed into flashlfq by MM and it is what is generated by FlashLFQ when it is run alone on any acceptable input.

pcruzparri added 2 commits August 27, 2024 13:30

Bug fix. Previous ParseModifications implementation could give negati…

8bf52b1

…ve amino acid positions depending on the length for the modification string and its index. Current approach fixes that.

Saving draft implementation of a site-occupancy calculation.

dcede87

pcruzparri requested review from trishorts, Alexander-Sol and nbollis August 27, 2024 19:30

Alexander-Sol reviewed Aug 28, 2024

nbollis reviewed Sep 4, 2024

pcruzparri added 6 commits September 12, 2024 14:42

Merge branch 'master' of https://github.com/smith-chem-wisc/mzLib int…

0e59e82

…o ptm_stoich Updating with remote

Saving some initial progress on the occupancy calculation. Started Mx…

41ef6f4

…LibUtil method for calculating a generalized occupancy. The flashLFQ calculation will call that and use intensity values for quantification.

temp

f06af28

Merge branch 'master' of https://github.com/smith-chem-wisc/mzLib int…

2ebe188

…o ptm_stoich

Removed the sandbox test Peter and changed the default arguments of P…

8d8658d

…arseModificatons in the Omics folder to be consistent with previous testing.

pcruzparri and others added 13 commits October 14, 2024 10:27

Fixed flipped logic in FlashLFQ/Peptide.GetTotalIntensity(). Cleaned …

74ed705

…imports of TestPsmFromTsv. Added modInfo test for FlashLFQResults.

Refactored the PositionFrequencyAnalysis code to eliminate the nested…

68165b0

… dictionaries using some data objects instead for code readability. Updated all of the previous tests (MzLibUtil and FlashLFQ) to accommodate the refactoring.

Bug fix. Previous ParseModifications implementation could give negati…

58e6346

…ve amino acid positions depending on the length for the modification string and its index. Current approach fixes that.

Saving draft implementation of a site-occupancy calculation.

f0d67d0

Saving some initial progress on the occupancy calculation. Started Mx…

7b04937

…LibUtil method for calculating a generalized occupancy. The flashLFQ calculation will call that and use intensity values for quantification.

temp

d2c240e

Removed the sandbox test Peter and changed the default arguments of P…

ef3ec35

…arseModificatons in the Omics folder to be consistent with previous testing.

Fixed flipped logic in FlashLFQ/Peptide.GetTotalIntensity(). Cleaned …

f21d365

…imports of TestPsmFromTsv. Added modInfo test for FlashLFQResults.

Refactored the PositionFrequencyAnalysis code to eliminate the nested…

f6caa30

… dictionaries using some data objects instead for code readability. Updated all of the previous tests (MzLibUtil and FlashLFQ) to accommodate the refactoring.

Merge branch 'ptm_stoich' of https://github.com/pcruzparri/mzLib into…

b146768

… ptm_stoich

trishorts reviewed Nov 1, 2024
