Merge pull request #7 from MRCToxBioinformatics/add_phosphoproteomics

Add phosphoproteomics notebooks
MRCToxBioinformatics · Jul 2, 2024 · f083e94 · f083e94
2 parents ef7bf80 + b3e9820
commit f083e94
Show file tree

Hide file tree

Showing 10 changed files with 4,496 additions and 25 deletions.
diff --git a/Markdowns/TMT_phospho.Rmd b/Markdowns/TMT_phospho.Rmd
diff --git a/Markdowns/TMT_phospho.html b/Markdowns/TMT_phospho.html
diff --git a/Markdowns/TMT_phospho_stats.Rmd b/Markdowns/TMT_phospho_stats.Rmd
@@ -0,0 +1,207 @@
+---
+title: "Phosphoproteomics using Tandem Mass Tags"
+subtitle: "Intersecting phosphosites and total peptides and statistical testing"
+author: "Tom Smith"
+date: "`r Sys.Date()`"
+output:
+  html_document:
+    code_folding: show
+  pdf_document: default
+bibliography: bib.json
+---
+
+### Load dependencies
+
+Load the required libraries.
+
+```{r, message=FALSE, warning=FALSE, include=FALSE}
+library(camprotR)
+library(MSnbase)
+library(ggplot2)
+library(tidyr)
+library(dplyr)
+library(here)
+library(limma)
+library(uniprotREST)
+```
+
+
+### Preamble
+
+The correct approach for statistical testing for changes in phosphorylation will depend on the details of the experimental design. Here, we have quantified phospho and unmodified peptides, so we can identify changes in phosphorylation which are not explained by changes in the unmodified protein, e.g changes in the proportion of the protein that is phosphorylated. 
+
+To acheive this, we will jointly model the phospho and unmodified peptides which overlap each phosphosite, using a linear model with an interaction term.
+
+### Input data
+
+We start by reading in the MSnSets created in the previous notebook [QC PSM-level quantification, filtering and summarisation to protein-level abundance](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho.html)
+
+```{r}
+total <- readRDS(here('results/total.rds'))
+phospho_sites <- readRDS(here('results/phospho_sites.rds'))
+```
+
+
+Below, we identify the total peptides that overlap each phosphosite. Note that each phosphosite may intersect multiple total peptides (due to missed cleavages) and each total peptide may intersect multiple phosphosites (due to multiple phosphosites per peptide).
+```{r}
+phospho_f <- phospho_sites %>% fData() %>% tibble::rownames_to_column('ID')
+total_f <- total %>% fData() %>% tibble::rownames_to_column('ID')
+
+phospho_peptide_to_total_peptides <- vector('list', length=length(phospho_f$ID))
+names(phospho_peptide_to_total_peptides) <- phospho_f$ID
+n <- 0
+for(phospho_id in phospho_f$ID){
+  row <- phospho_f %>% filter(ID==phospho_id)
+  
+  n <- n + 1
+  ptm_positions <- as.numeric(strsplit(row$ptm_position, split='\\|')[[1]])
+  
+  total_peptide_ids <- total_f %>%
+    filter(Master.Protein.Accessions %in% row$Master.Protein.Accessions,
+           peptide_start<=min(ptm_positions),
+           peptide_end>=max(ptm_positions)) %>%
+    pull(ID)  
+  
+  phospho_peptide_to_total_peptides[[row$ID]] <- total_peptide_ids
+}
+
+```
+
+Now, how many overlapping features. Note that for the phospho and total to be considered intersecting, there must be at least one total peptide which overlaps the phosphosite. Out of the `r length(phospho_f$ID)` phosphosites, only `r sum(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0))` have at least one overlapping total peptide.
+```{r}
+print(length(phospho_f$ID))
+print(table(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0)))
+print(table(sapply(phospho_peptide_to_total_peptides, length)))
+```
+limma requires a single row per feature with a columns per sample. Here, the phospho and total are separate samples so we need to combine them into a single row.
+```{r}
+combined_exprs <- phospho_peptide_to_total_peptides %>% names() %>% lapply(function(x){
+  total_peptides <- phospho_peptide_to_total_peptides[[x]]
+  
+  if(length(total_peptides)>0) {
+    exprs_values <- c(exprs(phospho_sites[x,]),
+                      colSums(exprs(total[total_peptides,]), na.rm=TRUE))
+    exprs_values <- unname(exprs_values)
+    return(exprs_values)
+  } else { return(NULL) }
+})
+names(combined_exprs) <- names(phospho_peptide_to_total_peptides)
+```
+
+Create the completed matrix of phospho and total.
+```{r}
+
+combined_exprs <- combined_exprs[sapply(combined_exprs, function(x) !is.null(x))]
+all_combined_exprs <- bind_rows(combined_exprs) %>% t() %>% as.matrix
+
+rownames(all_combined_exprs) <- names(combined_exprs)
+
+colnames(all_combined_exprs) <- c(paste0(colnames(phospho_sites), '_phospho'),
+                                  paste0(colnames(total), '_total'))
+
+```
+
+Inspect the combined quantification matrix. 
+```{r}
+head(all_combined_exprs) #
+```
+
+```{r}
+all_combined_fdata <- fData(phospho_sites[rownames(all_combined_exprs),])[
+  ,c('Master.Protein.Accessions', 'ptm_position')]
+all_combined_fdata$n_total <- sapply(rownames(all_combined_exprs),
+                                     function(x) length(phospho_peptide_to_total_peptides[[x]]))
+
+all_combined_pdata <- rbind((pData(phospho_sites) %>% mutate(type='phospho')),
+                            (pData(total) %>% mutate(type='total')))
+rownames(all_combined_pdata) <- colnames(all_combined_exprs)
+all_combined_res <- MSnSet(as.matrix.data.frame(all_combined_exprs),
+                           all_combined_fdata,
+                           all_combined_pdata)
+
+head(fData(all_combined_res))
+pData(all_combined_res)
+```
+
+Below, we subset to the first 2 'conditions', where condition 1 = tags 1-4 (x1 phospho; x1 total) and condition 2 = tags 5-7 (x6 phospho; x2 total). 
+The ground truth for the difference between condition 2 and condition 1 is therefore a 3-fold increase in phosphorylation for the yeast proteins and reduced phosphorylation for the human proteins. For the human proteins, given the amounts of spike in yeast and balancing human proteins to make up the labelled material, we expect a 30% drop in phosphorylation.
+```{r}
+pairwise_comparison_combined_res <- all_combined_res[,pData(all_combined_res)$condition %in% 1:2]
+pData(pairwise_comparison_combined_res)
+```
+
+### Run limma
+
+
+Below, we perform the limma analysis. First though, some clarification on the model we're using.
+
+We have two experimental variables:
+- type: phospho or total
+- condition: 1x or 2x spike in
+
+```{r}
+type <- factor(pData(pairwise_comparison_combined_res)$type, level=c('total', 'phospho'))
+condition <- factor(pData(pairwise_comparison_combined_res)$condition, levels=1:2)
+print(paste(type, condition))
+```
+
+When we use an model with the terms type, condition and type:condition, the interaction term captures the difference between the observed data and the simple additive model when the abundance is dependent upon just the separate effect of the type and condition variables and there is no combinatorial effect.
+```{r}
+# model without interaction term
+print(model.matrix(~type+condition))
+
+# model with an interaction term
+print(model.matrix(~type+condition+type:condition))
+
+```
+
+Below, we run the linear modeling. For more details about `limma`, see the
+[Differential abundance testing for TMT proteomics]('~/git_repos/bioinf_training/Proteomics.data.analysis/Markdowns/Stats_diff_abundance_TMT.Rmd') notebook. 
+
+```{r}
+dat <- exprs(pairwise_comparison_combined_res) %>% log(base=2)
+study.design <- model.matrix(~type*condition)
+
+fit <- lmFit(dat, study.design)
+fit <- eBayes(fit, trend=TRUE)
+
+limma.results <- topTable(fit, coef = colnames(fit$coefficients)[4], n = Inf, confint=TRUE)
+limma.results$sigma <- fit$sigma
+
+```
+
+
+We would like to add information about the species to make sure the fold-changes are in the expected direction. We'll use the `uniprotREST` package to query 
+```{r}
+limma.results$protein <- rownames(limma.results) %>%
+  strsplit(split='_') %>%
+  sapply('[[', 1) 
+
+species_res <-  uniprot_map(
+  ids = unique(limma.results$protein),
+  from = "UniProtKB_AC-ID",
+  to = "UniProtKB",
+  fields = "organism_name") %>%
+  mutate(species=recode(Organism,
+                        "Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)"="S.cerevisiae",
+                        "Homo sapiens (Human)"="H.sapiens"))
+```
+Below, we plot the limma results using a volcano plot, with two panels, one for each species.
+
+```{r}
+limma.results %>%
+  merge(species_res, by.x='protein', by.y='From') %>%
+  ggplot(aes(logFC, -log10(P.Value), fill=adj.P.Val<0.01)) +
+  geom_point(pch=21, size=2, colour='grey70') +
+  facet_wrap(~species) +
+  theme_camprot(base_size=15, base_family='sans', border=FALSE) +
+  theme(strip.background=element_blank()) +
+  xlab('Log2 fold change') +
+  ylab('-log10(p-value)') +
+  scale_fill_manual(values=c('grey90', get_cat_palette(2)[2]), name='FDR < 1%')
+```
+
+
+The direction of change is as expected (reduced for human, increased for yeast), but only a subset of the fold-changes reach a 1% FDR threshold for significance. The fold-changes are appoximately what we would expect too: human phosphosites should be reduced by 30% = `r round(log2(0.7), 2)` on a log2 scale and the yeast phosphosites should be increased by 3-fold = `r round(log2(3), 2)` on a log2 scale.
+
+
diff --git a/Markdowns/TMT_phospho_stats.html b/Markdowns/TMT_phospho_stats.html
diff --git a/Markdowns/TMT_phospho_stats.nb.html b/Markdowns/TMT_phospho_stats.nb.html
diff --git a/Markdowns/bib.json b/Markdowns/bib.json
@@ -734,44 +734,56 @@
 		}
 	},
 	{
-		"id": "http://zotero.org/users/5634351/items/YY5VWWY7",
+		"id": "http://zotero.org/users/5634351/items/TZ6QFFXY",
 		"type": "article-journal",
-		"abstract": "MOTIVATION: When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations.\nRESULTS: We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably.\nAVAILABILITY: Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project http://www.bioconductor.org.\nSUPPLEMENTARY INFORMATION: Additional figures may be found at http://www.stat.berkeley.edu/~bolstad/normalize/index.html",
-		"container-title": "Bioinformatics (Oxford, England)",
-		"DOI": "10.1093/bioinformatics/19.2.185",
-		"ISSN": "1367-4803",
-		"issue": "2",
-		"journalAbbreviation": "Bioinformatics",
+		"abstract": "An algorithm for the assignment of phosphorylation sites in peptides is described. The program uses tandem mass spectrometry data in conjunction with the respective peptide sequences to calculate site probabilities for all potential phosphorylation sites. Tandem mass spectra from synthetic phosphopeptides were used for optimization of the scoring parameters employing all commonly used fragmentation techniques. Calculation of probabilities was adapted to the different fragmentation methods and to the maximum mass deviation of the analysis. The software includes a novel approach to peak extraction, required for matching experimental data to the theoretical values of all isoforms, by defining individual peak depths for the different regions of the tandem mass spectrum. Mixtures of synthetic phosphopeptides were used to validate the program by calculation of its false localization rate versus site probability cutoff characteristic. Notably, the empirical obtained precision was higher than indicated by the applied probability cutoff. In addition, the performance of the algorithm was compared to existing approaches to site localization such as Ascore. In order to assess the practical applicability of the algorithm to large data sets, phosphopeptides from a biological sample were analyzed, localizing more than 3000 nonredundant phosphorylation sites. Finally, the results obtained for the different fragmentation methods and localization tools were compared and discussed.",
+		"container-title": "Journal of Proteome Research",
+		"DOI": "10.1021/pr200611n",
+		"ISSN": "1535-3907",
+		"issue": "12",
+		"journalAbbreviation": "J Proteome Res",
 		"language": "eng",
-		"note": "PMID: 12538238",
-		"page": "185-193",
+		"note": "PMID: 22073976",
+		"page": "5354-5362",
 		"source": "PubMed",
-		"title": "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias",
-		"volume": "19",
+		"title": "Universal and confident phosphorylation site localization using phosphoRS",
+		"volume": "10",
 		"author": [
 			{
-				"family": "Bolstad",
-				"given": "B. M."
+				"family": "Taus",
+				"given": "Thomas"
+			},
+			{
+				"family": "Köcher",
+				"given": "Thomas"
 			},
 			{
-				"family": "Irizarry",
-				"given": "R. A."
+				"family": "Pichler",
+				"given": "Peter"
 			},
 			{
-				"family": "Astrand",
-				"given": "M."
+				"family": "Paschke",
+				"given": "Carmen"
 			},
 			{
-				"family": "Speed",
-				"given": "T. P."
+				"family": "Schmidt",
+				"given": "Andreas"
+			},
+			{
+				"family": "Henrich",
+				"given": "Christoph"
+			},
+			{
+				"family": "Mechtler",
+				"given": "Karl"
 			}
 		],
 		"issued": {
 			"date-parts": [
 				[
-					"2003",
-					1,
-					22
+					"2011",
+					12,
+					2
 				]
 			]
 		}

diff --git a/Markdowns/results/benchmark_tmt_phospho_psm_filt.rds b/Markdowns/results/benchmark_tmt_phospho_psm_filt.rds
diff --git a/Markdowns/results/phospho_sites.rds b/Markdowns/results/phospho_sites.rds
diff --git a/Markdowns/results/total.rds b/Markdowns/results/total.rds
diff --git a/index.md b/index.md
@@ -75,7 +75,8 @@ obtain the quantification data
 
 Additional subsections are included to cover further topics for each flavour.
 
-
+In addition to the core part of the course, there are extended materials to cover:
+- Phosphoproteomics using Tandem Mass Tags
 
 ### 1. Label-Free Quantification (LFQ)
 
@@ -105,10 +106,11 @@ Additional subsections are included to cover further topics for each flavour.
 - [Incorporation rate testing](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/SILAC_incorporation.html)
 
 
+## Extended materials
 
+- [Phosphoproteomics using TMT](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho.html)
 
-
-
+- [Phosphoproteomics statistical testing](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho_stats.html)
 
 
 ## Additional resources