Commit

Merge pull request #7 from MRCToxBioinformatics/add_phosphoproteomics
Add phosphoproteomics notebooks
TomSmithCGAT authored Jul 2, 2024
2 parents ef7bf80 + b3e9820 commit f083e94
Showing 10 changed files with 4,496 additions and 25 deletions.
328 changes: 328 additions & 0 deletions Markdowns/TMT_phospho.Rmd

Large diffs are not rendered by default.

1,000 changes: 1,000 additions & 0 deletions Markdowns/TMT_phospho.html

Large diffs are not rendered by default.

207 changes: 207 additions & 0 deletions Markdowns/TMT_phospho_stats.Rmd
@@ -0,0 +1,207 @@
---
title: "Phosphoproteomics using Tandem Mass Tags"
subtitle: "Intersecting phosphosites and total peptides and statistical testing"
author: "Tom Smith"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: show
  pdf_document: default
bibliography: bib.json
---

### Load dependencies

Load the required libraries.

```{r, message=FALSE, warning=FALSE, include=FALSE}
library(camprotR)
library(MSnbase)
library(ggplot2)
library(tidyr)
library(dplyr)
library(here)
library(limma)
library(uniprotREST)
```


### Preamble

The correct approach to statistical testing for changes in phosphorylation depends on the details of the experimental design. Here, we have quantified both phospho and unmodified peptides, so we can identify changes in phosphorylation which are not explained by changes in the abundance of the unmodified protein, e.g. changes in the proportion of the protein that is phosphorylated.

To achieve this, we will jointly model the phospho and unmodified peptides which overlap each phosphosite, using a linear model with an interaction term.

### Input data

We start by reading in the MSnSets created in the previous notebook, [QC PSM-level quantification, filtering and summarisation to protein-level abundance](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho.html).

```{r}
total <- readRDS(here('results/total.rds'))
phospho_sites <- readRDS(here('results/phospho_sites.rds'))
```


Below, we identify the total peptides that overlap each phosphosite. Note that each phosphosite may intersect multiple total peptides (due to missed cleavages) and each total peptide may intersect multiple phosphosites (due to multiple phosphosites per peptide).
```{r}
# Feature metadata for phosphosites and total peptides, keyed by feature ID
phospho_f <- phospho_sites %>% fData() %>% tibble::rownames_to_column('ID')
total_f <- total %>% fData() %>% tibble::rownames_to_column('ID')

phospho_peptide_to_total_peptides <- vector('list', length=length(phospho_f$ID))
names(phospho_peptide_to_total_peptides) <- phospho_f$ID

for(phospho_id in phospho_f$ID){

  row <- phospho_f %>% filter(ID==phospho_id)

  # A feature may carry several '|'-separated phosphosite positions
  ptm_positions <- as.numeric(strsplit(row$ptm_position, split='\\|')[[1]])

  # Total peptides from the same protein that span all the positions
  total_peptide_ids <- total_f %>%
    filter(Master.Protein.Accessions %in% row$Master.Protein.Accessions,
           peptide_start<=min(ptm_positions),
           peptide_end>=max(ptm_positions)) %>%
    pull(ID)

  phospho_peptide_to_total_peptides[[row$ID]] <- total_peptide_ids
}
```
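
The overlap rule used in the loop above can be checked on a toy example. This is a minimal sketch in base R with made-up accessions and coordinates; `toy_total` and `overlapping_peptides` are hypothetical names that are not part of the notebook data. A total peptide is kept only if it comes from the same protein and spans every modified position.

```r
# Toy total peptides (hypothetical accessions and coordinates)
toy_total <- data.frame(
  ID = c('pepA', 'pepB', 'pepC', 'pepD'),
  Master.Protein.Accessions = c('P1', 'P1', 'P1', 'P2'),
  peptide_start = c(1, 5, 12, 5),
  peptide_end = c(10, 20, 25, 20),
  stringsAsFactors = FALSE
)

# Return the IDs of total peptides that span all PTM positions on one protein
overlapping_peptides <- function(total_df, protein, ptm_position) {
  # ptm_position uses the same '|'-separated encoding as the notebook
  positions <- as.numeric(strsplit(ptm_position, split = '\\|')[[1]])
  keep <- total_df$Master.Protein.Accessions == protein &
    total_df$peptide_start <= min(positions) &
    total_df$peptide_end >= max(positions)
  total_df$ID[keep]
}

single_site <- overlapping_peptides(toy_total, 'P1', '8')     # one phosphosite
double_site <- overlapping_peptides(toy_total, 'P1', '8|15')  # two phosphosites
no_overlap <- overlapping_peptides(toy_total, 'P1', '30')     # no spanning peptide
```

In this toy data, the single site at position 8 matches two total peptides, as can happen with a missed cleavage, while the doubly phosphorylated feature matches only the peptide that spans both positions.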

Next, we count the overlapping features. For a phosphosite to be considered intersecting, at least one total peptide must overlap it. Out of the `r length(phospho_f$ID)` phosphosites, only `r sum(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0))` have at least one overlapping total peptide.
```{r}
print(length(phospho_f$ID))
print(table(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0)))
print(table(sapply(phospho_peptide_to_total_peptides, length)))
```
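`limma` requires a matrix with a single row per feature and one column per sample. Here, the phospho and total quantifications are held in separate objects, so we combine them into a single row per phosphosite, summing the abundances of the total peptides that overlap each site.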
```{r}
combined_exprs <- phospho_peptide_to_total_peptides %>% names() %>% lapply(function(x){

  total_peptides <- phospho_peptide_to_total_peptides[[x]]

  if(length(total_peptides)>0) {
    # Phosphosite abundances followed by the summed abundances of the
    # overlapping total peptides
    exprs_values <- c(exprs(phospho_sites[x,]),
                      colSums(exprs(total[total_peptides,]), na.rm=TRUE))
    exprs_values <- unname(exprs_values)
    return(exprs_values)
  } else { return(NULL) }
})

names(combined_exprs) <- names(phospho_peptide_to_total_peptides)
```
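
To make the shape of each combined row concrete, here is a minimal base R sketch with invented values (`toy_phospho`, `toy_total_exprs` and `toy_overlap` are hypothetical and stand in for the MSnSet expression matrices). It concatenates one phosphosite's abundances with the column sums of its overlapping total peptides.

```r
# Hypothetical quantification: 1 phosphosite and 3 total peptides, 2 samples each
toy_phospho <- matrix(c(10, 20), nrow = 1,
                      dimnames = list('P1_8', c('s1', 's2')))
toy_total_exprs <- matrix(c(100, 50, NA,    # sample s1
                            110, 60, 5),    # sample s2
                          nrow = 3,
                          dimnames = list(c('pepA', 'pepB', 'pepC'), c('s1', 's2')))

# Total peptides assumed to overlap the phosphosite
toy_overlap <- c('pepA', 'pepB')

# Phospho values first, then the summed total values for the same samples
toy_row <- c(toy_phospho['P1_8', ],
             colSums(toy_total_exprs[toy_overlap, , drop = FALSE], na.rm = TRUE))
names(toy_row) <- c(paste0(colnames(toy_phospho), '_phospho'),
                    paste0(colnames(toy_total_exprs), '_total'))
```

Note that with `na.rm=TRUE`, a sample in which every overlapping total peptide is missing would sum to zero, which becomes `-Inf` after a log2 transformation, so it may be worth checking for zeros before modelling.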

Create the complete matrix of phospho and total quantification, keeping only the phosphosites with at least one overlapping total peptide.
```{r}
combined_exprs <- combined_exprs[sapply(combined_exprs, function(x) !is.null(x))]
all_combined_exprs <- bind_rows(combined_exprs) %>% t() %>% as.matrix
rownames(all_combined_exprs) <- names(combined_exprs)
colnames(all_combined_exprs) <- c(paste0(colnames(phospho_sites), '_phospho'),
paste0(colnames(total), '_total'))
```

Inspect the combined quantification matrix.
```{r}
head(all_combined_exprs)
```

Next, we combine the quantification matrix with the feature and sample metadata into a single MSnSet.
```{r}
all_combined_fdata <- fData(phospho_sites[rownames(all_combined_exprs),])[
  ,c('Master.Protein.Accessions', 'ptm_position')]

# Number of total peptides overlapping each phosphosite
all_combined_fdata$n_total <- sapply(rownames(all_combined_exprs),
  function(x) length(phospho_peptide_to_total_peptides[[x]]))

all_combined_pdata <- rbind((pData(phospho_sites) %>% mutate(type='phospho')),
                            (pData(total) %>% mutate(type='total')))
rownames(all_combined_pdata) <- colnames(all_combined_exprs)

all_combined_res <- MSnSet(as.matrix.data.frame(all_combined_exprs),
                           all_combined_fdata,
                           all_combined_pdata)

head(fData(all_combined_res))
pData(all_combined_res)
```

Below, we subset to the first 2 'conditions', where condition 1 = tags 1-4 (x1 phospho; x1 total) and condition 2 = tags 5-7 (x6 phospho; x2 total).
The ground truth for the difference between condition 2 and condition 1 is therefore a 3-fold increase in phosphorylation for the yeast proteins and reduced phosphorylation for the human proteins. For the human proteins, given the amounts of spiked-in yeast protein and the human protein used to balance the labelled material, we expect a 30% drop in phosphorylation.
```{r}
pairwise_comparison_combined_res <- all_combined_res[,pData(all_combined_res)$condition %in% 1:2]
pData(pairwise_comparison_combined_res)
```
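
The expected yeast fold change follows from the spike-in amounts quoted above. Below is a short arithmetic sketch, assuming the 'x6 phospho; x2 total' labels denote yeast amounts relative to condition 1; the variable names are illustrative only.

```r
# Relative yeast amounts per condition, as stated above
phospho_amount <- c(condition1 = 1, condition2 = 6)
total_amount <- c(condition1 = 1, condition2 = 2)

# Change in phospho and in total between condition 2 and condition 1
phospho_fc <- phospho_amount[['condition2']] / phospho_amount[['condition1']]
total_fc <- total_amount[['condition2']] / total_amount[['condition1']]

# Change in phosphorylation that is not explained by the change in total
expected_fc <- phospho_fc / total_fc
expected_log2_fc <- log2(expected_fc)
```

This ratio of ratios is the quantity that the interaction term in the linear model estimates on the log2 scale.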

### Run limma


Below, we perform the limma analysis. First though, some clarification on the model we're using.

We have two experimental variables:
- type: phospho or total
- condition: 1x or 2x spike in

```{r}
type <- factor(pData(pairwise_comparison_combined_res)$type, levels=c('total', 'phospho'))
condition <- factor(pData(pairwise_comparison_combined_res)$condition, levels=1:2)
print(paste(type, condition))
```

When we fit a model with the terms type, condition and type:condition, the interaction term captures the deviation of the observed data from the simple additive model, in which the abundance depends only on the separate effects of type and condition and there is no combinatorial effect. In this design, the interaction is the change in phospho abundance between conditions beyond the change seen for the total peptides.
```{r}
# model without interaction term
print(model.matrix(~type+condition))
# model with an interaction term
print(model.matrix(~type+condition+type:condition))
```
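
To see what the interaction coefficient measures, the following minimal sketch fits the same formula with base R `lm` to invented log2 abundances for a single feature (`toy_design` and its values are hypothetical). With a balanced two-by-two design, the interaction coefficient equals the difference of differences between the four group means.

```r
# Hypothetical log2 abundances for one phosphosite, 2 replicates per group
toy_design <- data.frame(
  type = factor(rep(c('total', 'phospho'), each = 4), levels = c('total', 'phospho')),
  condition = factor(rep(rep(1:2, each = 2), times = 2), levels = 1:2),
  log2_abundance = c(10.0, 10.2, 11.0, 11.2,   # total: condition 1, then condition 2
                     8.0, 8.2, 10.6, 10.8)     # phospho: condition 1, then condition 2
)

toy_fit <- lm(log2_abundance ~ type * condition, data = toy_design)
interaction_coef <- unname(coef(toy_fit)['typephospho:condition2'])

# The same quantity computed directly from the four group means
group_means <- tapply(toy_design$log2_abundance,
                      list(toy_design$type, toy_design$condition), mean)
diff_of_diffs <- (group_means['phospho', '2'] - group_means['phospho', '1']) -
  (group_means['total', '2'] - group_means['total', '1'])
```

Here the total increases by 1.0 and the phospho by 2.6 on the log2 scale, so the interaction term is 1.6: the part of the phospho change that the total does not account for.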

Below, we run the linear modeling. For more details about `limma`, see the
[Differential abundance testing for TMT proteomics]('~/git_repos/bioinf_training/Proteomics.data.analysis/Markdowns/Stats_diff_abundance_TMT.Rmd') notebook.

```{r}
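# log2-transform the abundances
dat <- exprs(pairwise_comparison_combined_res) %>% log(base=2)

study.design <- model.matrix(~type*condition)
fit <- lmFit(dat, study.design)
fit <- eBayes(fit, trend=TRUE)

# The 4th coefficient is the type:condition interaction term
limma.results <- topTable(fit, coef = colnames(fit$coefficients)[4], n = Inf, confint=TRUE)

# topTable sorts the rows, so match sigma back to each feature by row name
limma.results$sigma <- fit$sigma[match(rownames(limma.results), rownames(dat))]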
```
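
Because `topTable` returns rows sorted by significance by default rather than in input order, any per-feature vector taken from the fit needs to be aligned by row name and not by position. A minimal base R sketch with invented values (all names and numbers here are hypothetical):

```r
# Hypothetical per-feature statistics, in the original row order of the fit
feature_ids <- c('siteA', 'siteB', 'siteC')
p_values <- c(siteA = 0.20, siteB = 0.01, siteC = 0.05)
sigma <- c(0.5, 0.1, 0.3)   # unnamed, same order as feature_ids

# A results table sorted by p-value, mimicking a sorted results table
sorted_p <- sort(p_values)
sorted_results <- data.frame(P.Value = unname(sorted_p),
                             row.names = names(sorted_p))

# Align by row name rather than by position
sorted_results$sigma <- sigma[match(rownames(sorted_results), feature_ids)]
```

Assigning `sigma` directly would have attached the value for `siteA` to `siteB`, since the table order no longer matches the input order.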


We would like to add information about the species to make sure the fold-changes are in the expected direction. We'll use the `uniprotREST` package to query UniProt for the organism of each protein.
```{r}
# The protein accession is the part of the feature ID before the first underscore
limma.results$protein <- rownames(limma.results) %>%
  strsplit(split='_') %>%
  sapply('[[', 1)

species_res <- uniprot_map(
  ids = unique(limma.results$protein),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "organism_name") %>%
  mutate(species=recode(Organism,
                        "Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)"="S.cerevisiae",
                        "Homo sapiens (Human)"="H.sapiens"))
```
Below, we plot the limma results using a volcano plot, with two panels, one for each species.

```{r}
limma.results %>%
  merge(species_res, by.x='protein', by.y='From') %>%
  ggplot(aes(logFC, -log10(P.Value), fill=adj.P.Val<0.01)) +
  geom_point(pch=21, size=2, colour='grey70') +
  facet_wrap(~species) +
  theme_camprot(base_size=15, base_family='sans', border=FALSE) +
  theme(strip.background=element_blank()) +
  xlab('Log2 fold change') +
  ylab('-log10(p-value)') +
  scale_fill_manual(values=c('grey90', get_cat_palette(2)[2]), name='FDR < 1%')
```


The direction of change is as expected (reduced for human, increased for yeast), but only a subset of the fold-changes reach the 1% FDR threshold for significance. The fold-changes are also approximately what we would expect: human phosphosites should be reduced by 30% = `r round(log2(0.7), 2)` on a log2 scale and the yeast phosphosites should be increased by 3-fold = `r round(log2(3), 2)` on a log2 scale.


812 changes: 812 additions & 0 deletions Markdowns/TMT_phospho_stats.html

Large diffs are not rendered by default.

2,110 changes: 2,110 additions & 0 deletions Markdowns/TMT_phospho_stats.nb.html

Large diffs are not rendered by default.

56 changes: 34 additions & 22 deletions Markdowns/bib.json
@@ -734,44 +734,56 @@
}
},
{
"id": "http://zotero.org/users/5634351/items/YY5VWWY7",
"id": "http://zotero.org/users/5634351/items/TZ6QFFXY",
"type": "article-journal",
"abstract": "MOTIVATION: When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations.\nRESULTS: We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably.\nAVAILABILITY: Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project http://www.bioconductor.org.\nSUPPLEMENTARY INFORMATION: Additional figures may be found at http://www.stat.berkeley.edu/~bolstad/normalize/index.html",
"container-title": "Bioinformatics (Oxford, England)",
"DOI": "10.1093/bioinformatics/19.2.185",
"ISSN": "1367-4803",
"issue": "2",
"journalAbbreviation": "Bioinformatics",
"abstract": "An algorithm for the assignment of phosphorylation sites in peptides is described. The program uses tandem mass spectrometry data in conjunction with the respective peptide sequences to calculate site probabilities for all potential phosphorylation sites. Tandem mass spectra from synthetic phosphopeptides were used for optimization of the scoring parameters employing all commonly used fragmentation techniques. Calculation of probabilities was adapted to the different fragmentation methods and to the maximum mass deviation of the analysis. The software includes a novel approach to peak extraction, required for matching experimental data to the theoretical values of all isoforms, by defining individual peak depths for the different regions of the tandem mass spectrum. Mixtures of synthetic phosphopeptides were used to validate the program by calculation of its false localization rate versus site probability cutoff characteristic. Notably, the empirical obtained precision was higher than indicated by the applied probability cutoff. In addition, the performance of the algorithm was compared to existing approaches to site localization such as Ascore. In order to assess the practical applicability of the algorithm to large data sets, phosphopeptides from a biological sample were analyzed, localizing more than 3000 nonredundant phosphorylation sites. Finally, the results obtained for the different fragmentation methods and localization tools were compared and discussed.",
"container-title": "Journal of Proteome Research",
"DOI": "10.1021/pr200611n",
"ISSN": "1535-3907",
"issue": "12",
"journalAbbreviation": "J Proteome Res",
"language": "eng",
"note": "PMID: 12538238",
"page": "185-193",
"note": "PMID: 22073976",
"page": "5354-5362",
"source": "PubMed",
"title": "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias",
"volume": "19",
"title": "Universal and confident phosphorylation site localization using phosphoRS",
"volume": "10",
"author": [
{
"family": "Bolstad",
"given": "B. M."
"family": "Taus",
"given": "Thomas"
},
{
"family": "Köcher",
"given": "Thomas"
},
{
"family": "Irizarry",
"given": "R. A."
"family": "Pichler",
"given": "Peter"
},
{
"family": "Astrand",
"given": "M."
"family": "Paschke",
"given": "Carmen"
},
{
"family": "Speed",
"given": "T. P."
"family": "Schmidt",
"given": "Andreas"
},
{
"family": "Henrich",
"given": "Christoph"
},
{
"family": "Mechtler",
"given": "Karl"
}
],
"issued": {
"date-parts": [
[
"2003",
1,
22
"2011",
12,
2
]
]
}
Binary file not shown.
Binary file added Markdowns/results/phospho_sites.rds
Binary file not shown.
Binary file added Markdowns/results/total.rds
Binary file not shown.
8 changes: 5 additions & 3 deletions index.md
@@ -75,7 +75,8 @@ obtain the quantification data

Additional subsections are included to cover further topics for each flavour.


In addition to the core part of the course, there are extended materials to cover:
- Phosphoproteomics using Tandem Mass Tags

### 1. Label-Free Quantification (LFQ)

@@ -105,10 +106,11 @@ Additional subsections are included to cover further topics for each flavour.
- [Incorporation rate testing](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/SILAC_incorporation.html)


## Extended materials

- [Phosphoproteomics using TMT](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho.html)



- [Phosphoproteomics statistical testing](https://mrctoxbioinformatics.github.io/Proteomics_data_analysis/Markdowns/TMT_phospho_stats.html)


## Additional resources
