forked from broadinstitute/2020_scWorkshop
-
Notifications
You must be signed in to change notification settings - Fork 0
/
15-citeseq-lab.Rmd
305 lines (230 loc) · 15.1 KB
/
15-citeseq-lab.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
---
title: "15-citeseq-lab.Rmd"
author: "Orr Ashenberg"
date: "3/26/2020"
output: html_document
---
# CITE-Seq
In this lab, we will look at how single cell RNA-seq and single cell protein expression measurement datasets can be jointly analyzed, as part of a CITE-Seq experiment. To learn more about how the antibody barcode matrix is computationally generated from the sequencing data, please visit [CITE-seq-Count](https://hoohm.github.io/CITE-seq-Count/). To learn more about CITE-Seq and feature barcoding, please visit the [CITE-seq site](https://cite-seq.com/).
This lab closely follows the official vignette available at [Using Seurat with multi-modal data](https://satijalab.org/seurat/multimodal_vignette.html).
## Load settings and packages
```{r setup_cite}
knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(Matrix)
library(ggplot2)
library(patchwork)
library(dplyr)
library(plyr)
# Set folder location for saving output files. This is also the same location as input data.
mydir <- "data/citeseq/"
setwd("/home/rstudio/materials/")
# Objects to save.
Rda.quickload.path <- paste0(mydir, "citeseq_quickload.Rda") # datasets saved as sparse objects
Rda.RNA.path <- paste0(mydir, "citeseq_RNA.Rda") # cbmc clustered using RNA
Rda.multi.path <- paste0(mydir, "citesq_cbmc_multi.rda") # cbmc clustered and ADT added as an assay
Rda.protein.path <- paste0(mydir, "citeseq_protein.Rda") # cbmc clustered using protein
```
## Load in the data
Here, we analyze a dataset of 8,617 cord blood mononuclear cells (CBMCs), produced with CITE-seq, where we simultaneously measure the single cell transcriptomes alongside the expression of 11 surface proteins, whose levels are quantified with DNA-barcoded antibodies. First, we load in two count matrices : one for the RNA measurements, and one for the antibody-derived tags (ADT).
```{r load_umi_mat, eval = FALSE}
# Load in the RNA UMI matrix
# Note that this dataset also contains ~5% of mouse cells, which we can use
# as negative controls for the protein measurements. For this reason, the
# gene expression matrix has HUMAN_ or MOUSE_ appended to the beginning of
# each gene.
cbmc.rna <- as.sparse(read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1))
cbmc.rna[20400:20403,1:2]
# To make life a bit easier going forward, we're going to discard all but
# the top 100 most highly expressed mouse genes, and remove the 'HUMAN_'
# from the CITE-seq prefix
cbmc.rna <- CollapseSpeciesExpressionMatrix(cbmc.rna, prefix = "HUMAN_", controls = "MOUSE_")
# Load in the ADT UMI matrix
cbmc.adt <- as.sparse(read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1))
# When adding multimodal data to Seurat, it's okay to have duplicate feature names. Each set of
# modal data (eg. RNA, ADT, etc.) is stored in its own Assay object. One of these Assay objects
# is called the 'default assay', meaning it's used for all analyses and visualization. To pull
# data from an assay that isn't the default, you can specify a key that's linked to an assay for
# feature pulling. To see all keys for all objects, use the Key function.
# Lastly, we observed poor enrichments for CCR5, CCR7, and CD10 - and therefore
# remove them from the matrix (optional)
cbmc.adt <- cbmc.adt[setdiff(rownames(x = cbmc.adt), c("CCR5", "CCR7", "CD10")), ]
# Look at structure of ADT matrix.
cbmc.adt[1:10,1:3]
# What fraction of cells in the ADT and RNA matrix overlap?
length(intersect(colnames(cbmc.rna), colnames(cbmc.adt))) / length(union(colnames(cbmc.rna), colnames(cbmc.adt)))
# Save current progress.
save(cbmc.rna, cbmc.adt, file = Rda.quickload.path)
```
## Setup a Seurat object, and cluster cells based on RNA expression
The steps below represent a quick clustering of the PBMCs based on the scRNA-seq data. For more detail on individual steps or more advanced options, see our PBMC clustering guided tutorial here
```{r cbmc_obj}
load(Rda.quickload.path)
cbmc <- CreateSeuratObject(counts = cbmc.rna)
# This code sub-samples the data in order to speed up calculations and not use too much memory.
# Idents(cbmc) <- "orig.ident"
# cbmc <- subset(cbmc, downsample = 2000, seed = 1)
# cbmc.adt <- cbmc.adt[, colnames(cbmc)]
# standard log-normalization
cbmc <- NormalizeData(cbmc)
# choose ~1k variable features
cbmc <- FindVariableFeatures(cbmc)
# standard scaling (no regression)
cbmc <- ScaleData(cbmc)
# Run PCA, select PCs for tSNE visualization and graph-based clustering
cbmc <- RunPCA(cbmc, verbose = FALSE)
ElbowPlot(cbmc, ndims = 50)
# Cluster the cells using the first 25 principal components.
cbmc <- FindNeighbors(cbmc, dims = 1:25)
cbmc <- FindClusters(cbmc, resolution = 0.8)
cbmc <- RunTSNE(cbmc, dims = 1:25, method = "FIt-SNE")
# Find the markers that define each cluster, and use these to annotate the
# clusters, we use max.cells.per.ident to speed up the process
cbmc.rna.markers <- FindAllMarkers(cbmc, max.cells.per.ident = 100, logfc.threshold = log(2), only.pos = TRUE, min.diff.pct = 0.3)
# Which cluster consists of mouse cells?
cbmc.rna.markers %>% filter(cluster == 5)
cbmc.rna.markers %>% filter(cluster == 13)
# Note, for simplicity we are merging two CD14+ Monocyte clusters (that differ in expression of
# HLA-DR genes) and NK clusters (that differ in cell cycle stage)
new.cluster.ids <- c("Memory CD4 T", "CD14+ Mono", "Naive CD4 T", "NK", "CD14+ Mono", "Mouse", "B", "CD8 T", "CD16+ Mono", "T/Mono doublets", "NK", "CD34+", "Multiplets", "Mouse", "Eryth", "Mk", "Mouse", "DC", "pDCs")
names(new.cluster.ids) <- levels(cbmc)
cbmc <- RenameIdents(cbmc, new.cluster.ids)
# Visualize clustering based on RNA.
DimPlot(cbmc, label = TRUE, reduction = "tsne") + NoLegend()
# Save current progress.
save(cbmc, cbmc.rna.markers, cbmc.adt, file = Rda.RNA.path)
# To load the data, run the following command.
# load(Rda.RNA.path)
```
## Add the protein expression levels to the Seurat object
Seurat v3.0 allows you to store information from multiple assays in the same object, as long as the data is multi-modal (collected on the same set of cells). You can use the `SetAssayData` and `GetAssayData` accessor functions to add and fetch data from additional assays.
```{r add_protein}
# We will define an ADT assay, and store raw counts for it
# If you are interested in how these data are internally stored, you can check out the Assay
# class, which is defined in objects.R; note that all single-cell expression data, including RNA
# data, are still stored in Assay objects, and can also be accessed using GetAssayData
cbmc[["ADT"]] <- CreateAssayObject(counts = cbmc.adt)
GetAssayData(cbmc, slot = "counts", assay = "ADT")[1:3,1:3]
cbmc[["ADT"]]@counts[1:3, 1:3]
# Now we can repeat the preprocessing (normalization and scaling) steps that we typically run
# with RNA, but modifying the 'assay' argument. For CITE-seq data, we do not recommend typical
# LogNormalization. Instead, we use a centered log-ratio (CLR) normalization, computed
# independently for each feature. This is a slightly improved procedure from the original
# publication, and we will release more advanced versions of CITE-seq normalizations soon.
cbmc <- NormalizeData(cbmc, assay = "ADT", normalization.method = "CLR")
cbmc <- ScaleData(cbmc, assay = "ADT")
```
## Visualize protein levels on RNA clusters
You can use the names of any ADT markers, (i.e. “adt_CD4”), in FetchData, FeaturePlot, RidgePlot, FeatureScatter, DoHeatmap, or any other visualization features
```{r viz_protein}
DefaultAssay(cbmc) <- "RNA"
# In this plot, protein (ADT) levels are on top, and RNA levels are on the bottom
FeaturePlot(cbmc, features = c("adt_CD3", "adt_CD11c", "adt_CD8", "adt_CD16", "CD3E", "ITGAX", "CD8A", "FCGR3A"), min.cutoff = "q05", max.cutoff = "q95", ncol = 4)
# How do the gene and protein expression levels compare to one another?
# Compare gene and protein expression levels for the other 6 antibodies.
FeaturePlot(cbmc, features = c("adt_CD4", "adt_CD45RA", "adt_CD56", "adt_CD14", "adt_CD19", "adt_CD34", "CD4", "PTPRC", "NCAM1", "CD14", "CD19", "CD34"), min.cutoff = "q05", max.cutoff = "q95", ncol = 6)
# Ridge plots are another useful visualization.
RidgePlot(cbmc, features = c("adt_CD3", "adt_CD11c", "adt_CD8", "adt_CD16"), ncol = 2)
# Draw ADT scatter plots (like biaxial plots for FACS). Note that you can even 'gate' cells if
# desired by using HoverLocator and CellSelector
FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3")
# HoverLocator(FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3"))
# CellSelector(FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3"))
# View relationship between protein and RNA
FeatureScatter(cbmc, feature1 = "adt_CD3", feature2 = "CD3E")
# Let's plot CD4 vs CD8 levels in T cells
tcells <- subset(cbmc, idents = c("Naive CD4 T", "Memory CD4 T", "CD8 T"))
FeatureScatter(tcells, feature1 = "adt_CD4", feature2 = "adt_CD8")
# Let's look at the raw (non-normalized) ADT counts. You can see the values are quite high,
# particularly in comparison to RNA values. This is due to the significantly higher protein copy
# number in cells, which significantly reduces 'drop-out' in ADT data
FeatureScatter(tcells, feature1 = "adt_CD4", feature2 = "adt_CD8", slot = "counts")
# If you look a bit more closely, you'll see that our CD8 T cell cluster is
# enriched for CD8 T cells, but still contains many CD4+ CD8- T cells. This
# is because Naive CD4 and CD8 T cells are quite similar transcriptomically,
# and the RNA dropout levels for CD4 and CD8 are quite high. This
# demonstrates the challenge of defining subtle immune cell differences from
# scRNA-seq data alone.
# What fraction of T cells are double negative in gene expression? (CD4- and CD8-)
# You can use an interactive plot to gate on the cells (do.identify = T) or use
# Boolean conditions on CD4 and CD8A expression to find double negative cells.
FeatureScatter(tcells, feature1 = "CD4", feature2 = "CD8A")
ncol(subset(tcells, subset = CD4 == 0 & CD8A == 0)) / ncol(tcells)
# What fraction of T cells are double negative in protein expression? (CD4- and CD8-)
# length(cells) / length([email protected])
DefaultAssay(tcells) <- "ADT" # work with ADT count matrix
FeatureScatter(tcells, feature1 = "adt_CD4", feature2 = "adt_CD8")
ncol(subset(tcells, subset = adt_CD4 < 1 & adt_CD8 < 1)) / ncol(tcells)
# Save current progress.
save(cbmc, file = Rda.multi.path)
# To load the data, run the following command.
# load(Rda.multi.path)
```
## Identify differentially expressed proteins between clusters
```{r ident_diff}
# Downsample the clusters to a maximum of 300 cells each (makes the heatmap easier to see for
# small clusters)
cbmc.small <- subset(cbmc, downsample = 300)
# Find protein markers for all clusters, and draw a heatmap
adt.markers <- FindAllMarkers(cbmc.small, assay = "ADT", only.pos = TRUE)
DoHeatmap(cbmc.small, features = unique(adt.markers$gene), assay = "ADT", angle = 90) + NoLegend()
# You can see that our unknown cells co-express both myeloid and lymphoid markers (true at the
# RNA level as well). They are likely cell clumps (multiplets) that should be discarded. We'll
# remove the mouse cells now as well
cbmc <- subset(cbmc, idents = c("Multiplets", "Mouse"), invert = TRUE)
```
## Cluster directly on protein levels
You can also run dimensional reduction and graph-based clustering directly on CITE-seq data
```{r clust_proteins}
# Because we're going to be working with the ADT data extensively, we're going to switch the
# default assay to the 'CITE' assay. This will cause all functions to use ADT data by default,
# rather than requiring us to specify it each time
DefaultAssay(cbmc) <- "ADT"
cbmc <- RunPCA(cbmc, features = rownames(cbmc), reduction.name = "pca_adt", reduction.key = "pca_adt_", verbose = FALSE)
DimPlot(cbmc, reduction = "pca_adt")
# Why do we not use PCA to do dimensionality reduction here?
# Is Euclidean distance a good distance metric in this case?
ElbowPlot(cbmc)
# Since we only have 10 markers, instead of doing PCA, we'll just use a standard euclidean
# distance matrix here. Also, this provides a good opportunity to demonstrate how to do
# visualization and clustering using a custom distance matrix in Seurat
adt.data <- GetAssayData(cbmc, slot = "data")
adt.dist <- dist(t(adt.data))
# Before we recluster the data on ADT levels, we'll stash the RNA cluster IDs for later
cbmc[["rnaClusterID"]] <- Idents(cbmc)
# Now, we rerun tSNE using our distance matrix defined only on ADT (protein) levels.
cbmc[["tsne_adt"]] <- RunTSNE(adt.dist, assay = "ADT", reduction.key = "adtTSNE_")
cbmc[["adt_snn"]] <- FindNeighbors(adt.dist)$snn
cbmc <- FindClusters(cbmc, resolution = 0.2, graph.name = "adt_snn")
# We can compare the RNA and protein clustering, and use this to annotate the protein clustering
# (we could also of course use FindMarkers)
clustering.table <- table(Idents(cbmc), cbmc$rnaClusterID)
clustering.table
new.cluster.ids <- c("CD4 T", "CD14+ Mono", "NK", "B", "CD8 T", "NK", "CD34+", "T/Mono doublets", "CD16+ Mono", "pDCs", "B")
names(new.cluster.ids) <- levels(cbmc)
cbmc <- RenameIdents(cbmc, new.cluster.ids)
tsne_rnaClusters <- DimPlot(cbmc, reduction = "tsne_adt", group.by = "rnaClusterID", pt.size = 0.5) + NoLegend()
tsne_rnaClusters <- tsne_rnaClusters + ggtitle("Clustering based on scRNA-seq") + theme(plot.title = element_text(hjust = 0.5))
tsne_rnaClusters <- LabelClusters(plot = tsne_rnaClusters, id = "rnaClusterID", size = 4)
tsne_adtClusters <- DimPlot(cbmc, reduction = "tsne_adt", pt.size = 0.5) + NoLegend()
tsne_adtClusters <- tsne_adtClusters + ggtitle("Clustering based on ADT signal") + theme(plot.title = element_text(hjust = 0.5))
tsne_adtClusters <- LabelClusters(plot = tsne_adtClusters, id = "ident", size = 4)
# Note: for this comparison, both the RNA and protein clustering are visualized on a tSNE
# generated using the ADT distance matrix.
wrap_plots(list(tsne_rnaClusters, tsne_adtClusters), ncol = 2)
# What differences if any do you see between the clustering based on scRNA-seq
# and the clustering based on ADT signal?
# How could we combine these datasets in a joint, integrative analysis?
# Save current progress.
save(cbmc, file = Rda.protein.path)
# To load the data, run the following command.
# load(Rda.protein.path)
```
The ADT-based clustering yields similar results, but with a few differences
- Clustering is improved for CD4/CD8 T cell populations, based on the robust ADT data for CD4, CD8, CD14, and CD45RA
- However, some clusters for which the ADT data does not contain good distinguishing protein markers (i.e. Mk/Ery/DC) lose separation
- You can verify this using FindMarkers at the RNA level, as well
## Additional exploration: another example of multi-modal analysis
For another nice example of multi-modal analysis, please explore this [single cell ATAC-Seq vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/cicero/inst/doc/website.html) and this [scATAC-seq data integration](https://satijalab.org/signac/articles/integration.html).
## Acknowledgements
This document is largely a tutorial from Seurat website, with some small modifications. The official vignette is available at [Using Seurat with multi-modal data](https://satijalab.org/seurat/multimodal_vignette.html).