---
title: "recount2 PLIER exploratory analyses"
output:
  html_notebook:
    toc: true
    toc_float: true
---
**J. Taroni 2018**

Pathway Level Information ExtractoR (**PLIER**) ([Mao, et al. _bioRxiv._ 2017.](https://doi.org/10.1101/116061)) is a framework that explicitly aligns latent variables (LVs) with prior knowledge in the form of (often curated) gene sets.
Comparisons of PLIER to other methods (e.g., sparse PCA) and other evaluations can be found in the PLIER preprint.

We're going to explore the [recount2](https://jhubiostatistics.shinyapps.io/recount/)
dataset and the corresponding [PLIER model](https://doi.org/10.6084/m9.figshare.5716033.v4).
(See [greenelab/rheum-plier-data](https://github.com/greenelab/rheum-plier-data/tree/4be547553f24fecac9e2f5c2b469a17f9df253f0)
for the processing code.)

We're interested in coming up with ways to characterize PLIER models (and eventually compare them).

## Functions
```{r}
`%>%` <- dplyr::`%>%`
# custom functions
source(file.path("util", "plier_util.R"))
```
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "02")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "02")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Load data
```{r}
# PLIER model
plier.results <- readRDS(file.path("data", "recount2_PLIER_data",
                                   "recount_PLIER_model.RDS"))
# data that was prepped for use with PLIER
recount.list <- readRDS(file.path("data", "recount2_PLIER_data",
                                  "recount_data_prep_PLIER.RDS"))
```
## U matrix
If the prior information coefficient matrix, _U_, has a low number of positive
entries for each LV, biological interpretation should be more straightforward.
This is one of the constraints in the PLIER model.

### All LVs

For each latent variable (i.e., not just those significantly associated with
prior information), what proportion of the pathways/gene sets have a positive
entry?
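
As a rough illustration of the quantity plotted below, here is a minimal sketch on a toy matrix. It assumes _U_ has gene sets in rows and LVs in columns; `u.toy` and its values are hypothetical, and this is not necessarily how `CalculateUSparsity` (defined in `util/plier_util.R`) computes it.

```r
# toy prior-information coefficient matrix: rows are gene sets, columns are LVs
# (hypothetical values, not taken from the recount2 model)
u.toy <- matrix(c(0.8, 0,   0,
                  0,   0,   0.3,
                  0.1, 0,   0,
                  0,   0.5, 0.2),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("geneset", 1:4), paste0("LV", 1:3)))

# proportion of gene sets with a positive entry, one value per LV
prop.pos.toy <- colMeans(u.toy > 0)
prop.pos.toy
```

A lower value for an LV means fewer gene sets contribute to it, which is the sense of "sparsity" used in this section.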
```{r}
num.lvs <- nrow(plier.results$B)
u.sparsity.all <- CalculateUSparsity(plier.results = plier.results,
                                     significant.only = FALSE)
ggplot2::ggplot(as.data.frame(u.sparsity.all),
                ggplot2::aes(x = u.sparsity.all)) +
  ggplot2::geom_density(fill = "blue", alpha = 0.5) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "proportion of positive entries in U") +
  ggplot2::ggtitle(paste("All LVs, n =", num.lvs))
```
```{r}
png.file <- file.path(plot.dir, "recount2_prop_pos_entries_U_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
```{r}
summary(u.sparsity.all)
```
### Significant pathways, only
What proportion of entries in the U matrix for each LV are significantly
associated with that LV?
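
One way to picture this calculation is sketched here, assuming a summary table with one row per (gene set, LV) pair. The `LV index` and `FDR` columns mirror those used later in this notebook; the `pathway` column name and all toy values are assumptions, and `CalculateUSparsity` may differ in its details.

```r
# hypothetical summary table: one row per (gene set, LV) association tested
summary.toy <- data.frame(pathway = c("geneset1", "geneset3", "geneset2",
                                      "geneset4", "geneset1"),
                          `LV index` = c("1", "1", "3", "3", "3"),
                          FDR = c(0.01, 0.20, 0.03, 0.04, 0.50),
                          check.names = FALSE, stringsAsFactors = FALSE)
n.genesets <- 4  # rows of the toy U matrix
n.lvs <- 3       # columns of the toy U matrix

# keep significant associations, then count them per LV (LVs with none get 0)
sig.toy <- summary.toy[summary.toy$FDR < 0.05, ]
sig.counts <- table(factor(sig.toy$`LV index`,
                           levels = as.character(seq_len(n.lvs))))

# proportion of gene sets significantly associated with each LV
prop.sig.toy <- as.vector(sig.counts) / n.genesets
prop.sig.toy
```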
```{r}
u.sparsity.sig <- CalculateUSparsity(plier.results,
                                     significant.only = TRUE,
                                     fdr.cutoff = 0.05)
ggplot2::ggplot(as.data.frame(u.sparsity.sig),
                ggplot2::aes(x = u.sparsity.sig)) +
  ggplot2::geom_density(fill = "blue", alpha = 0.5) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "proportion of positive entries in U") +
  ggplot2::ggtitle("Significant pathways only")
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_prop_pos_entries_U_significant.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
```{r}
summary(u.sparsity.sig)
```
## Pathway coverage
We're interested in how the LVs output from PLIER are related to the gene sets
input to PLIER.
```{r}
coverage.results <- GetPathwayCoverage(plier.results = plier.results)
```
**What proportion of the pathways input into PLIER are significantly associated
(FDR cutoff = 0.05) with LVs?**
```{r}
# Pathway coverage results
coverage.results$pathway
```
**What proportion of the PLIER LVs have a gene set associated with them?**
```{r}
# LVs
coverage.results$lv
```
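
The two coverage proportions above can be sketched on toy data as follows. This is a hypothetical illustration, not the implementation of `GetPathwayCoverage`; the `pathway` column name and the toy values are assumptions.

```r
# hypothetical summary table: one row per (gene set, LV) association tested
summary.toy <- data.frame(pathway = c("geneset1", "geneset3", "geneset2",
                                      "geneset4", "geneset1"),
                          `LV index` = c("1", "1", "3", "3", "3"),
                          FDR = c(0.01, 0.20, 0.03, 0.04, 0.50),
                          check.names = FALSE, stringsAsFactors = FALSE)
all.genesets <- paste0("geneset", 1:4)  # gene sets supplied to the toy model
n.lvs <- 3                              # LVs in the toy model

sig.toy <- summary.toy[summary.toy$FDR < 0.05, ]

# proportion of input gene sets significantly associated with at least one LV
pathway.coverage.toy <- length(unique(sig.toy$pathway)) / length(all.genesets)
# proportion of LVs significantly associated with at least one gene set
lv.coverage.toy <- length(unique(sig.toy$`LV index`)) / n.lvs

c(pathway = pathway.coverage.toy, lv = lv.coverage.toy)
```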
## Reconstruction of gene expression data
### All LVs
We reconstruct gene expression data from the gene loadings and LVs.
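
In matrix terms, the reconstruction is the product of the loadings (_Z_, genes by LVs) and the LV values (_B_, LVs by samples). A minimal sketch with hypothetical toy matrices (presumably `GetReconstructedExprs` does something along these lines, but its source is not shown here):

```r
# toy loadings: 3 genes x 2 LVs (hypothetical values)
z.toy <- matrix(c(1, 0,
                  0, 2,
                  1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("gene", 1:3), paste0("LV", 1:2)))
# toy LV values: 2 LVs x 4 samples (hypothetical values)
b.toy <- matrix(c(1, 2, 3, 4,
                  0, 1, 0, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(paste0("LV", 1:2), paste0("sample", 1:4)))

# reconstructed expression: genes x samples
recon.toy <- z.toy %*% b.toy
dim(recon.toy)
```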
```{r}
# reconstructed recount2 expression data from PLIER model
recount.recon <- GetReconstructedExprs(z.matrix = as.matrix(plier.results$Z),
                                       b.matrix = as.matrix(plier.results$B))
# write reconstructed expression to results
recon.mat.file <- file.path(results.dir,
                            "recount2_recount2_model_recon_exprs.RDS")
saveRDS(recount.recon, file = recon.mat.file)
# input expression data from intermediate file
recount.input.exprs <- recount.list$rpkm.cm
```
#### Reconstruction error
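
For intuition only, here is a sketch of one possible per-sample error of this kind, assuming MASE here stands for mean absolute scaled error. The scaling used below (mean absolute deviation of the true values from their mean) is an assumption; the definition actually used by `GetReconstructionMASE` lives in `util/plier_util.R` and may differ. `ToyMASE` and the toy matrices are hypothetical.

```r
# hypothetical helper, NOT the notebook's GetReconstructionMASE:
# mean absolute error, scaled by the mean absolute deviation of the true values
ToyMASE <- function(true.vec, recon.vec) {
  mae <- mean(abs(true.vec - recon.vec))
  scale.factor <- mean(abs(true.vec - mean(true.vec)))
  mae / scale.factor
}

# toy data: 4 genes x 2 samples (matrix() fills column by column)
true.toy <- matrix(c(1, 2, 3, 4,
                     2, 4, 6, 8),
                   nrow = 4,
                   dimnames = list(paste0("gene", 1:4),
                                   c("sampleA", "sampleB")))
recon.toy <- true.toy + 0.5  # every value is off by 0.5

# one error value per sample (column)
mase.toy <- vapply(seq_len(ncol(true.toy)),
                   function(i) ToyMASE(true.toy[, i], recon.toy[, i]),
                   numeric(1))
mase.toy
```

Under this scaling, the same absolute error counts for less in a sample whose expression values are more spread out.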
```{r}
# calculate reconstruction error (per sample)
recon.error <- GetReconstructionMASE(true.mat = recount.input.exprs,
                                     recon.mat = recount.recon)
# density plot
ggplot2::ggplot(as.data.frame(recon.error), ggplot2::aes(x = recon.error)) +
  ggplot2::geom_density(fill = "blue", alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Sample MASE",
                title = "Input vs. PLIER reconstructed recount2 data",
                subtitle = paste("All LVs, n =", num.lvs)) +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_recon_MASE_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
#### Spearman correlation (input, reconstructed)
Spearman correlation between input and reconstructed values was used as an
evaluation in [Cleary, et al.](https://doi.org/10.1016/j.cell.2017.10.023)
As noted in the `01-PLIER_util_proof-of-concept_notebook`:

> If correlation between the input and the reconstructed data is high, that
> suggests that reconstruction is "successful."

Given the different constraints in PLIER, we would not expect to perfectly
(`rho = 1`) reconstruct the input data.
This particular evaluation will be _most useful_ when we look at applying a
trained PLIER model to a test dataset.
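
A per-sample Spearman correlation can be sketched with base R's `cor()` as below. The toy matrices are hypothetical, and this is an illustration of the idea rather than the body of `GetReconstructionCorrelation`.

```r
# toy data: 4 genes x 2 samples (matrix() fills column by column)
true.toy <- matrix(c(1, 2, 3, 4,
                     4, 3, 2, 1),
                   nrow = 4,
                   dimnames = list(paste0("gene", 1:4),
                                   c("sampleA", "sampleB")))
recon.toy <- matrix(c(1.1, 1.9, 3.5, 3.0,  # top two genes swap ranks
                      1,   2,   3,   4),   # gene order fully reversed
                    nrow = 4, dimnames = dimnames(true.toy))

# correlate each sample's true and reconstructed values across genes
cor.toy <- vapply(seq_len(ncol(true.toy)),
                  function(i) cor(true.toy[, i], recon.toy[, i],
                                  method = "spearman"),
                  numeric(1))
cor.toy
```

Because Spearman correlation depends only on ranks, a reconstruction can score `rho = 1` while still being off in absolute terms, which is why we look at it alongside the reconstruction error.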
```{r}
# calculate correlation
recon.cor <- GetReconstructionCorrelation(true.mat = recount.input.exprs,
                                          recon.mat = recount.recon)
# density plot
ggplot2::ggplot(as.data.frame(recon.cor), ggplot2::aes(x = recon.cor)) +
  ggplot2::geom_density(fill = "blue", alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Sample Spearman Correlation",
                title = "Input vs. PLIER reconstructed recount2 data",
                subtitle = paste("All LVs, n =", num.lvs)) +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_recon_spearman_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
#### Relationship between error and correlation
We expect that samples that are highly correlated pre- and post-PLIER should
have low MASE.
```{r}
ggplot2::ggplot(as.data.frame(cbind(recon.cor, recon.error)),
                ggplot2::aes(x = recon.cor,
                             y = recon.error)) +
  ggplot2::geom_point(alpha = 0.2) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Spearman Correlation",
                y = "MASE",
                title = paste("All LVs, n =", num.lvs))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_error_cor_scatter_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
### Pathway-associated LVs, only
Here, we'll filter the _Z_ and _B_ matrices to only include LVs that are
significantly associated with a pathway/gene set that was supplied during the
training of the PLIER model.
We'll use an FDR cutoff of 0.05 (as we did for `CalculateUSparsity` above).
```{r}
plier.summary <- plier.results$summary
sig.summary <- plier.summary %>%
  dplyr::filter(FDR < 0.05)
sig.lvs <- unique(sig.summary$`LV index`)
```
```{r}
# drop columns (LVs) from Z that are not significantly associated with prior
# info; drop = FALSE keeps a matrix even if only one LV is retained
z.mat <- plier.results$Z
sig.z.mat <- z.mat[, as.integer(sig.lvs), drop = FALSE]
# drop rows (LVs) from B that are not significantly associated with prior info
b.mat <- plier.results$B
sig.b.mat <- b.mat[as.integer(sig.lvs), , drop = FALSE]
```
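
To see why the two subsets stay compatible, here is a small sketch on hypothetical toy matrices. Indexing the columns of _Z_ and the rows of _B_ with the same LV indices keeps the inner dimension matched, and `drop = FALSE` guards the edge case where a single LV passes the cutoff (otherwise R would return a plain vector instead of a matrix).

```r
# toy loadings (3 genes x 2 LVs) and LV values (2 LVs x 4 samples)
z.toy <- matrix(1:6, nrow = 3,
                dimnames = list(paste0("gene", 1:3), paste0("LV", 1:2)))
b.toy <- matrix(1:8, nrow = 2,
                dimnames = list(paste0("LV", 1:2), paste0("sample", 1:4)))

# suppose only LV 2 is significant; the index is stored as character in this
# toy example, so it is converted with as.integer() before subsetting
sig.lvs.toy <- "2"
idx <- as.integer(sig.lvs.toy)

sig.z.toy <- z.toy[, idx, drop = FALSE]   # 3 x 1 matrix, not a vector
sig.b.toy <- b.toy[idx, , drop = FALSE]   # 1 x 4 matrix, not a vector

# the reconstruction still has genes in rows and samples in columns
sig.recon.toy <- sig.z.toy %*% sig.b.toy
dim(sig.recon.toy)
```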
```{r}
# the reconstruction itself only with significant LVs
sig.recon <- GetReconstructedExprs(z.matrix = sig.z.mat,
                                   b.matrix = sig.b.mat)
# write to results
sig.recon.mat.file <-
  file.path(results.dir, "recount2_recount2_model_sig_lvs_recon_exprs.RDS")
saveRDS(sig.recon, file = sig.recon.mat.file)
```
#### Reconstruction error
```{r}
# calculate reconstruction error (per sample)
sig.recon.error <- GetReconstructionMASE(true.mat = recount.input.exprs,
                                         recon.mat = sig.recon)
```
#### Spearman correlation (input, reconstructed)
```{r}
# calculate correlation
sig.recon.cor <- GetReconstructionCorrelation(true.mat = recount.input.exprs,
                                              recon.mat = sig.recon)
```
#### Plotting
```{r}
# tidy format
recon.eval.df <-
  rbind(cbind(colnames(recount.input.exprs), recon.error, recon.cor,
              rep(paste("All, n =", num.lvs), length(recon.error))),
        cbind(colnames(recount.input.exprs), sig.recon.error, sig.recon.cor,
              rep(paste("Pathway-associated, n =", length(sig.lvs)),
                  length(sig.recon.error))))
colnames(recon.eval.df) <- c("Sample", "MASE",
                             "Spearman correlation",
                             "LVs used in reconstruction")
# cbind() above coerces everything to character, so convert the metrics back
# to numeric
recon.eval.df <-
  as.data.frame(recon.eval.df) %>%
  dplyr::mutate(MASE = as.numeric(as.character(MASE)),
                `Spearman correlation` =
                  as.numeric(as.character(`Spearman correlation`)))
recon.eval.file <- file.path(results.dir,
                             "recount2_recount2_model_recon_eval_df.tsv")
# output path passed positionally: the argument is named `path` in older readr
# releases and `file` from readr 1.4.0 onward
readr::write_tsv(recon.eval.df, recon.eval.file)
```
**MASE plot**
```{r}
# density plot
ggplot2::ggplot(recon.eval.df,
                ggplot2::aes(x = MASE, group = `LVs used in reconstruction`,
                             fill = `LVs used in reconstruction`)) +
  ggplot2::geom_density(alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::scale_fill_manual(values = c("white", "black")) +
  ggplot2::labs(title = "Input vs. PLIER reconstructed recount2 data") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_recon_MASE.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
**Correlation plot**
```{r}
# density plot
ggplot2::ggplot(recon.eval.df,
                ggplot2::aes(x = `Spearman correlation`,
                             group = `LVs used in reconstruction`,
                             fill = `LVs used in reconstruction`)) +
  ggplot2::geom_density(alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::scale_fill_manual(values = c("white", "black")) +
  ggplot2::labs(title = "Input vs. PLIER reconstructed recount2 data") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_recon_spearman.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 7, height = 5, units = "in")
```
**Scatterplot**
```{r}
ggplot2::ggplot(recon.eval.df,
                ggplot2::aes(x = `Spearman correlation`,
                             y = MASE,
                             color = `LVs used in reconstruction`,
                             group = `LVs used in reconstruction`)) +
  ggplot2::geom_point(alpha = 0.1) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Sample Spearman Correlation",
                y = "Sample MASE",
                title = "Input vs. PLIER reconstructed recount2 data") +
  ggplot2::scale_color_grey() +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
png.file <- file.path(plot.dir,
                      "recount2_recon_scatter.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
                width = 10, height = 5, units = "in")
```