-
Notifications
You must be signed in to change notification settings - Fork 10
/
README.Rmd
127 lines (98 loc) · 5.63 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# CDSeq
<!-- badges: start -->
[![Travis build status](https://travis-ci.com/kkang7/CDSeq_R_Package.svg?branch=master)](https://travis-ci.com/kkang7/CDSeq_R_Package)
<!-- badges: end -->
CDSeq is a complete deconvolution method for dissecting bulk RNA-Seq data. The input of CDSeq is, ideally, bulk RNA-Seq read counts (similar to the input format required by DESeq2), and CDSeq will estimate, simultaneously, the cell-type-specific gene expression profiles and the sample-specific cell-type proportions, no reference of pure cell line GEPs or scRNAseq reference is needed for running CDSeq.
For example, if you have a bulk RNA-Seq data, a G by M matrix **A**, which is a G by M matrix. G denotes the number of genes and M is the sample size, then CDSeq will output **B** (a G by T matrix) and **C** (a T by M matrix), where T is the number of cell types, **B** is the estimate of cell-type-specific GEPs and **C** is the estimate of sample-specific cell-type proportions.
Importantly, you can ask CDSeq to estimate the number of cell types, i.e. T, by providing a vector of possible integer values for T. For example, if the user input for T is a vector, i.e. $T=\{2,3,4,5,6\}$, then CDSeq will estimate the most likely number for T.
## Installation
You can install the released version of CDSeq from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("CDSeq")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package")
```
build the vignette with
``` r
# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package", build_vignettes = TRUE)
```
## Known issue about MacOS installation
It is possible for Mac users to run into some errors when install from source due to problems of Rcpp compiler tools. Follow the instruction here may help: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/
## Example
Load package
```{r, echo=TRUE, eval=TRUE}
library(CDSeq)
```
### When the number of cell types is a scalar
```{r example 1, echo=TRUE ,eval=FALSE, cache=FALSE, results='hide'}
## basic example code
result1<-CDSeq(bulk_data = mixtureGEP,
cell_type_number = 6,
mcmc_iterations = 5, # increase the mcmc_iterations to 700 or above
gene_length = as.vector(gene_length),
reference_gep = refGEP, # gene expression profile of pure cell lines
cpu_number = 1)
```
### When the number of cell types is a vector
The *cell_type_number* can also be a vector which contains different integer values. CDSeq will perform estimation for each integer in the vector and estimate the number of cell types in the mixtures. For example, one can set *cell_type_number = 2:10* as follows, and CDSeq will estimate the most likely number of cell types from 2 to 10.
```{r example 2, echo=TRUE ,eval=FALSE, cache=FALSE, results='hide'}
result2<-CDSeq(bulk_data = mixtureGEP,
cell_type_number = 2:10,
mcmc_iterations = 5,
dilution_factor = 1,
block_number = 1,
gene_length = as.vector(gene_length),
reference_gep = refGEP, # gene expression profile of pure cell lines
cpu_number = 1, # use multiple cores to save time. Set the cpu_number = length(cell_type_number) if there is enough cores.
print_progress_msg_to_file = 0)
```
### Use single cell to annotate CDSeq-estimated cell types
```{r example 4, echo=TRUE, eval=FALSE, cache=FALSE, results='hide'}
cdseq.result <- CDSeq::CDSeq(bulk_data = pbmc_mix,
cell_type_number = seq(3,12,3),
beta = 0.5,
alpha = 5,
mcmc_iterations = 700,
cpu_number = 4,
dilution_factor = 10)
cdseq.result.celltypeassign <- cellTypeAssignSCRNA(cdseq_gep = cdseq.result$estGEP, # CDSeq-estimated cell-type-specific GEPs
cdseq_prop = cdseq.result$estProp, # CDSeq-estimated cell type proportions
sc_gep = sc_gep, # PBMC single cell data
sc_annotation = sc_annotation,# PBMC single data annotations
sc_pt_size = 3,
cdseq_pt_size = 6,
seurat_nfeatures = 100,
seurat_npcs = 50,
seurat_dims=1:5,
plot_umap = 1,
plot_tsne = 0)
```
## Setting recommendations
We provide recommendations for parameter settings. Note that these recommendations are merely emperical and there is no theoretical justifications yet. User can tune the parameters based on specific applications and domain knowledges.
Parameters Recommended setting
--------------- ----------------------
beta 0.5
alpha 5
mcmc_iteration 700-2000
dilution_factor 2-10
gene_subset_size 200-500
block_number >5
Check vignette for more details and examples: *browseVignettes("CDSeq")*.
## Contact
email: [email protected]