The scpca-nf workflow includes the option to perform cell type annotation, as described in these instructions, with the reference-based method SingleR and/or the marker-gene-based method CellAssign.
Several reference datasets and marker gene lists have been pre-compiled for use.
This document provides instructions for creating and using your own reference dataset for SingleR annotation or marker gene list for CellAssign, as well as setting up scpca-nf to run with these custom references.
To perform cell type annotation with your own references, you will need to follow these steps:
- Create the custom reference(s), for use with SingleR and/or CellAssign
- Create a cell type reference metadata file
- Run the workflow with your custom references
Table of Contents
- Create custom cell type references
- Create a cell type reference metadata file
- Run the workflow with custom references
Your SingleR reference dataset should be a well-curated gene expression dataset that contains samples, or cells in the case of single-cell expression data, with known cell type annotations.
As described in the SingleR book, SingleR uses these reference expression values and associated labels to identify cells in your dataset with similar expression patterns.
The scpca-nf workflow requires a pre-trained SingleR model built from your selected reference dataset.
Therefore, you will need to install the SingleR package to prepare a reference model:
# install BiocManager, as needed, to install Bioconductor packages
# install.packages("BiocManager")
# install SingleR
BiocManager::install("SingleR")
Trained model files should be saved as RDS files and named <singler_reference_name>_model.rds, where <singler_reference_name> is a string of your choosing.
Create this file by taking the following steps:
- Select and format an appropriate reference dataset
- Train your model based on your reference dataset with SingleR::trainSingleR()
- Add additional fields to the established model object
- Export the trained model to a file named <singler_reference_name>_model.rds
Either bulk gene expression (microarray or RNA-seq) or single-cell datasets with annotations can be used as references. To the best of your ability, you should identify and use a reference dataset that contains cell types that you expect will also be present in your data. The reference dataset should be from the same organism as the sample you are analyzing.
Some resources you may find helpful for obtaining a suitable reference dataset are the Bioconductor packages celldex and scRNA-seq, and the Azimuth database.
Note that the SingleR reference datasets which scpca-nf has pre-compiled for use are a selected subset of references from the celldex package.
For use with SingleR, the reference dataset must be formatted as either a SummarizedExperiment or SingleCellExperiment object, where each column represents a sample or cell, and each row represents a gene.
Rows should be named with Ensembl gene IDs.
For more information on creating these types of objects, please see the following resources:
- Constructing a SummarizedExperiment object
- An introduction to the SingleCellExperiment class
- Converting Seurat objects or AnnData objects to SingleCellExperiment objects
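As a rough illustration of the expected layout, the sketch below builds a tiny SummarizedExperiment from a made-up expression matrix. The gene IDs, sample names, and labels are placeholders, and storing log-transformed values in an assay named logcounts (the assay SingleR reads by default) is an assumption of this example rather than a requirement stated by scpca-nf.

```r
library(SummarizedExperiment)

# Toy expression matrix: rows are genes (Ensembl-style IDs), columns are samples.
# All values and identifiers are placeholders for illustration only.
expr <- matrix(
  c(10, 0, 3,
     0, 8, 1,
     5, 5, 5,
     2, 0, 9),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(
    c("ENSG00000000001", "ENSG00000000002", "ENSG00000000004", "ENSG00000000006"),
    c("sample_1", "sample_2", "sample_3")
  )
)

# Known cell type label for each column of the matrix
sample_labels <- c("B cell", "T cell", "B cell")

# A simple log2(x + 1) transform stands in for properly normalized
# log-expression values, which a real reference should provide.
ref <- SummarizedExperiment(
  assays = list(logcounts = log2(expr + 1)),
  colData = DataFrame(label = sample_labels, row.names = colnames(expr))
)
```

With an object like this, ref$label could serve as the labels vector when training the model.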
To train your model, use the following code, where:
- ref is your SummarizedExperiment or SingleCellExperiment reference dataset object with columns as samples or cells and rows as genes.
- labels is a vector providing cell type labels for the samples or cells in the reference dataset. We encourage using Cell Ontology IDs for cell type labels when performing annotation. Ontology IDs provide a standardized terminology for cell types and also allow for more complex relationships between cell types. See the instructions for adding fields to the model object below for additional considerations if you choose to provide ontology IDs.
- restrict should be a vector of Ensembl gene IDs corresponding to the genes in the mapping reference used by scpca-nf. You can optionally use a subset of those genes if you want the model to focus on a particular set of genes.
singler_model <- SingleR::trainSingleR(
ref = <your reference SummarizedExperiment or SingleCellExperiment object>,
labels = <vector of labels for each sample or cell>,
restrict = <optionally, vector of gene ids that are in the mapping reference>
)
Once the model object is built, you can optionally add two fields to the model object:
- reference_label, a string with information about the types of labels you specified
- reference_name, a string with your reference name
If you choose to add this information, use the following code:
singler_model$reference_label <- "<name describing your labels>"
singler_model$reference_name <- "<name of your singler model>"
If you do not provide this information, the final SingleCellExperiment object produced by scpca-nf will have NULL values for the fields metadata(sce)$singler_reference_label and metadata(sce)$singler_reference_name, respectively, but no calculations will be affected.
If you choose to use ontology labels, you may wish to set the reference_label field to the string "label.ont" as shown below.
singler_model$reference_label <- "label.ont"
If you use "label.ont", you must also include a data frame named celltype_ontology_df in the SingleR model object, which contains two columns: ontology_id and ontology_cell_names.
The ontology_id column should contain ontology ID values corresponding to labels in your labels vector, and the ontology_cell_names column should contain the human-readable cell type names associated with each ontology ID (e.g., "B cell").
The ontoProc package from Bioconductor may be useful for compiling this information.
singler_model$celltype_ontology_df <- data.frame(
ontology_id = <vector of ontology ids corresponding to annotation labels>,
ontology_cell_names = <vector of matching cell type names>
)
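For a concrete picture of this table, the sketch below pairs two Cell Ontology IDs with their names and checks that every training label has an entry. The labels vector here is a hypothetical stand-in for the labels you passed to SingleR::trainSingleR().

```r
# Hypothetical labels vector that uses Cell Ontology IDs
labels <- c("CL:0000236", "CL:0000084", "CL:0000236")

# Lookup table pairing each ontology ID with a human-readable name
celltype_ontology_df <- data.frame(
  ontology_id = c("CL:0000236", "CL:0000084"),
  ontology_cell_names = c("B cell", "T cell")
)

# Every label used for training should have an entry in the table
missing_ids <- setdiff(unique(labels), celltype_ontology_df$ontology_id)
if (length(missing_ids) > 0) {
  stop("No cell type name for: ", paste(missing_ids, collapse = ", "))
}
```

A table built this way can then be assigned to singler_model$celltype_ontology_df.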
Finally, export the trained model to a file named <singler_reference_name>_model.rds:
model_file_name <- "<singler_reference_name>_model.rds"
readr::write_rds(singler_model, model_file_name)
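Before handing the file to the workflow, it may help to read it back and confirm it follows the conventions described above. The helper below is a hypothetical sketch, not part of scpca-nf or SingleR, and it assumes the saved model can be accessed like a list with $, as in the code above.

```r
# Hypothetical helper: check a saved model file against the naming and
# ontology-table conventions described in this document.
check_singler_model_file <- function(model_file) {
  if (!grepl("_model\\.rds$", basename(model_file))) {
    stop("File name should end in '_model.rds'")
  }
  model <- readRDS(model_file)

  # The ontology table is only required when labels are ontology IDs
  if (identical(model$reference_label, "label.ont")) {
    ontology_df <- model$celltype_ontology_df
    expected_cols <- c("ontology_id", "ontology_cell_names")
    has_table <- is.data.frame(ontology_df) &&
      all(expected_cols %in% colnames(ontology_df))
    if (!has_table) {
      stop(
        "A model labeled 'label.ont' needs a celltype_ontology_df with columns: ",
        paste(expected_cols, collapse = ", ")
      )
    }
  }
  invisible(TRUE)
}
```

Calling check_singler_model_file(model_file_name) stops with an error if either convention is not met and returns TRUE invisibly otherwise.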
The CellAssign reference file is created by converting a list of marker genes for a set of cell type labels into a binary matrix with values of 0 and 1.
This matrix should have all possible cell types as the columns and all possible genes, represented as Ensembl gene IDs, as the rows.
Values of 1 indicate that the given gene is a marker gene for the given cell type, and values of 0 indicate that the gene is not a marker gene for the cell type.
When compiling this information, be sure to use Ensembl gene IDs from the same organism as the sample you are analyzing!
Some resources that you might find helpful for compiling marker gene lists include PanglaoDB, MSigDB, CellMarker, and singleCellBase.
Note that the CellAssign reference datasets which scpca-nf has pre-compiled for use were derived from marker gene sets in PanglaoDB.
CellAssign reference files should be saved as TSV files and named <cellassign_reference_name>.tsv, where <cellassign_reference_name> is a string of your choosing.
This TSV file should have the following columns and values:
- ensembl_id, which contains Ensembl gene IDs.
- One column for each cell type label to use for annotation. All values in this column should be 1 or 0, indicating whether that row's Ensembl gene ID is a marker gene for that cell type (1) or not (0).
- other, which contains all values set to 0. If CellAssign cannot confidently annotate a cell with one of the provided cell type labels, then it will assign the label of "other", indicating that no marker genes were specifically expressed in that cell.
Once this file is created, save it as <cellassign_reference_name>.tsv.
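One way to build such a file is sketched below in base R, starting from a named list of marker genes per cell type. The gene IDs, cell type names, and output file name are placeholders chosen for this example, and the file is written to a temporary directory only to keep the sketch self-contained.

```r
# Marker genes per cell type, as Ensembl-style gene IDs.
# These IDs and cell types are placeholders for illustration only.
marker_genes <- list(
  "B cell" = c("ENSG00000000001", "ENSG00000000002"),
  "T cell" = c("ENSG00000000002", "ENSG00000000004")
)

# All genes that are a marker for at least one cell type
all_genes <- sort(unique(unlist(marker_genes, use.names = FALSE)))

# One 0/1 column per cell type: 1 if the gene is a marker, 0 otherwise
marker_matrix <- vapply(
  marker_genes,
  function(genes) as.integer(all_genes %in% genes),
  FUN.VALUE = integer(length(all_genes))
)

reference_df <- data.frame(
  ensembl_id = all_genes,
  marker_matrix,
  other = 0L,          # the "other" column is all zeros
  check.names = FALSE  # keep spaces in cell type names such as "B cell"
)

# Stands in for <cellassign_reference_name>.tsv; use your own path in practice
output_file <- file.path(tempdir(), "example_markers.tsv")
write.table(
  reference_df,
  file = output_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)
```

Starting from a named list keeps the marker definitions readable, and the matrix is derived from it so the two cannot drift apart.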
When performing cell type annotation, scpca-nf requires a TSV metadata file with information about cell type references, as described in these instructions.
All libraries with the same scpca_project_id will use the same reference dataset for each of SingleR and CellAssign, respectively.
The project cell type metadata file should contain these five columns with the following information:
| column_id | contents |
|---|---|
| scpca_project_id | Project ID matching values in the run metadata file |
| singler_ref_name | The name of your SingleR model, which should match the <singler_reference_name> string you used when creating your SingleR reference RDS file. Use NA to skip SingleR annotation |
| singler_ref_file | The file name of your SingleR model, as <singler_reference_name>_model.rds. Use NA to skip SingleR annotation |
| cellassign_ref_name | The name of your CellAssign model, which should match the <cellassign_reference_name> string you used when creating your CellAssign reference TSV file. Use NA to skip CellAssign annotation |
| cellassign_ref_file | The file name of your CellAssign model, as <cellassign_reference_name>.tsv. Use NA to skip CellAssign annotation |
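As an illustration, the sketch below writes a one-project metadata file that uses a custom SingleR model and skips CellAssign. The project ID, reference name, and output file name are hypothetical values chosen for this example.

```r
# Example values only: the project ID and reference name are hypothetical
singler_reference_name <- "my_singler_ref"

celltype_metadata <- data.frame(
  scpca_project_id = "SCPCP000000",
  singler_ref_name = singler_reference_name,
  singler_ref_file = paste0(singler_reference_name, "_model.rds"),
  cellassign_ref_name = NA,  # NA skips CellAssign annotation for this project
  cellassign_ref_file = NA
)

# Written to a temporary directory here; use your own path in practice
metadata_file <- file.path(tempdir(), "project_celltype_metadata.tsv")
write.table(
  celltype_metadata,
  file = metadata_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE  # missing values are written as the string NA by default
)
```

Deriving singler_ref_file from the reference name keeps the two columns consistent with the <singler_reference_name>_model.rds naming rule.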
There are several parameter values you will have to set either in your configuration file or at the command line when running scpca-nf:
- params.singler_models_dir: The path/URI to the directory that contains your <singler_reference_name>_model.rds. This value should not include the file name itself. For example, if the path to your file is /path/to/reference/<singler_reference_name>_model.rds, only provide the value /path/to/reference. You can skip this parameter if you are not performing cell type annotation with SingleR.
- params.cellassign_reference_dir: The path/URI to the directory that contains your <cellassign_reference_name>.tsv. This value should not include the file name itself. For example, if the path to your file is /path/to/reference/<cellassign_reference_name>.tsv, only provide the value /path/to/reference. You can skip this parameter if you are not performing cell type annotation with CellAssign.
- params.project_celltype_metafile: The path/URI to your cell type metadata file, including the file name.
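Because the metadata file holds only file names while the directories are supplied separately, a mismatch between the two is easy to miss. The hypothetical helper below, which is not part of scpca-nf, lists any reference files named in the metadata that are not found in the given directories. It assumes local paths; remote URIs cannot be checked with file.exists().

```r
# Hypothetical pre-flight check: which reference files named in the cell type
# metadata file are missing from the directories given to the workflow?
find_missing_references <- function(metadata_file,
                                    singler_models_dir,
                                    cellassign_reference_dir) {
  # NA entries (annotation skipped) are read as missing values
  metadata <- read.delim(metadata_file, stringsAsFactors = FALSE)

  singler_files <- unique(metadata$singler_ref_file)
  singler_files <- singler_files[!is.na(singler_files)]

  cellassign_files <- unique(metadata$cellassign_ref_file)
  cellassign_files <- cellassign_files[!is.na(cellassign_files)]

  expected <- c(
    file.path(singler_models_dir, singler_files),
    file.path(cellassign_reference_dir, cellassign_files)
  )

  # Return only the paths that do not exist
  expected[!file.exists(expected)]
}
```

An empty result means every reference file listed in the metadata was found where the workflow will be told to look.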
When running the workflow, be sure to use the --perform_celltyping flag to perform cell type annotation.
You can run the workflow with the command below. This example specifies all parameters at the command line; any parameter already set in your configuration file can be omitted from the command.
nextflow run AlexsLemonade/scpca-nf \
  --singler_models_dir <directory/containing/singler/model/file> \
  --cellassign_reference_dir <directory/containing/cellassign/reference/file> \
  --project_celltype_metafile <path/to/celltype metafile> \
  --perform_celltyping
As described in these instructions, cell type annotation is, by default, not repeated if results already exist.
scpca-nf determines if results already exist based on two criteria: i) the file name of the provided reference file has not changed, and ii) intermediate checkpoint files from previous runs of cell typing already exist.
For example, if you re-run the workflow with a renamed but unchanged reference file, cell type annotation will be repeated.
To force cell type annotation to repeat, for example if you retain the same reference file names but change their contents, use the --repeat_celltyping flag as well as the --perform_celltyping flag at runtime:
nextflow run AlexsLemonade/scpca-nf \
  --singler_models_dir <directory/containing/singler/model/file> \
  --cellassign_reference_dir <directory/containing/cellassign/reference/file> \
  --project_celltype_metafile <path/to/celltype metafile> \
  --perform_celltyping \
  --repeat_celltyping