Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate CellAssign metrics #231

Merged
merged 9 commits into from
Sep 1, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions celltype_annotation/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@ In the notebook, cell type annotations using different sets of marker genes from

8. `copykat-rms-exploration.Rmd`: This notebook explores the usage of `CopyKAT` for identifying tumor vs. normal cells in Rhabdomyosarcoma samples from `SCPCP000005`.

9. `04-cell-assign-delta-median.Rmd`: This notebook examines ways to evaluate confidence in the cell type assignments that are obtained from using `CellAssign`.
The notebook uses the cell types annotated in `marker-gene-celltype-exploration.Rmd` ,`02-cell-assign-sarcoma-marker-genes.Rmd`, and `03-cell-assign-combined-markers.Rmd`.

## Scripts

This folder contains any scripts used for cell type annotation.
Expand Down
245 changes: 245 additions & 0 deletions celltype_annotation/analysis/04-cell-assign-delta-median.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,245 @@
---
title: "CellAssign: Evaluate delta median metric"
output:
html_notebook:
toc: true
toc_float: true
code_folding: "hide"
---

This notebook examines ways to evaluate confidence in the cell type assignments that are obtained from using `CellAssign`.
In particular, we would like to see if we can use a similar delta median metric that we are using for `SingleR`.
The delta median metric is calculated by taking the top score or prediction from `CellAssign` and subtracting the median prediction for each cell in the dataset.

This notebook will look at `CellAssign` results from previous analysis.
In particular, we will read in the results from running `CellAssign` with a blood cancer dataset (see [`marker-gene-celltype-exploration.Rmd`](marker-gene-celltype-exploration.Rmd)) and a sarcoma dataset (see [`02-cell-assign-sarcoma-marker-genes.Rmd`](02-cell-assign-sarcoma-marker-genes.Rmd) and [`03-cell-assign-combined-markers.Rmd`](03-cell-assign-combined-markers.Rmd)).

For the blood cancer dataset, we used a follicular lymphoma reference with and without B-cells (referred to as `all_cells` or `no_bcells`).
For the sarcoma dataset, we used either the combination of cell types from immune and muscle in PanglaoDB (`combined_muscle`) and a reference obtained from calculating the marker gene expression for each submitter assigned cell type (`rms_markers`).

A few things to keep in mind:

- Here the output of `CellAssign` is a prediction, instead of a score.
This prediction value can be equated to a probability and is on a 0-1 scale.
From the [`CellAssign` paper](https://doi.org/10.1038/s41592-019-0529-1):

> Using this information, CellAssign employs a hierarchical statistical framework to compute the probability that each cell belongs to the modeled cell types, while jointly estimating all model parameters using an expectation-maximization inference algorithm.
allyhawkins marked this conversation as resolved.
Show resolved Hide resolved

- `CellAssign` looks for high expression of indicated marker genes.
It cannot assign classes based on low or medium expression of a marker gene.
This means that high marker gene expression is required for `CellAssign` to assign cells to the given cell type.


## Set Up

```{r message=FALSE}
library(SingleCellExperiment)
library(ggplot2)

# source helper functions
function_file <- file.path("..", "utils", "cellassign-helper-functions.R")
source(function_file)
```

```{r message=FALSE}
# Set up paths
cellassign_results_dir <- file.path("cellassign_results")

# blood predictions
all_cells_preds_file <- file.path(cellassign_results_dir, "SCPCL000295_all_cells_predictions.tsv")
no_bcells_preds_file <- file.path(cellassign_results_dir, "SCPCL000295_no_bcells_predictions.tsv")
blood_preds_files <- list(
all_cells = all_cells_preds_file,
no_bcells = no_bcells_preds_file
)

# sarcoma predictions
rms_markers_preds_file <- file.path(cellassign_results_dir, "SCPCL000488_rms_broad_markers.tsv")
comb_muscle_preds_file <- file.path(cellassign_results_dir, "SCPCL000488_panglao_combined.tsv")
rms_preds_files <- list(
combined_muscle = comb_muscle_preds_file,
rms_markers = rms_markers_preds_file
)
```

```{r message=FALSE}
# read in prediction files
blood_predictions <- purrr::map(blood_preds_files, readr::read_tsv)
rms_predictions <- purrr::map(rms_preds_files, readr::read_tsv)
```

```{r message=FALSE}
# grab annotated sce objects for use later
data_dir <- file.path("..", "data")
blood_annotated_rds <- file.path(data_dir, "SCPCS000221", "SCPCL000295_annotated.rds")
rms_annotated_rds <- file.path(data_dir, "SCPCS000262", "SCPCL000488_annotated.rds")

# read in annotated sce
blood_sce <- readr::read_rds(blood_annotated_rds)
rms_sce <- readr::read_rds(rms_annotated_rds)
```


```{r message=FALSE}
# read in reference files
# we will use these later to look at marker genes specific to each reference
ref_dir <- file.path("..", "references")
blood_all_cells_ref <- readr::read_csv(file.path(ref_dir, "FL_celltype_ensembl.csv"))
blood_no_bcells_ref <- readr::read_csv(file.path(ref_dir, "FL_celltype_ensembl-noBcells.csv"))
combined_muscle_ref <- readr::read_csv(file.path(ref_dir, "panglao_combined_mtx.csv"))
rms_ref <- readr::read_csv(file.path(ref_dir, "rms_broad_markers_mtx.csv"))
```

## Probability scores

Because `CellAssign` directly outputs probabilities, rather than an arbitrary score, let's just look directly at the probabilities first.

Here we will plot the entire distribution of probabilities for each cell type.
For each cell type, we label which cells were assigned to the cell type indicatd on the x-axis with the `assigned_celltype` label.
We are looking for all cells with high probability of being assigned to a given cell type to be assigned to that cell type.
Additionally, the probability of them being assigned to a different cell type should be low.

For all of these plots the probability will be shown on the x-axis and the cell type on the y-axis.
The color indicates the whether or not the probability was the top probability used to assign the cell type.

```{r}
# get the cell type assignments
# based on cell type with the highest probability
blood_celltype <- purrr::map(blood_predictions, get_celltype_assignments) |>
dplyr::bind_rows(.id = "reference")

rms_celltype <- purrr::map(rms_predictions, get_celltype_assignments) |>
dplyr::bind_rows(.id = "reference")
```


### Blood probability

```{r}
# separate by cell type
plot_probability(blood_celltype,
blood_predictions)
```

### RMS probability


```{r fig.height=5}
# separate by cell type
plot_probability(rms_celltype,
rms_predictions)
```

Overall it appears that there is a larger difference between the distribution of probabilities for the non-assigned cell type and the assigned cell type in sarcoma when using the `rms_markers` reference over the combined muscle reference.
This fits with the expectation that the cell types found in `rms_markers` are more appropriate than the cell types in the muscle reference.

## Marker gene expression

Because `CellAssign` strictly looks for increased expression of defined marker genes to assign cell types, here I specifically look at the expression of the marker genes in each reference across the assigned cell types.
I would expect that the marker genes for a cell type would be highly expressed in cells that are assigned to that cell type and have lower expression in other cell types.

However, it is not a linear relationship between marker genes and cell type assignment, so not having high expression in all genes may be expected.
Looking at these plots may highlight cell types that `CellAssign` is less confident in.
If there are cases with increased gene expression of a set of marker genes in two cell types, then you may expect that defining those two cell types would be more difficult.

Here we will only look at the RMS marker gene expression because the blood reference only has a very small number of marker genes.

```{r fig.height = 10, warning=FALSE,message=FALSE}
# look at rms marker gene expression
plot_marker_gene_exp(ref_mtx = rms_ref,
celltype_assignments = rms_celltype,
ref_name = "rms_markers",
annotated_sce = rms_sce) +
labs(title = "RMS Marker Genes Reference")
```

Here we see that expression of marker genes for non-tumor cells (e.g., Fibroblasts and Vascular Endothelium) is pretty defined to only the assigned cell types.
In the tumor cells (Tumor_Myoblast and Tumor_Myocyte specifically), there is high expression of both marker gene sets.
I would associated this with some of the misassignment that is happening between those two cell classes.
(see `02-cell-assign-sarcoma-marker-genes.Rmd`)

```{r fig.height = 25,warning=FALSE,message=FALSE}
# look at combined muscle reference
plot_marker_gene_exp(ref_mtx = combined_muscle_ref,
celltype_assignments = rms_celltype,
ref_name = "combined_muscle",
annotated_sce = rms_sce) +
labs(title = "RMS Combined muscle Reference")
```

The trend of marker genes being higher in the assigned cell type is present, but definitely not as clear with this reference.
I don't think this helps us discern which cell types have been confidently assigned.

## Median Delta

Below we will calculate the difference between the top prediction and the median prediction for every cell.
The prediction scores that come from `CellAssign` are equivalent to probabilities.
The higher the median delta score, the larger the difference between the probability of the cell type assigned being correct and the probability of other cell types being correct.

We will compare the distribution of delta median scores across references used and cell types.
If a reference is more appropriate, we would expect a higher distribution of delta median scores.

### Blood median delta

```{r}
# calculate the blood median delta probability
blood_median_delta <- purrr::map(blood_predictions, get_median_delta_cellassign) |>
dplyr::bind_rows(.id = "reference")

# combine into a single data frame for plotting
blood_results_df <- dplyr::left_join(blood_celltype, blood_median_delta)
```

```{r}
# compare references to each other
plot_median_delta(blood_results_df, color_group = reference)
```

```{r}
# separate by cell type
plot_median_delta(blood_results_df)
```

### RMS median delta

```{r}
# median delta for rms data
rms_median_delta <- purrr::map(rms_predictions, get_median_delta_cellassign) |>
dplyr::bind_rows(.id = "reference")

# combine into a single data frame
rms_results_df <- dplyr::left_join(rms_celltype, rms_median_delta)
```

```{r}
# compare across references used
plot_median_delta(rms_results_df, color_group = reference)
```

```{r}
# compare across cell types
# here cell types are different based on the reference used so we will split these into two plots
split(rms_results_df, rms_results_df$reference) |>
purrr::map(plot_median_delta)
```

When comparing the distribution of delta median scores for each of the references, we do see a difference between the two references for each organ type.
For blood, it looks like there's a higher concentration of delta median scores closer to 1 with the all cells reference vs. the reference without B-cells.
For sarcoma, the reference produced from using marker genes specific to the dataset rather than PanglaoDB is more appropriate, which is what we would expect.

However, if we break it down by cell type, we generally see that the delta median scores are high across each cell type.
This means that when a cell is called a specific type, the probability of the cell type being the best assignment is high.

## Conclusions

- The distribution of the probability appears to be helpful in displaying confidence in cell type cells.
Cell types with a large difference between the distribution of probabilities associated with the cell type call and the other probabilities are more likely to be good assignments.
- Looking at the marker gene expression non-tumor cells appear to have more distinct expression of marker genes than tumor cells.
- Looking at the delta median prediction/probability score appears to be high across all cell types, regardless of quality of cell type assignment, at least in the libraries and references examined here.

## Session Info

```{r}
sessionInfo()
```

3,472 changes: 3,472 additions & 0 deletions celltype_annotation/analysis/04-cell-assign-delta-median.nb.html

Large diffs are not rendered by default.

Loading