Evaluate CellAssign metrics #231

allyhawkins · 2023-08-30T19:38:40Z

Closes #226

This PR adds some analysis and exploration into metrics and/or plots to use for evaluating cell-type assignments obtained from CellAssign. I looked at the blood sample we previously annotated using references with and without B-cells. In the first notebook, we showed that CellAssign appropriately assigns cells to B-cells with the full reference and to "other" when B-cells are pulled from the reference. I thought this would be a good positive control for looking at metrics.

I also compared those results to using a more "real" dataset and reference with the RMS data. I compared the approach we plan to use with PanglaoDB (muscle + immune cell types) to the cell type assignment using marker genes identified from the same dataset. Here I view the marker gene reference as being as close to a "positive" control as I think we will get.

Here's what I looked at:

- The median delta score, which is calculated by taking the top prediction score from CellAssign and subtracting the median prediction score for each cell. This is the same approach we are using for SingleR.
- The distribution of the prediction scores across reference types and cell types. Here the prediction score is a probability, so I like seeing high probabilities associated with the assigned cell type.
- The expression of marker genes found in the reference across all cell types. Here, I would expect marker genes to be highly expressed in cells assigned to that specific cell type and have lower expression in other cell types. I would think that cell types are less reliable when you see a mix of high expression of marker genes across cells that get assigned to multiple cell types. I think this is easier to interpret when we have a few cell types to choose from, as the plots can get a little overcrowded. Also, I recognize that there won't be a linear relationship between marker gene expression and cell type assignment, but CellAssign models marker gene expression and looks for cells with high expression of the specified marker genes.
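
For reference, the median delta score from the first item can be sketched in base R roughly as follows. This is only an illustration, not the notebook's code: the `probs` matrix, its cell names, and its values are made up, with one row per cell and one column per cell type.

```r
# Toy CellAssign-style probabilities (hypothetical values):
# one row per cell, one column per cell type, each row sums to 1
probs <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.40, 0.35, 0.15, 0.10,
    0.25, 0.25, 0.25, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    paste0("cell", 1:3),
    c("B cell", "T cell", "monocyte", "other")
  )
)

# Delta median per cell: top probability minus the median probability
delta_median <- apply(probs, 1, function(p) max(p) - median(p))
delta_median
```

A confidently assigned cell (`cell1`) gets a delta close to its top probability, while a cell with equal probabilities everywhere (`cell3`) gets a delta of 0.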

In conclusion, plotting the probabilities (or the delta median) would probably be helpful when comparing the appropriateness of references. However, when we are looking at just a single reference and looking across cell types, I'm not totally sure how helpful this is. In an ideal world, I would want to see higher scores for cell types that CellAssign is more confident in, but I don't think that's the case.

Here's a copy of the rendered report:
04-cell-assign-delta-median.nb.html.zip

sjspielman

The code looks good to me overall. There are some spots (beyond where I commented) where code could be simplified, but I don't think it's really worth the effort for this PR - what you have is totally fine for the scope/goal of what we're trying to accomplish here!

I mostly have high-level feedback:

Can you add this notebook to celltype_annotation/README.md?
A couple notebook organizational items:
- Can you add some sub-headers into the notebook within each section to distinguish between blood & rms?
- If you feel like it, you can turn off messages for the readr::read_stuff() chunks...so much garbage in HTML.
- I'd suggest some reorganization to have sections go in order. Imo, it helps to get a sense of the actual "raw" data itself before "massaging" it into delta median.
  1. probability scores
  2. marker gene expression
  3. delta median
- Can you add some plot titles to the delta median section? Hard to tell which is which reference for blood in particular.
- For the probability score plots, can you add a text note that the black point is the median?
- In general for plots with an "other" cell type category, I think it would be good to keep this level at the end, but that might be too much work for the scope/needs of this PR. So, you decide!

Finally, I started thinking about what other quantities might be good to explore, preferably in a subsequent PR since this one is already quite large! All of these are open to discussion.

- <Highest cell type probability> - <2nd highest cell type probability> for that given cell
- <Highest cell type probability> - <sum of all other cell type probabilities> for that given cell
- <Highest cell type probability> - <median cell type probability> for that given cell
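
To make these three candidates concrete, here is a minimal base R sketch. It assumes a cell-by-cell-type probability matrix; the `probs` object and its values are hypothetical and only meant to show how the quantities differ.

```r
# Toy probability matrix (hypothetical values): rows are cells, columns are cell types
probs <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.40, 0.35, 0.15, 0.10,
    0.25, 0.25, 0.25, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    paste0("cell", 1:3),
    c("B cell", "T cell", "monocyte", "other")
  )
)

# Highest and second-highest probability for each cell
top <- apply(probs, 1, max)
second <- apply(probs, 1, function(p) sort(p, decreasing = TRUE)[2])

# The three candidate quantities, one value per cell
top_minus_second <- top - second
top_minus_rest <- top - (rowSums(probs) - top)
top_minus_median <- top - apply(probs, 1, median)

cbind(top_minus_second, top_minus_rest, top_minus_median)
```

One thing this makes visible: if each cell's probabilities sum to 1, the second quantity is just `2 * top - 1`, so it carries the same information as the top probability itself.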

celltype_annotation/analysis/04-cell-assign-delta-median.Rmd

sjspielman · 2023-08-31T14:24:29Z

celltype_annotation/utils/cellassign-helper-functions.R

+                              color_group = "celltype"){
+
+  # sina plot of delta median score for cellassign
+  # color_group is on the x-axis and used for calculating stats 
+  delta_plot <- ggplot(celltype_results, aes(x = !!rlang::sym(color_group), y = median_delta)) +


Just noting (and I think I'm right this time 😉 ) that if you make this argument not a string, you can use {{}} as in aes(x = {{color_group}}).
This is a small general comment though, no need to overhaul the code for something like this. If/when we move things over to scpca-nf, we can make more of those kinds of decisions then.

I swear I actually tried that first and it wasn't working... but I can try again!

Definitely don't have to try! I'm a little curious what you tried that wasn't working though, since that is definitely how {{}} is supposed to be used. For example, here's the first blog post that came up when I googled it https://www.njtierney.com/post/2019/07/06/jq-bare-vars/
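
For anyone following along, a minimal sketch of the `{{}}` pattern in a plotting helper is below. The function name `plot_delta_median()`, the `x_group` argument, and the toy data are all made up for illustration, and `geom_jitter()` stands in for the sina layer used in the real helper; the point is only that the column is passed bare rather than as a string.

```r
library(ggplot2)

# Hypothetical helper: x_group is passed as a bare column name, not a string
plot_delta_median <- function(celltype_results, x_group) {
  ggplot(celltype_results, aes(x = {{ x_group }}, y = median_delta)) +
    geom_jitter(width = 0.2, height = 0)
}

# Toy results table (hypothetical values)
toy_results <- data.frame(
  celltype = c("B cell", "T cell", "B cell", "other"),
  median_delta = c(0.90, 0.80, 0.95, 0.40)
)

# Note: celltype is unquoted here
delta_plot <- plot_delta_median(toy_results, celltype)
```

If the argument has to stay a string (as in the current `color_group = "celltype"` signature), `!!rlang::sym(color_group)` is still a valid way to do it.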

celltype_annotation/analysis/04-cell-assign-delta-median.Rmd

Co-authored-by: Stephanie <[email protected]>

allyhawkins · 2023-08-31T19:38:20Z

Thank you so much for looking at this! I did quite a bit of reorganization and also changed up some plots, because I wasn't a fan.

For the probabilities, I am now using ridge plots. This suggestion came from a conversation with Jackie earlier when we were discussing how to show confidence in individual cell type assignments. Personally I really like this plot (particularly in the RMS dataset since that's a more realistic example). Each cell type is a separate line and then I plotted the entire distribution of probabilities associated with that cell type. Then I color the plot by whether or not the probability was associated with the final cell type. I'm having trouble with the wording on how exactly to explain this, but generally we are looking for a larger difference between the top probability and all others. I even wonder if we want to use something similar to this for SingleR?
I switched the marker gene expression plots to look at marker genes in the assigned cell type vs. all other cells. I think these are easier to interpret, although I'm not sure that they really help that much. I think the ridge plots are better.
I moved the median delta to the end and kept those pretty much the same. However, the high median delta across the board makes me not confident about using that.

Also in regards to the comment about ordering and cleaning up the plots, I think we should save that for the actual QC report.

Here's an updated version of the notebook:
04-cell-assign-delta-median.nb.html.zip

sjspielman · 2023-08-31T20:00:08Z

> For the probabilities, I am now using ridge plots.

+1 for ridge plots! I agree they are much clearer. I wouldn't spend the time here, but want to note that moving forward if/when we use these in the QC report, we'll want to adjust height to avoid excessive overlap between rows.
It also seems that they are only a good viz here when there are a lot of cell types; the blood plots are not particularly useful, and it's almost certainly because of the particulars with that dataset. Happily, I don't expect we'll run into this problem much, if ever, in the scpca data.

Something else I'm thinking for ridgeplots is whether it's possible to indicate the median probability for each cell type cleanly, without adding excessive visual noise. The ggridges approach would be to add this to the ggplot object: stat_density_ridges(quantile_lines = TRUE, quantiles = 2) (with quantiles = 2, the single line drawn is the median). Can you try adding that to see how it looks, with the understanding that the answer might be "super bad"?
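
Roughly what I have in mind, as a sketch on simulated data. The `toy_probs` data frame and its column names are made up, and the probabilities are random draws rather than CellAssign output.

```r
library(ggplot2)
library(ggridges)

# Simulated probabilities for three hypothetical cell types
set.seed(2023)
toy_probs <- data.frame(
  celltype = rep(c("B cell", "T cell", "other"), each = 50),
  probability = c(rbeta(50, 8, 2), rbeta(50, 5, 3), rbeta(50, 2, 5))
)

# One ridge per cell type; quantiles = 2 draws a single line at each ridge's median
ridge_plot <- ggplot(toy_probs, aes(x = probability, y = celltype)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2)
```

Because the line is computed within each ridge, every cell type gets its own median rather than one shared line across the plot.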

> I switched the marker gene expression plots to look at marker genes in the assigned cell type vs. all other cells. I think these are easier to interpret, although I'm not sure that they really help that much.

I think I agree with you here. The plots are easier to interpret, but mostly they just tell us that CellAssign is doing what it is supposed to do! That's useful information for us, but not necessarily for a QC report for users.

> I moved the median delta to the end and kept those pretty much the same. However, the high median delta across the board makes me not confident about using that.

Yup..!

sjspielman · 2023-08-31T20:02:44Z

Noting also that ggridges is apparently a Seurat required dependency so it's already in the scpcaTools image.

allyhawkins · 2023-09-01T15:08:50Z

I added a median line here and I think that looks good. I played around with trying to fix the overlap with scale, but was having issues getting it to look right so I think we may want to do that when we get to adding it into the QC report.

Here's the updated report:
04-cell-assign-delta-median.nb.html.zip

sjspielman · 2023-09-01T15:13:25Z

Yeah, scale can be tricky (and possibly dependent on the data set...?), so very fair to wait on that!
In terms of the median line, I was actually thinking (but did not communicate this!) to show the overall median rather than one median per category (would have to override aes). But that would have looked super bad since there would have been a line right in the middle of 0-1 totally outside of any distribution. I think this median line in each group actually does look nice and gives at least some guidance about score distributions. Let's keep it!

I think this PR is pretty much set as well, and again we can stylize plots further in the actual QC report. Approval coming up!

sjspielman

🚀

allyhawkins and others added 4 commits August 30, 2023 14:19

initial notebook exploring cell assign metrics

19f5422

add missing predictions and ref files

305693c

set figure size

ecc7e34

newline

811ba44

allyhawkins requested a review from sjspielman August 30, 2023 19:39

sjspielman reviewed Aug 31, 2023

allyhawkins and others added 4 commits August 31, 2023 11:38

Apply suggestions from code review

83c8e93

Co-authored-by: Stephanie <[email protected]>

add analysis notebook to readme

09929c8

reorganize and make some ridge plots

103ad54

use {{}} correctly

c7a7d52

allyhawkins requested a review from sjspielman August 31, 2023 19:38

allyhawkins mentioned this pull request Sep 1, 2023

Add SingleR delta median plot to QC report AlexsLemonade/scpca-nf#432

Merged

add median line

d434c7a

sjspielman approved these changes Sep 1, 2023

allyhawkins merged commit 03423c9 into main Sep 1, 2023
1 check passed

allyhawkins deleted the allyhawkins/delta-median-cellassign branch September 1, 2023 15:15

allyhawkins mentioned this pull request Sep 1, 2023

Add plot for CellAssign delta median score (or similar) to the cell type report AlexsLemonade/scpca-nf#411

Closed

sjspielman mentioned this pull request Sep 6, 2023

Celltype QC: Add CellAssign score distribution plot AlexsLemonade/scpca-nf#435

Closed
