Evaluate CellAssign metrics #231

Merged: 9 commits, Sep 1, 2023

allyhawkins
Member

Closes #226

This PR adds some analysis and exploration into metrics and/or plots to use for evaluating cell-type assignments obtained from CellAssign. I looked at the blood sample we previously annotated using references with and without B-cells. In the first notebook, we showed that CellAssign appropriately assigns cells to B-cells with the full reference and to "other" when B-cells are pulled from the reference. I thought this would be a good positive control for looking at metrics.

I also compared those results to a more "real" dataset and reference, using the RMS data. I compared the approach we plan to use with PanglaoDB (muscle + immune cell types) to cell type assignment using marker genes identified from the same dataset. Here I view the marker gene reference as about as close to a "positive" control as I think we will get.

Here's what I looked at:

  • The median delta score, which is calculated by taking the top prediction score from CellAssign and subtracting the median prediction score for each cell. This is the same approach we are using for SingleR.
  • The distribution of the prediction scores across reference types and cell types. Here the prediction score is a probability, so I like seeing high probabilities associated with the assigned cell type.
  • The expression of marker genes found in the reference across all cell types. Here, I would expect marker genes to be highly expressed in cells assigned to that specific cell type and have lower expression in other cell types. I would think that cell types are less reliable when you see a mix of high expression of marker genes across cells that get assigned to multiple cell types. I think this is easier to interpret when we have a few cell types to choose from, as the plots can get a little overcrowded. Also, I recognize that there won't be a linear relationship between marker gene expression and cell type assignment, but CellAssign models marker gene expression and looks for cells with high expression of the specified marker genes.
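
As a rough illustration of the median delta described in the first bullet, here is a minimal base R sketch. The `predictions` matrix and the `delta_median()` helper are hypothetical and the values are made up; this is not the notebook's code, only the arithmetic (top prediction score minus the median prediction score, per cell).

```r
# Hypothetical CellAssign prediction matrix: one row per cell, one column per
# cell type in the reference. Values are made up; each row sums to 1.
predictions <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.40, 0.35, 0.15, 0.10,
    0.25, 0.25, 0.25, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("cell_1", "cell_2", "cell_3"),
    c("B cell", "T cell", "Monocyte", "other")
  )
)

# Delta median for one cell: top prediction score minus the median score.
delta_median <- function(scores) {
  max(scores) - median(scores)
}

# Apply row-wise to get one value per cell.
median_delta <- apply(predictions, 1, delta_median)
# cell_1: 0.90 - median(0.02, 0.03, 0.05, 0.90) = 0.90 - 0.04 = 0.86
```

A confidently assigned cell (`cell_1`) gets a value near 1, while a cell with flat probabilities (`cell_3`) gets 0.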

In conclusion, plotting the probabilities (or the delta median) would probably be helpful when comparing the appropriateness of references. However, when we are looking at just a single reference and looking across cell types, I'm not totally sure how helpful this is. In an ideal world, I would want to see higher scores for cell types that CellAssign is more confident in, but I don't think that's the case.

Here's a copy of the rendered report:
04-cell-assign-delta-median.nb.html.zip

Member

@sjspielman left a comment

The code looks good to me overall. There are some spots (beyond where I commented) where code could be simplified, but I don't think it's really worth the effort for this PR - what you have is totally fine for the scope/goal of what we're trying to accomplish here!

I mostly have high-level feedback:

  • Can you add this notebook to celltype_annotation/README.md?
  • A couple notebook organizational items:
    • Can you add some sub-headers into the notebook within each section to distinguish between blood & rms?
    • If you feel like it, you can turn off messages for the readr::read_stuff() chunks...so much garbage in HTML.
    • I'd suggest some reorganization to have sections go in order. Imo, it helps to get a sense of the actual "raw" data itself before "massaging" it into delta median.
        1. probability scores
        2. marker gene expression
        3. delta median
    • Can you add some plot titles to the delta median section? Hard to tell which is which reference for blood in particular.
    • For the probability score plots, can you add a text note that the black points are the medians?
    • In general for plots with an "other" cell type category, I think it would be good to keep this level at the end, but that might be too much work for the scope/needs of this PR. So, you decide!

Finally, I started thinking about what other quantities might be good to explore, preferably in a subsequent PR since this one is already quite large! All of these are open to discussion.

  • <Highest cell type probability> - <2nd highest cell type probability> for that given cell
  • <Highest cell type probability> - <sum of all other cell type probabilities> for that given cell
  • <Highest cell type probability> - <median cell type probability> for that given cell
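
To make the three candidate quantities above concrete, here is a small base R sketch. The `score_gaps()` helper and the toy `predictions` matrix are hypothetical and only meant to show the arithmetic. One thing worth noting: assuming the probabilities for a cell sum to 1, the second quantity reduces to `2 * top - 1`, so it carries the same information as the top probability alone.

```r
# Hypothetical per-cell probabilities (rows = cells, columns = cell types).
predictions <- matrix(
  c(0.90, 0.05, 0.03, 0.02,
    0.40, 0.35, 0.15, 0.10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("cell_1", "cell_2"),
    c("B cell", "T cell", "Monocyte", "other")
  )
)

# Compute the three proposed gaps for one cell's probability vector.
score_gaps <- function(p) {
  sorted <- sort(unname(p), decreasing = TRUE)
  c(
    top_minus_second = sorted[1] - sorted[2],
    # equals 2 * sorted[1] - 1 when the probabilities sum to 1
    top_minus_rest = sorted[1] - sum(sorted[-1]),
    top_minus_median = sorted[1] - median(p)
  )
}

# One row per cell, one column per candidate metric.
gaps <- t(apply(predictions, 1, score_gaps))
```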

Comment on lines 102 to 106
color_group = "celltype"){

# sina plot of delta median score for cellassign
# color_group is on the x-axis and used for calculating stats
delta_plot <- ggplot(celltype_results, aes(x = !!rlang::sym(color_group), y = median_delta)) +
Member

Just noting (and I think I'm right this time 😉 ) that if you make this argument not a string, you can use {{}} as in aes(x = {{color_group}}).
This is a small general comment though, no need to overhaul the code for something like this. If/when we move things over to scpca-nf, we can make more of those kinds of decisions then.
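
For reference, a minimal sketch of the `{{ }}` pattern described above. The `plot_delta()` wrapper and the toy `celltype_results` table are hypothetical, and `geom_jitter()` stands in for the sina layer used in the notebook; the point is only that the column is passed bare rather than as a string.

```r
library(ggplot2)

# Hypothetical results table; column names mirror the snippet under review.
celltype_results <- data.frame(
  celltype = rep(c("B cell", "T cell", "other"), each = 4),
  median_delta = c(0.91, 0.88, 0.95, 0.80,
                   0.42, 0.55, 0.61, 0.38,
                   0.12, 0.20, 0.09, 0.15)
)

# color_group is passed as a bare column name and embraced inside aes().
plot_delta <- function(results, color_group) {
  ggplot(results, aes(x = {{ color_group }}, y = median_delta)) +
    geom_jitter(width = 0.1, height = 0)
}

# Note: no quotes around celltype.
delta_plot <- plot_delta(celltype_results, color_group = celltype)
```

If the argument needs to stay a string, `aes(x = .data[[color_group]])` is the usual tidy-eval alternative to `!!rlang::sym(color_group)`.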

Member Author

I swear I actually tried that first and it wasn't working... but I can try again!

Member

Definitely don't have to try! I'm a little curious what you tried that wasn't working though, since that is definitely how {{}} is supposed to be used. For example, here's the first blog post that came up when I googled it: https://www.njtierney.com/post/2019/07/06/jq-bare-vars/

@allyhawkins
Member Author

Thank you so much for looking at this! I did quite a bit of reorganization and also changed up some plots, because I wasn't a fan.

  • For the probabilities, I am now using ridge plots. This suggestion came from a conversation with Jackie earlier when we were discussing how to show confidence in individual cell type assignments. Personally I really like this plot (particularly in the RMS dataset since that's a more realistic example). Each cell type is a separate line and then I plotted the entire distribution of probabilities associated with that cell type. Then I color the plot by whether or not the probability was associated with the final cell type. I'm having trouble with the wording on how exactly to explain this, but generally we are looking for a larger difference between the top probability and all others. I even wonder if we want to use something similar to this for SingleR?
  • I switched the marker gene expression plots to look at marker genes in the assigned cell type vs. all other cells. I think these are easier to interpret, although I'm not sure that they really help that much. I think the ridge plots are better.
  • I moved the median delta to the end and kept those pretty much the same. However, the high median delta across the board makes me not confident about using that.

Also in regards to the comment about ordering and cleaning up the plots, I think we should save that for the actual QC report.

Here's an updated version of the notebook:
04-cell-assign-delta-median.nb.html.zip

@sjspielman
Member

> For the probabilities, I am now using ridge plots.

+1 for ridge plots! I agree they are much clearer. I wouldn't spend the time here, but want to note that moving forward if/when we use these in the QC report, we'll want to adjust height to avoid excessive overlap between rows.
It also seems that they are only a good viz here when there are a lot of cell types; the blood plots are not particularly useful, and it's almost certainly because of the particulars of that dataset. Happily, I don't expect we'll run into this problem much, if ever, in the scpca data.

Something else I'm thinking about for ridge plots is whether it's possible to indicate the median probability for each cell type cleanly, without adding excessive visual noise. The ggridges approach would be to add this to the ggplot object: stat_density_ridges(quantile_lines = TRUE, quantiles = 2) (quantiles = 2 splits each distribution into two halves, so the one line drawn is the median). Can you try adding that to see how it looks, with the understanding that the answer might be "super bad"?
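
A minimal sketch of what that could look like, using simulated probabilities rather than the real CellAssign output. The `probabilities` data frame, its column names, and the cell type labels are all hypothetical.

```r
library(ggplot2)
library(ggridges)

set.seed(2023)

# Simulated probabilities, purely illustrative: one group skewed high,
# one skewed low, and one spread across the 0-1 range.
probabilities <- data.frame(
  celltype = rep(c("Myoblast", "T cell", "other"), each = 200),
  probability = c(rbeta(200, 8, 2), rbeta(200, 2, 8), rbeta(200, 2, 2))
)

# One ridge per cell type; quantiles = 2 draws a single line at each median.
ridge_plot <- ggplot(probabilities, aes(x = probability, y = celltype)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2, alpha = 0.6) +
  labs(x = "Prediction probability", y = "Cell type")
```

Because the line is computed within each ridge, it always falls inside that cell type's own distribution.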

> I switched the marker gene expression plots to look at marker genes in the assigned cell type vs. all other cells. I think these are easier to interpret, although I'm not sure that they really help that much.

I think I agree with you here. The plots are easier to interpret, but mostly they just tell us that CellAssign is doing what it is supposed to do! That's useful information for us, but not necessarily for a QC report for users.

> I moved the median delta to the end and kept those pretty much the same. However, the high median delta across the board makes me not confident about using that.

Yup..!

@sjspielman
Member

Noting also that ggridges is apparently a Seurat required dependency so it's already in the scpcaTools image.

@allyhawkins
Member Author

I added a median line here and I think that looks good. I played around with trying to fix the overlap with scale, but was having issues getting it to look right so I think we may want to do that when we get to adding it into the QC report.

Here's the updated report:
04-cell-assign-delta-median.nb.html.zip

@sjspielman
Member

Yeah, scale can be tricky (and possibly dependent on the data set...?), so very fair to wait on that!
In terms of the median line, I was actually thinking (but did not communicate this!) to show the overall median rather than one median per category (would have to override aes). But that would have looked super bad since there would have been a line right in the middle of 0-1 totally outside of any distribution. I think this median line in each group actually does look nice and gives at least some guidance about score distributions. Let's keep it!

I think this PR is pretty much set as well, and again we can stylize plots further in the actual QC report. Approval coming up!

Member

@sjspielman left a comment

🚀

Development

Successfully merging this pull request may close these issues.

Identify a similar metric to the median delta for CellAssign
2 participants