Evaluate CellAssign metrics #231
The code looks good to me overall. There's some spots (beyond where I commented) where code could be simplified, but I don't think it's really worth the effort for this PR - what you have is totally fine for the scope/goal of what we're trying to accomplish here!
I mostly have high-level feedback:
- Can you add this notebook to celltype_annotation/README.md?
- A couple notebook organizational items:
  - Can you add some sub-headers into the notebook within each section to distinguish between blood & rms?
  - If you feel like it, you can turn off messages for the readr::read_stuff() chunks...so much garbage in HTML.
  - I'd suggest some reorganization to have sections go in order. Imo, it helps to get a sense of the actual "raw" data itself before "massaging" it into delta median.
    - probability scores
    - marker gene expression
    - delta median
- Can you add some plot titles to the delta median section? Hard to tell which is which reference for blood in particular.
- For the probability score plots, can you add a text note that the black points are the medians?
- In general for plots with an "other" cell type category, I think it would be good to keep this level at the end, but that might be too much work for the scope/needs of this PR. So, you decide!
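For the "other" level suggestion, here is a minimal base R sketch of one way it could be done. The helper name `move_level_last()` and the toy labels are hypothetical, not taken from the notebook; `forcats` offers a similar releveling helper if the tidyverse is preferred.

```r
# Toy cell type labels (made up for illustration)
celltypes <- c("other", "B cell", "T cell", "other", "Monocyte")

# Hypothetical helper: move one level to the end of a factor's level order,
# so it is drawn last on a discrete ggplot2 axis or legend
move_level_last <- function(x, last_level = "other") {
  x <- as.factor(x)
  lvls <- levels(x)
  # all other levels keep their order; last_level (if present) goes to the end
  factor(x, levels = c(setdiff(lvls, last_level), intersect(lvls, last_level)))
}

celltype_factor <- move_level_last(celltypes)
levels(celltype_factor)
```

The values themselves are unchanged; only the level order (and therefore the plotting order) is affected.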
Finally, I started thinking what other quantities might be good to explore, preferably in a subsequent PR since this one is already quite large! All of these are open to discussion..
- <Highest cell type probability> - <2nd highest cell type probability> for that given cell
- <Highest cell type probability> - <sum of all other cell type probabilities> for that given cell
- <Highest cell type probability> - <median cell type probability> for that given cell
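As a rough sketch of what these three quantities could look like in code, assuming the CellAssign output is available as a cells-by-cell-types probability matrix. The matrix values, the function name `confidence_deltas()`, and the column names are all made up for illustration. The median here is taken across all cell types, including the top one, which is an assumption about the intended definition.

```r
# Toy probability matrix: rows are cells, columns are cell types (made-up values)
probs <- matrix(
  c(0.90, 0.05, 0.05,
    0.40, 0.35, 0.25,
    0.50, 0.50, 0.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cell1", "cell2", "cell3"), c("B_cell", "T_cell", "other"))
)

# Hypothetical helper computing the three per-cell differences listed above
confidence_deltas <- function(prob_matrix) {
  top <- apply(prob_matrix, 1, max)
  second <- apply(prob_matrix, 1, function(p) sort(p, decreasing = TRUE)[2])
  data.frame(
    cell = rownames(prob_matrix),
    # ties are broken by taking the first column with the maximum value
    top_celltype = colnames(prob_matrix)[max.col(prob_matrix, ties.method = "first")],
    delta_second = top - second,                         # highest - 2nd highest
    delta_rest = top - (rowSums(prob_matrix) - top),     # highest - sum of all others
    delta_median = top - apply(prob_matrix, 1, median)   # highest - median (assumed: over all types)
  )
}

deltas <- confidence_deltas(probs)
deltas
```

Note that when probabilities sum to 1 per cell, `delta_rest` is just `2 * top - 1`, so it carries the same information as the top probability itself.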
color_group = "celltype"){

  # sina plot of delta median score for cellassign
  # color_group is on the x-axis and used for calculating stats
  delta_plot <- ggplot(celltype_results, aes(x = !!rlang::sym(color_group), y = median_delta)) +
Just noting (and I think I'm right this time 😉) that if you make this argument not a string, you can use {{}} as in aes(x = {{color_group}}).
This is a small general comment though, no need to overhaul the code for something like this. If/when we move things over to scpca-nf, we can make more of those kinds of decisions then.
I swear I actually tried that first and it wasn't working... but I can try again!
Definitely don't have to try! I'm a little curious what you tried that wasn't working though, since that is definitely how {{}} is supposed to be used. For example, here's the first blog post that came up when I googled it: https://www.njtierney.com/post/2019/07/06/jq-bare-vars/
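To make the two styles discussed in this thread concrete, here is a small sketch. The function names and the toy data frame are hypothetical, and `geom_point()` stands in for the sina layer so the example only needs ggplot2 (with rlang, which ggplot2 depends on).

```r
library(ggplot2)

# Toy stand-in for celltype_results; values are made up
celltype_results <- data.frame(
  celltype = rep(c("B cell", "T cell"), each = 3),
  median_delta = c(0.8, 0.7, 0.9, 0.2, 0.3, 0.1)
)

# Style 1: column name passed as a string, converted with rlang::sym()
plot_delta_string <- function(df, color_group = "celltype") {
  ggplot(df, aes(x = !!rlang::sym(color_group), y = median_delta)) +
    geom_point()
}

# Style 2: column name passed bare (unquoted), forwarded with {{ }}
plot_delta_bare <- function(df, color_group) {
  ggplot(df, aes(x = {{ color_group }}, y = median_delta)) +
    geom_point()
}

p_string <- plot_delta_string(celltype_results, color_group = "celltype")
p_bare <- plot_delta_bare(celltype_results, color_group = celltype)
```

Both calls map x to the `celltype` column. One thing to watch for: `{{ }}` only works this way when the caller passes a bare column name. Passing a quoted string to the `{{ }}` version maps x to a constant string rather than to the column.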
Thank you so much for looking at this! I did quite a bit of reorganization and also changed up some plots, because I wasn't a fan.
Also in regards to the comment about ordering and cleaning up the plots, I think we should save that for the actual QC report. Here's an updated version of the notebook:
+1 for ridge plots! I agree they are much clearer. I wouldn't spend the time here, but want to note that moving forward if/when we use these in the QC report, we'll want to adjust height to avoid excessive overlap between rows. Something else I'm thinking for ridge plots is whether it's possible to cleanly and without adding excessive visual noise indicate the median probability for each cell type. The
I think I agree with you here. The plots are easier to interpret, but mostly they just tell us that CellAssign is doing what it is supposed to do! That's useful information for us, but not necessarily for a QC report for users.
Yup..!
Noting also that
I added a median line here and I think that looks good. I played around with trying to fix the overlap with
Here's the updated report:
Yeah, scale can be tricky (and possibly dependent on the data set...?), so very fair to wait on that! I think this PR is pretty much set as well, and again we can stylize plots further in the actual QC report. Approval coming up!
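For reference when this moves to the QC report, a minimal sketch of a ridge plot with a median line and reduced overlap. This assumes the ggridges package; the thread does not say which package or settings the notebook actually uses, and the data below are simulated.

```r
library(ggplot2)
library(ggridges)

# Simulated per-cell probabilities for three cell types (made-up data)
set.seed(1)
prob_df <- data.frame(
  celltype = rep(c("B cell", "T cell", "other"), each = 100),
  probability = c(rbeta(100, 8, 2), rbeta(100, 5, 5), rbeta(100, 2, 8))
)

ridge_plot <- ggplot(prob_df, aes(x = probability, y = celltype)) +
  geom_density_ridges(
    quantile_lines = TRUE,
    quantiles = 2,   # split into 2 quantile groups, i.e. draw a line at the median
    scale = 0.9      # values below 1 keep each ridge under the baseline of the next
  ) +
  theme_ridges()
```

A fixed `scale` below 1 avoids overlap regardless of the data set, at the cost of flatter ridges when one cell type has a very sharp peak.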
🚀
Closes #226
This PR adds some analysis and exploration into metrics and/or plots to use for evaluating cell-type assignments obtained from CellAssign. I looked at the blood sample we previously annotated using references with/without B-cells. In the first notebook, we showed that CellAssign appropriately assigns cells to B-cells with the full reference and to "other" when B-cells are pulled from the reference. I thought this would be a good positive control for looking at metrics.

I also compared those results to using a more "real" dataset and reference with the RMS data. I compared the approach we plan to use with PanglaoDB (muscle + immune cell types) to the cell type assignment using marker genes identified from the same dataset. Here I view the marker gene reference as being as close to a "positive" control as I think we will get.
Here's what I looked at:
SingleR. CellAssign models marker gene expression and looks for cells with high expression of the specified marker genes.

In conclusion, plotting the probabilities (or the delta median) would probably be helpful when comparing the appropriateness of references. However, when we are looking at just a single reference and looking across cell types, I'm not totally sure how helpful this is. In an ideal world, I would want to see higher scores for cell types that CellAssign is more confident in, but I don't think that's the case.
Here's a copy of the rendered report:
04-cell-assign-delta-median.nb.html.zip