Add SingleR delta median plot to QC report #432

Merged
sjspielman merged 13 commits into development from sjspielman/410-qc-singler-median-delta on Sep 6, 2023

Conversation

sjspielman
Member

Stacked on #427
Closes #410

This PR adds the delta median plot for SingleR. Implementation notes:

  • I refactored (bonus this is a pun 🎉) some of the earlier QC report code since we'll want to use the annotation factor order for this plot as well. I removed code that sets the factor order from the celltype tables, and instead added a helper function to accomplish this while setting up celltypes_df. This way, all downstream code that uses this data frame inherits these levels!
  • I made this overall section generally about assessing cell types, and placed it immediately after the cell type tables and before the section with UMAPs + heatmaps. This way, we get a sense of reliability before diving into plots.
  • I'm using a sina plot here, so I had to set a seed. For now, I just put a seed into qc_report.rmd, but it might be preferable to use whatever seed is used for the overall workflow and pass that in as a parameter to the report? I'm not sure how much this really matters though for this situation.
  • I wrapped the labels that are over 30 characters, and this seems to really help with plot layout.
  • Let me know any feedback about my description of delta median too!
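
The seed point above can be illustrated with a minimal sketch. It assumes the sina layout is driven by R's random number generator, so fixing the seed before drawing fixes the point positions. The helper `jitter_offsets()` and the seed value 2023 are hypothetical, and `runif()` only stands in for the random placement a sina plot performs internally.

```r
# Hypothetical helper: draw horizontal jitter offsets, as a stand-in for the
# random point placement that a sina plot performs internally.
jitter_offsets <- function(n, seed) {
  set.seed(seed) # fix the RNG state so the offsets are reproducible
  runif(n, min = -0.4, max = 0.4)
}

# Two "renders" with the same seed produce the same layout
first_render <- jitter_offsets(5, seed = 2023)
second_render <- jitter_offsets(5, seed = 2023)
same_layout <- identical(first_render, second_render)
```

Whether the seed is hardcoded in the report or passed in as a parameter, the effect on the plot is the same; the parameter route only matters for keeping one seed across the whole workflow.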

Report: qc_report.html.zip

Base automatically changed from sjspielman/409-qc-celltypes-umaps to development September 1, 2023 12:38
@allyhawkins
Member

Thank you for working on this and apologies if this is really annoying, but... I do wonder if we should be doing something similar to what I did with the ridge plot for CellAssign. We would plot just the score and label by the top score and then everything else and look at the separation. I don't know if it would work quite as well because it's a score and not a probability, but I think it's worth a shot.

@sjspielman
Member Author

sjspielman commented Sep 1, 2023

Thank you for working on this and apologies if this is really annoying, but... I do wonder if we should be doing something similar to what I did with the AlexsLemonade/sc-data-integration#231.

Not annoying at all! It's been a fun week spending lots of time plotting :) Let's see how it looks..

@sjspielman
Member Author

sjspielman commented Sep 1, 2023

Here's a super quick side-by-side (well, stacked) comparison of sina vs ridgeplot, just to get a sense:

On one hand, I do like the ridgeplot more, but on the other hand, I'm not entirely sure it's usable (this may go for CellAssign as well...!). For any categories that have <= 2 cells, nothing gets drawn, and that's just how the algorithm works: >=3 points are needed to estimate the distribution.

I wonder if there's a good middle ground we could achieve here, since I really do like the ridgeplot more... Would it make sense to show both sina + ridgeplot, and/or only show cell types with >=3 cells for the ridgeplot (we'd add text explaining this)?
Very curious to hear your thoughts!

Edit - also, I wonder if it makes sense to show unknown cell types in this plot? Is it meaningful to show "confidence" for something that was unclassified? I'm starting to think we should exclude those cells?
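
A minimal sketch of the filtering floated above: keep only cell types with at least 3 cells and leave out unclassified cells before drawing a ridge plot. The data are invented, and the column names, the `min_cells` threshold variable, and the "Unknown cell type" label are assumptions for illustration.

```r
# Toy data; the column names and the "Unknown cell type" label are assumptions
celltype_df <- data.frame(
  celltype = c(
    rep("T cell", 4), rep("B cell", 3), rep("NK cell", 2), "Unknown cell type"
  ),
  delta_median = c(0.31, 0.28, 0.35, 0.30, 0.12, 0.15, 0.11, 0.22, 0.25, 0.02)
)

# Count cells per label, then keep labels with enough cells for a density
# estimate, excluding the unclassified label
min_cells <- 3
cell_counts <- table(celltype_df$celltype)
keep_labels <- setdiff(
  names(cell_counts)[as.vector(cell_counts) >= min_cells],
  "Unknown cell type"
)

ridge_df <- celltype_df[celltype_df$celltype %in% keep_labels, ]

# Labels to mention in the explanatory text, since they will not be drawn
dropped_labels <- setdiff(names(cell_counts), keep_labels)
```

Keeping `dropped_labels` around makes it straightforward to tell readers which cell types were omitted and why.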

Screenshot 2023-09-01 at 2 25 23 PM

@jaclyn-taroni
Member

I do wonder if we should be doing something similar to what I did with the ridge plot for CellAssign. We would plot just the score and label by the top score and then everything else and look at the separation. I don't know if it would work quite as well because it's a score and not a probability, but I think it's worth a shot.

My interpretation of this comment was to plot the scores themselves, not the median delta values. Perhaps I got that wrong, but if we're going to plot the median delta, it's helpful to use a completely different style of plot IMO so folks know they're looking at something quite different.

@sjspielman
Member Author

My interpretation of this comment was to plot the scores themselves, not the median delta values.

Ah no, I think you are right! Let's see..

@allyhawkins
Member

My interpretation of this comment was to plot the scores themselves, not the median delta values.

Ah no, I think you are right! Let's see..

Yes Jackie is correct. I was thinking we would plot the actual scores themselves and then create a plot similar to the one below.
Screenshot 2023-09-01 at 2 54 59 PM

@sjspielman
Member Author

sjspielman commented Sep 1, 2023

I think we want to be mindful of over-discussing strategies in this PR, mostly because as comments build up, things will become harder to track & review. So, I'm going to open an issue that we can use to discuss visualization strategies, and then we can come back here to continue the PR.

Edit - issue for discussion opened in #434

@sjspielman
Member Author

As discussed in #434, I've updated this to still visualize delta_median, but highlighting points that were pruned out. I've updated text in the plot preamble to match what the plot currently shows. Note that this involved a decent bit of wrangling, since we need to plot based on the full labels, not the pruned labels, in order to color by whether a cell was pruned or not. The points are pretty small and possibly tricky to see, but I think this is inevitable when visualizing this many data points (or more!).

qc_report.html.zip

Screenshot 2023-09-06 at 10 03 56 AM

@sjspielman
Member Author

@allyhawkins, I can't re-request review here since only comments were left before, so this is my re-request ping :)

Member

@allyhawkins allyhawkins left a comment

This mostly looks good, I just had one clarifying question and a suggestion about adding a median point.

Comment on lines 183 to 184
new_levels <- levels(delta_median_df$celltype)
new_levels <- new_levels[-length(new_levels)]
Member

I'm a little confused what you are doing here? Do you need both or can you just use the first line without the second line since Unknown cell type shouldn't be here?

Member Author

Yeah, it's confusing! I realize it can be simplified too. I will add some comments. Here's what's happening:

  • Although there is no longer an "Unknown cell type" value in the data, that level still exists in the delta_median_df$celltype variable
  • This doesn't matter for plotting though! One could proceed to just plot, and the x-axis order would be fine. But, it does matter if I want to wrap the labels, since cell type names are very long.
  • So, this code was setting up to wrap the labels while also getting rid of the Unknown level.
  • Looking again with fresh eyes, we really don't need to get rid of the Unknown level though! So, I will simplify to this:
# add column with ordered levels with wrapped labels for visualization
delta_median_df$annotation_wrapped <- factor(
  delta_median_df$celltype,
  levels = levels(delta_median_df$celltype),
  labels = stringr::str_wrap(levels(delta_median_df$celltype), 30)
)
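
The behavior described in this comment can be reproduced on toy data. This is a sketch under stated assumptions: the cell type names are invented, and base R `strwrap()` stands in for `stringr::str_wrap()` so that the example has no package dependencies.

```r
celltype_levels <- c(
  "T cell",
  "CD14-positive, CD16-negative classical monocyte",
  "Unknown cell type"
)
df <- data.frame(
  celltype = factor(celltype_levels, levels = celltype_levels)
)

# Filtering out rows does not drop the now-unused "Unknown cell type" level
df <- df[df$celltype != "Unknown cell type", , drop = FALSE]
remaining_levels <- levels(df$celltype)

# Wrap each level at about 30 characters; strwrap() is a base R stand-in
# for stringr::str_wrap()
wrap_label <- function(x, width = 30) {
  paste(strwrap(x, width = width), collapse = "\n")
}
wrapped <- vapply(remaining_levels, wrap_label, character(1), USE.NAMES = FALSE)

# Same level order as before, with wrapped text as the displayed labels
df$annotation_wrapped <- factor(
  df$celltype,
  levels = remaining_levels,
  labels = wrapped
)
```

Because `levels` and `labels` are matched by position, the wrapped factor inherits the original order, and the unused level is simply never drawn.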

legend.title = element_text(size = rel(0.75)),
legend.text = element_text(size = rel(0.75)),
legend.position = "bottom"
)
Member

I think we want to add a median point here too? I'm not sure what color though since red is being used for the cells that were pruned.

Member Author

I think blue would probably be fine for median. One question though is how this stat should deal with the current grouping. I feel like it would be best if the median only reflected the black points? Any thoughts?

Also, do you think it would be too busy to also make the red points a different shape, like a diamond or something? It might make them easier to spot?

Member

I feel like it would be best if the median only reflected the black points? Any thoughts?

This makes sense to me.

Also, do you think it would be too busy to also make the red points a different shape, like a diamond or something? It might make them easier to spot?

I don't think I would make them a different color and a different shape, that feels like it might be a lot. I might make the median a different shape or a line though.
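
One way to get a median that reflects only the non-pruned (black) points is to summarize a filtered copy of the data and draw that summary as its own layer. The sketch below uses invented values; the column names `celltype`, `delta_median`, and `pruned` are assumptions and may not match the report code.

```r
# Toy data: one row per cell, with a flag for cells whose label was pruned
plot_df <- data.frame(
  celltype = c("T cell", "T cell", "T cell", "T cell", "B cell", "B cell", "B cell"),
  delta_median = c(0.30, 0.34, 0.32, 0.05, 0.20, 0.24, 0.03),
  pruned = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Summarize only the cells that were kept (pruned == FALSE)
kept_df <- plot_df[!plot_df$pruned, ]
median_df <- aggregate(delta_median ~ celltype, data = kept_df, FUN = median)
```

A separate summary table like `median_df` can then be supplied as the `data` argument of an additional plot layer, so the median's shape and color are set independently of the per-cell points.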

@sjspielman
Member Author

I've updated the plot as discussed and simplified that factor code, so this is ready for another look!
qc_report.html.zip

One important bit: In 601087b, I made some updates which could be reverted. This commit sets things up if we want to pass in the workflow seed to the QC report, for the sina plot layout. But for this to work, we'd need some small changes over in scpcaTools::generate_qc_report().
If we want to take this route for the seed then, two ways forward:

  • make scpcaTools compatible, then merge this PR
  • revert that commit, hardcode the seed for now in this PR. Later, we could circle back with a new PR to set the seed from the workflow seed, after making scpcaTools compatible

@jashapiro
Member

I've updated the plot as discussed and simplified that factor code, so this is ready for another look! qc_report.html.zip

One important bit: In 601087b, I made some updates which could be reverted. This commit sets things up if we want to pass in the workflow seed to the QC report, for the sina plot layout. But for this to work, we'd need some small changes over in scpcaTools::generate_qc_report(). If we want to take this route for the seed then, two ways forward:

  • make scpcaTools compatible, then merge this PR
  • revert that commit, hardcode the seed for now in this PR. Later, we could circle back with a new PR to set the seed from the workflow seed, after making scpcaTools compatible

You should be able to use the extra_params argument to scpcaTools::generate_qc_report() to pass in the seed. It is there just so we don't need to update the function every time we make changes to the template!

Member

@allyhawkins allyhawkins left a comment

This looks good to me. I will hold off on approving though until we add in the seed argument in scpcaTools.

bin/sce_qc_report.R (resolved)
)
}

prepare_annotation_values(cellassign_celltype_annotation)
Member

I think this needs to be assigned to a variable?

Member Author

@sjspielman sjspielman Sep 6, 2023

It is! see line 79 (not part of diff) too :)

delta_median_df <- tibble::tibble(
delta_median = rowMaxs(singler_scores) - rowMedians(singler_scores),
# Need to grab the non-pruned label for this plot
ontology = metadata(processed_sce)$singler_result$labels,
Member

is this the ontology id or the ontology id label?

Member Author

These are the labels that were actually assigned, which are ontology ids. I needed to grab this vector since we don't want the pruned labels for this plot. But then I need to make sure we don't use ontology ids in the plot, but the actual cell type names.

All that said, I realize I need to tweak some things here to make sure this works if, for some reason, ontology ids weren't used for singler annotation..
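
For reference, the delta median calculation from the diff can be sketched on a toy score matrix. All values are invented. Base R `apply()` is used here in place of `rowMaxs()` and `rowMedians()`, and treating pruned labels as `NA` is an assumption about how pruned cells are marked.

```r
# Toy score matrix: rows are cells, columns are reference labels
singler_scores <- matrix(
  c(
    0.80, 0.40, 0.30,
    0.50, 0.45, 0.48,
    0.20, 0.70, 0.25
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), c("T cell", "B cell", "NK cell"))
)

# Delta median: best score minus the median score across labels, per cell
delta_median <- apply(singler_scores, 1, max) - apply(singler_scores, 1, median)

# Full (non-pruned) label: the label with the top score for each cell
full_labels <- colnames(singler_scores)[max.col(singler_scores, ties.method = "first")]

# Assumed convention: the pruned label is NA where the assignment was rejected
pruned_labels <- c("T cell", NA, "B cell")

delta_median_df <- data.frame(
  celltype = full_labels,
  delta_median = delta_median,
  pruned = is.na(pruned_labels)
)
```

Using the full labels keeps every cell on the x-axis, while the `pruned` flag carries the information needed to color the points that were pruned out.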

Member Author

Made some changes to this end in 6ba6351 (plus bonus forcats cleanup code from @jashapiro)

bin/sce_qc_report.R (outdated, resolved)
Co-authored-by: Joshua Shapiro <[email protected]>
Member

@allyhawkins allyhawkins left a comment

LGTM 🚀

@sjspielman sjspielman merged commit 4efef62 into development Sep 6, 2023
3 checks passed
@sjspielman sjspielman deleted the sjspielman/410-qc-singler-median-delta branch September 6, 2023 21:26