Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838

UTSouthwesternDSSR · 2024-10-23T04:01:47Z

Purpose/implementation Section

This is a follow-up to the previous PR (#815), exploring the results of inferCNV on non-ETP samples. The normal cells used for inferCNV are the new B cells identified with the stringent ScType B cell strategy (i.e., passing the cutoff of the 99th percentile of the non-B ScType score on B cell clusters).

Please link to the GitHub issue that this pull request addresses.

#811

What is the goal of this pull request?

To distinguish tumor cells from normal cells using inferCNV.

Briefly describe the general approach you took to achieve this goal.

I used the new B cells identified with the stringent ScType B cell strategy as the normal cells when running inferCNV. I tried to annotate the tumor cells following the approach implemented for the Ewing samples (04-infercnv.html; the script is not shown in the repository), but it does not really work on my samples (results are shown in the next section). I am therefore considering identifying the non-malignant cells from infercnv.png, although I am not sure how to automate that.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/infercnv_output

What types of results does your code produce (e.g., table, figure)?

_infercnv.png and _run.final.infercnv_obj for each sample (except SCPCL000082, which does not have any new B cells)

What is your summary of the results?

I ran inferCNV using the new B cells, following the code implemented for Ewing:

infercnv_obj <- infercnv::run(
  infercnv_obj,
  cutoff = 0.1, # use 1 for smart-seq, 0.1 for 10x-genomics
  out_dir = file.path(scratch_dir, ind.lib), # save all intermediate files to scratch dir
  denoise = TRUE,
  HMM = TRUE,
  save_rds = TRUE,
  num_threads = parallel::detectCores() - 1 # leave one core free
)

However, this gives an error at step 15 when running plot_cnv(), as shown below. I found that the code above runs with analysis_mode = "subclusters", and I managed to pick up from where it left off when I reran the code.

When I ran with analysis_mode = "samples", I did not encounter any errors. I ran inferCNV on my own lab server with 47 cores. It took ~10 hours (a very rough estimate) in subclusters mode vs. ~2-3 hours in samples mode; here are the results for SCPCL000076. I don't think there is much difference between the two modes.

However, when I tried to assign tumor/normal labels based on the inferCNV results, using both the total number of CNVs and the mean proportion, only the run in subclusters mode showed variation in the number of CNVs across observation cells. In samples mode, all observation cells have the same number of CNVs. Following a similar approach, if we were to annotate tumor cells based on the total number of CNVs or the mean proportion, Late Eryth would be considered tumor. Looking at infercnv.png, however, I believe that the Late Eryth cells (in blue parentheses) show a relatively more copy-number-neutral profile than the other observations.
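A minimal sketch of the per-cell CNV count discussed above, on toy data. It assumes a genes-by-cells matrix of HMM states in which state 3 is copy-number neutral (as in inferCNV's six-state model). The matrix, the function names, and the `min_altered` threshold are hypothetical, and counting altered genes is a simplification of counting CNV regions. This is not the Ewing module's code.

```r
# Toy HMM state matrix (genes x cells); state 3 is treated as neutral.
states <- matrix(
  c(3, 3, 3, 3,
    3, 2, 2, 3,
    3, 4, 4, 3,
    3, 1, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4))
)

# Number of genes per cell whose state differs from neutral.
count_altered_genes <- function(state_matrix, neutral = 3) {
  colSums(state_matrix != neutral)
}

# TRUE only if the counts differ between cells; a threshold is
# uninformative when every cell has the same count.
has_variation <- function(counts) {
  length(unique(counts)) > 1
}

# Hypothetical rule: a cell with at least `min_altered` altered genes is "tumor".
label_cells <- function(counts, min_altered = 1) {
  ifelse(counts >= min_altered, "tumor", "normal")
}

cnv_counts <- count_altered_genes(states)
cell_labels <- label_cells(cnv_counts)
```

Checking `has_variation()` before thresholding would flag the samples-mode situation described above, where every observation cell carries the same count.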

Here I am showing the 4 samples that did not do well (due to a lack of B cells). Using the relatively few new B cells identified, SCPCL000704 and SCPCL000077 show some copy-number calls, compared to SCPCL000710 and SCPCL000706.

SCPCL000703 is indeed the cleanest sample, and we observe predictions consistent with those from CopyKat. I annotated tumor/normal by inspecting infercnv.png and deciding on the number of clusters needed to tease out the non-malignant cells from the observations (in this case, cluster 3).

This involves some manual work, and I am not sure how to proceed with inferCNV in samples mode: subclusters mode takes too long to run, and samples mode shows no variation in the number of CNVs across observations.
Based on the biology of T-ALL, the known copy-number variations are deletion of chr13q, deletion of CDKN2A/B (chr9p), and loss of heterozygosity in chr6q (and possibly SUZ12 deletion, located on chr17q) for some samples. Generally, I observed copy-number loss on chr6 in all samples, and SCPCL000703 shows a chr5q deletion (observed in some subtypes of T-ALL). I haven't yet gone through the copy-number losses/gains in all samples in detail to check them against the known biology.
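A rough sketch of how the manual dendrogram cut described above might be partly automated, on simulated data. It assumes a cells-by-genes matrix of inferCNV-style residual expression centered at 1 (neutral). The choice of `k = 3`, the Ward linkage, and the "smallest mean deviation from 1" rule are illustrative assumptions, not the procedure used in this PR.

```r
set.seed(1)

# Simulated residual expression (cells x genes): one near-neutral group,
# one group with losses, one group with gains.
neutral <- matrix(rnorm(10 * 20, mean = 1.0, sd = 0.01), nrow = 10)
loss    <- matrix(rnorm(10 * 20, mean = 0.8, sd = 0.01), nrow = 10)
gain    <- matrix(rnorm(10 * 20, mean = 1.2, sd = 0.01), nrow = 10)
expr <- rbind(neutral, loss, gain)
rownames(expr) <- paste0("cell", seq_len(nrow(expr)))

# Hierarchical clustering of cells, then cut the tree into k clusters.
tree <- hclust(dist(expr), method = "ward.D2")
clusters <- cutree(tree, k = 3)

# Mean absolute deviation from the neutral value, summarized per cluster.
deviation <- tapply(rowMeans(abs(expr - 1)), clusters, mean)

# The cluster closest to neutral is a candidate for non-malignant cells.
most_neutral <- as.integer(names(which.min(deviation)))
candidate_normal <- names(clusters)[clusters == most_neutral]
```

The number of clusters still has to be chosen per sample, so this reduces rather than removes the manual step.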

Please let me know what you think about this. Thank you!

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g., renv or conda) to run this code?
For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

UTSouthwesternDSSR · 2024-10-24T20:56:07Z

I looked into the inferCNV results a little more, and I believe it gives better results for some samples (SCPCL000081, SCPCL000706, SCPCL000080, and SCPCL000710). Based on the strategy I mentioned above, I think clusters 9 and 5 (Leiden cluster IDs) from SCPCL000081 and SCPCL000706, respectively, are non-malignant: when the other inferCNV clusters are compared to the cluster of interest, "diseases ..." terms are detected (marked by red arrows). The enrichment used markers from the FindMarkers() function, filtered by p_val_adj < 0.05.

Although SCPCL000706 is one of the samples with few genes detected, the results seem quite promising, given the clear separation of inferCNV cluster 1?

Similarly, we observe the same enrichment results (for the part marked by red arrows) for SCPCL000080 and SCPCL000710, although the clusters produced by inferCNV are not clearly separated on the UMAP.

For the other samples, I did not observe a group that is relatively more copy-number neutral in infercnv.png. Please let me know what you think about this. Thank you!

jaclyn-taroni

Hi @UTSouthwesternDSSR, thanks for this contribution! We appreciate your patience while the organizing team was out of the office (#832).

My ten-thousand-foot view comments are:

Unfortunately, I don’t know how much more you can automate. From your results, I expect some examination of individual samples to be necessary, and I think that’s what you got started in Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838 (comment).
The losses on chr 6 observed for every sample seem like they are likely to be an artifact (MHC genes; infercnv docs) and could be due to the use of B cells as the reference. That would make me hesitant to use the total number of CNVs alone (although that may not be possible when using analysis_mode = “samples” from your comment?).
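One way to probe the chr6 concern could be to drop genes in the MHC region from the gene order table before counting CNVs, then compare the calls. The sketch below assumes a data frame with `gene`, `chr`, `start`, and `end` columns. The default window (roughly 28.5 to 33.5 Mb on chr6) is only an approximate placeholder that should be checked against the genome build in use, and the gene names are hypothetical.

```r
# Toy gene order table; gene names and positions are made up.
gene_order <- data.frame(
  gene  = c("gene_a", "gene_b", "gene_c"),
  chr   = c("chr6", "chr6", "chr1"),
  start = c(1.00e6, 30.0e6, 30.0e6),
  end   = c(1.01e6, 30.1e6, 30.1e6),
  stringsAsFactors = FALSE
)

# Remove genes that overlap a genomic window.
# The default window is an approximate stand-in for the MHC region.
drop_region_genes <- function(gene_order, chr = "chr6",
                              start = 28.5e6, end = 33.5e6) {
  overlaps <- gene_order$chr == chr &
    gene_order$start <= end &
    gene_order$end >= start
  gene_order[!overlaps, , drop = FALSE]
}

filtered_order <- drop_region_genes(gene_order)
```

If the chr6 losses and the tumor/normal split both disappear after filtering, that would support the artifact explanation. If the split persists, the clusters are being driven by something else.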

Other comments that primarily influence the review of analyses/cell-type-nonETP-ALL-03/scripts/run_infercnv.R and analyses/cell-type-nonETP-ALL-03/scripts/00-make-gene-order-file.R:

I understand that analysis_mode = “subclusters” takes a long time (and possibly longer than you want to invest!), but do you think the error only occurs when samples have very few new B cells? Could we handle that error by only proceeding when the number of reference cells exceeds the kNN setting? That doesn’t solve the time problem, of course, but that could potentially be mitigated by running samples in parallel instead of sequentially.
- I am not totally convinced that the results for SCPCL000076 are equivalent regardless of the analysis mode choice.
Is there a reason not to use the infercnv script to set up the gene order file?
Does the infercnv::run() output in scratch give us enough information to check for biology (asking about the map_metadata_from_infercnv.txt file specifically)? Does the choice of analysis mode greatly influence our ability to dig into the alterations observed in individual libraries?
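A sketch of the two ideas above: skip libraries whose reference cell count does not exceed the kNN setting, and process the remaining libraries in parallel. The library IDs and cell counts are hypothetical, `k_nn = 20` is a placeholder for whatever kNN value the run actually uses, and `run_one_library()` is a stub standing in for the real infercnv call. Whether this guard actually avoids the step 15 error is not established here.

```r
# Hypothetical reference (B cell) counts per library.
n_reference <- c(lib_A = 12, lib_B = 150, lib_C = 772)

# Keep only libraries with more reference cells than the kNN setting.
select_runnable_libraries <- function(n_reference, k_nn = 20) {
  names(n_reference)[n_reference > k_nn]
}

# Stub standing in for the per-library inferCNV run.
run_one_library <- function(lib) {
  paste("would run inferCNV for", lib)
}

runnable <- select_runnable_libraries(n_reference)

# mclapply() forks processes, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
results <- parallel::mclapply(runnable, run_one_library, mc.cores = n_cores)
names(results) <- runnable
```

Running libraries side by side trades per-library threads for throughput, so `num_threads` inside each run would need to be lowered to match.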

I’ll also add that I would be interested in what genes are overrepresented in the disease gene sets. If I understand you correctly, the expression of these genes is lower in the cluster you expect to include non-malignant cells. This is probably a comment for a future PR.

Please let us know if there’s anything you’d like to discuss or if there is any way we can help! Thanks again.

jaclyn-taroni · 2024-10-28T12:05:01Z

analyses/cell-type-nonETP-ALL-03/scripts/00-make-gene-order-file.R

Did you try using the gtf_to_position_file.py script that is part of infercnv? That appears to be the recommendation in the infercnv docs: https://github.com/broadinstitute/inferCNV/wiki/instructions-create-genome-position-file

Is there a reason a custom R script is required? Could you write a shell script that handles the download and runs the Python script?

I actually got the script from the Ewing modules and followed the approach there. I guess this is not the best way?

Oh, let's see if someone more involved with the Ewing module can weigh in! Cc: @jashapiro

I don't think there was any specific reason for the script; it should be a fairly simple transformation either way. The gtf_to_position_file.py script is not distributed with infercnv, which means that it would need to be downloaded and included somehow here... If we can confirm that both are doing the same thing, I am not sure we really need to worry too much about the method we use, but we might want to move this script to somewhere that multiple analyses could access it rather than making copies.
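For reference, the transformation in question can be sketched in base R as below. It keeps `gene` records from a GTF and writes the tab-delimited, header-less gene / chromosome / start / end table that inferCNV reads as its gene order file. This is neither a port of gtf_to_position_file.py nor the module's 00-make-gene-order-file.R; the toy GTF lines and the function name are hypothetical.

```r
# Toy GTF content; real input would come from readLines() on a GTF file.
gtf_lines <- c(
  "##description: toy GTF",
  "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tgene_id \"ENSG0001\"; gene_name \"GENE_A\";",
  "chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"ENSG0001\"; gene_name \"GENE_A\";",
  "chr2\tsrc\tgene\t50\t900\t.\t-\t.\tgene_id \"ENSG0002\"; gene_name \"GENE_B\";"
)

gtf_to_gene_order <- function(lines, attribute = "gene_id") {
  lines <- lines[!startsWith(lines, "#")]           # drop header comments
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # keep well-formed records whose feature type is "gene"
  fields <- Filter(function(f) length(f) >= 9 && f[3] == "gene", fields)
  pattern <- paste0(attribute, " \"([^\"]+)\"")
  gene <- vapply(fields, function(f) {
    m <- regmatches(f[9], regexec(pattern, f[9]))[[1]]
    if (length(m) == 2) m[2] else NA_character_
  }, character(1))
  out <- data.frame(
    gene  = gene,
    chr   = vapply(fields, `[`, character(1), 1),
    start = as.integer(vapply(fields, `[`, character(1), 4)),
    end   = as.integer(vapply(fields, `[`, character(1), 5)),
    stringsAsFactors = FALSE
  )
  # inferCNV expects each gene once; keep the first occurrence
  out[!is.na(out$gene) & !duplicated(out$gene), ]
}

gene_order <- gtf_to_gene_order(gtf_lines)
gene_order_file <- tempfile(fileext = ".txt")
write.table(gene_order, gene_order_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Comparing this kind of output against the Python script's output on the same GTF would be one way to confirm that both are doing the same thing.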

UTSouthwesternDSSR · 2024-10-28T16:54:57Z

There are 12 disease events under "Disease" (R-HSA-1643685), as shown in the image below:

I think a lot of genes are involved (not sure if you are still interested in taking a look?)

I guess my concern is more whether I should do this manual cutting of the dendrogram to find the "non-malignant" cells, because only SCPCL000703 seems to show a strong signal (I have little doubt about this sample, and the results are similar to CopyKat). I am not sure how you feel about the results for the other 4 samples I mentioned in the first comment. Technically, the signal is not very strong. I don't mind using this approach to annotate the 31 ETP-ALL samples, if the approach makes sense and is valid.

I am not totally convinced that the results for SCPCL000076 are equivalent regardless of the analysis mode choice.
In response to the above comment, I am trying to show that even in subclusters mode, Late Eryth is identified as tumor cells, which I don't think is true. That's why I am not sure it is worth investing the effort in subclusters mode. I also just tried SCPCL000703, with 772 B cells provided as normal, in subclusters mode, and infercnv::run still breaks at the same step.

jaclyn-taroni · 2024-10-28T19:25:55Z

Thanks for your comments, @UTSouthwesternDSSR!

I would say, in general, my willingness to trust the inferCNV results for malignant vs. non-malignant calls hinges on our ability to use it to detect — at least for some samples — changes we’d expect to see in T-ALL rather than the clusters from inferCNV. I am concerned that those clusters may be influenced by the chromosome 6 losses, which may be an artifact (and might show up in all human blood samples regardless of choice of reference — I don’t have any experience to say either way!). My thoughts about the marker genes are similar. Do we see the presence of known T-ALL marker genes if such gene sets exist? (I guess another way to get at this would be to understand how the chromosome 6 losses impact the clustering results, but that seems like more work to me, not less!)

I’m not sure if digging into known/previously observed alterations is technically feasible, but it is why I am asking about our ability to look at individual samples’ biology and if our choice of analysis mode impacts that.

UTSouthwesternDSSR · 2024-10-29T20:36:13Z

I guess it is hard to decide based on the biology, since the copy number alterations of interest (deletion in chr9, LOH in chr6, etc.) are not present in all tumor samples. So when they are not detected, I cannot tell whether the method is not working, these are not tumor cells, or these are tumor cells that don't have CNAs. I tried SCEVAN on a few samples, and it gives results similar to CopyKat. Generally, for samples like SCPCL000703 with clear separation, the results are similar regardless of which method (CopyKat, inferCNV, SCEVAN) we use. I also think that both pediatric cancers and leukemias are known to harbor a minimal number of genomic alterations, so inferring CNVs may not be the right approach for identifying tumor cells in pediatric T-ALL.

Since the due date is approaching, I will just re-run CopyKat with the stringent set of B cells as the normal cells, although I don't think the results will change significantly.

jaclyn-taroni

Hi @UTSouthwesternDSSR,

A few changes need to be made to the analyses/cell-type-nonETP-ALL-03/scripts/writeout_submission.R script before we can accept this submission.

I share your skepticism about solely using a tool to look at CNAs for tumor vs. normal calls in these samples – and I haven’t seen the SCEVAN results that agree – but I would like to get this in before the eligibility deadline.

jaclyn-taroni · 2024-10-30T20:54:47Z

analyses/cell-type-nonETP-ALL-03/scripts/writeout_submission.R

+writeout <- function(ind.lib, ct.colors = ct_color, project.ID = projectID, n.row = 1){
+  seu <- readRDS(file.path(out_loc,"results/rds",paste0(ind.lib,".rds")))
+  voi <- c('newB.copykat.pred','sctype_classification')
+  changeName.voi <- c('tumor_cell_classification','cell_type_assignment')


On the tumor_cell_classification column in the submission guidelines:

The values of this column should be either "tumor" or "normal."

Thank you for the kind reminder! Is it okay if I use "tumor", "normal", or "unknown"? Some cells are labeled "not.defined" by CopyKat.

Yes, including "unknown" sounds good.
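A minimal sketch of the relabeling agreed on here, assuming CopyKat's usual prediction values ("aneuploid", "diploid", "not.defined") and assuming that aneuploid maps to "tumor" and diploid to "normal". The function name is hypothetical, and anything outside the lookup, including missing values, becomes "unknown".

```r
# Map CopyKat predictions onto the labels allowed in the submission table.
recode_copykat <- function(pred) {
  lookup <- c(aneuploid = "tumor", diploid = "normal")
  out <- unname(lookup[as.character(pred)])
  out[is.na(out)] <- "unknown"   # "not.defined", NA, or unexpected values
  out
}

example_pred <- c("aneuploid", "diploid", "not.defined", NA)
tumor_cell_classification <- recode_copykat(example_pred)
```

Defaulting unmatched values to "unknown" keeps the column restricted to the three agreed labels even if CopyKat returns something unexpected.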

jaclyn-taroni · 2024-10-30T21:04:19Z

analyses/cell-type-nonETP-ALL-03/scripts/writeout_submission.R

+  final.df <- data.frame(scpca_sample_id=rep(project.ID, nrow(voi_df)), voi_df, 
+                         CL_ontology_id=gene.df$ontologyID[match(voi_df$cell_type_assignment,gene.df$cellName)])
+  write.table(final.df, sep = "\t", quote = F, row.names = F,
+              file = file.path(out_loc,"results/submission_table",paste0(ind.lib,"_metadata.tsv")))


Please include creating results/submission_table earlier in the script. I believe this is the cause of the current CI failure.
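Something along these lines near the top of the script would do it. In this sketch `out_loc` is pointed at a temporary directory purely so the example is self-contained; in the script it would be the module's existing output location, and the example file name is hypothetical.

```r
# Stand-in for the module's output location (assumption for this example).
out_loc <- tempfile("module_output_")

# Create the submission directory up front; recursive = TRUE also creates
# missing parents, and showWarnings = FALSE keeps reruns quiet.
submission_dir <- file.path(out_loc, "results", "submission_table")
dir.create(submission_dir, recursive = TRUE, showWarnings = FALSE)

# Later writes into the directory no longer depend on it already existing.
example_file <- file.path(submission_dir, "example_metadata.tsv")
write.table(data.frame(x = 1), file = example_file,
            sep = "\t", quote = FALSE, row.names = FALSE)
```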

jashapiro · 2024-10-31T15:18:13Z

.github/workflows/run_cell-type-nonETP-ALL-03.yml

          Rscript scripts/02-03_annotation.R
          Rscript scripts/04_multipanel_plot.R
          Rscript scripts/05_cluster_evaluation.R
+          Rscript scripts/06_sctype_exploration.R
+          Rscript scripts/07_run_copykat.R
+          Rscript scripts/markerGenes_submission.R
+          Rscript scripts/writeout_submission.R


Hi @UTSouthwesternDSSR,

I was looking at your PR trying to figure out the previous CI failure. It may be due to the small number of cells in the test data, but it may also be a stochastic failure, as I was able to run the whole workflow on a separate machine with the same test data. Nonetheless, it might be helpful for future debugging to add a few info messages like the ones below to help figure out where we are in the CI process.

Since you just added a small change, I will wait to see if that passes before proceeding too far!

Suggested change

-          Rscript scripts/02-03_annotation.R
-          Rscript scripts/04_multipanel_plot.R
-          Rscript scripts/05_cluster_evaluation.R
-          Rscript scripts/06_sctype_exploration.R
-          Rscript scripts/07_run_copykat.R
-          Rscript scripts/markerGenes_submission.R
-          Rscript scripts/writeout_submission.R
+          printf "\n\nRunning 02-03_annotation.R\n"
+          Rscript scripts/02-03_annotation.R
+          printf "\n\nRunning 04_multipanel_plot.R\n"
+          Rscript scripts/04_multipanel_plot.R
+          printf "\n\nRunning 05_cluster_evaluation.R\n"
+          Rscript scripts/05_cluster_evaluation.R
+          printf "\n\nRunning 06_sctype_exploration.R\n"
+          Rscript scripts/06_sctype_exploration.R
+          printf "\n\nRunning 07_run_copykat.R\n"
+          Rscript scripts/07_run_copykat.R
+          printf "\n\nRunning markerGenes_submission.R\n"
+          Rscript scripts/markerGenes_submission.R
+          printf "\n\nRunning writeout_submission.R\n"
+          Rscript scripts/writeout_submission.R

It looks like this is working now, so I just merged in the main branch. Whether you want to include the changes suggested above is up to you.

Sure, thank you! I think it is good to add some comments too. I am still making some minor changes to the script and output.

jaclyn-taroni

I checked the submission tables in the results bucket, and this is passing CI. I will approve this to mark it as approved before tomorrow's deadline.

UTSouthwesternDSSR and others added 3 commits October 23, 2024 02:41

adding inferCNV part

412d8fb

Merge remote-tracking branch 'origin/main' into UTSouthwesternDSSR/no…

b802413

…nETP

Merge branch 'AlexsLemonade:main' into main

c3c6b8c

UTSouthwesternDSSR requested a review from jaclyn-taroni as a code owner October 23, 2024 04:01

jaclyn-taroni added 2 commits October 25, 2024 14:39

Add jags to system dependencies installation

9a33d0b

Add Rhtslib installation step separately to Dockerfile

7f532da

jaclyn-taroni reviewed Oct 28, 2024

UTSouthwesternDSSR and others added 8 commits October 29, 2024 21:11

update scripts structure

a059dde

Merge remote-tracking branch 'origin/main' into UTSouthwesternDSSR/no…

0f075ed

…nETP

added marker genes table in final submission format

e31ab64

change directory structure

692be0d

change name

33b2be0

add script for rerun copykat

3588896

final submission script

32423db

Add new scripts to CI/CD

151559d

jaclyn-taroni reviewed Oct 30, 2024

UTSouthwesternDSSR added 3 commits October 30, 2024 21:54

update final submission

514a06d

Merge remote-tracking branch 'origin/main' into UTSouthwesternDSSR/no…

d56c5d6

…nETP

update submission script and output

b3478c7

jashapiro reviewed Oct 31, 2024

jashapiro and others added 4 commits October 31, 2024 10:34

Merge branch 'main' into main

8d9dd4c

update scripts and readme

7bde7e5

Merge remote-tracking branch 'origin/main' into UTSouthwesternDSSR/no…

9b36d48

…nETP

exploration plots for CopyKat prediction with fine-tuned B cells

74d9cad

jaclyn-taroni approved these changes Oct 31, 2024

UTSouthwesternDSSR mentioned this pull request Oct 31, 2024

Submission table for cell type and tumor classification of ETP T-ALL (SCPCP000003) #847

Merged

8 tasks

Merge branch 'main' into main

c3a7a7e

jaclyn-taroni merged commit 4974a67 into AlexsLemonade:main Nov 5, 2024
5 checks passed
