Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838

Merged
merged 21 commits into from
Nov 5, 2024

Conversation

UTSouthwesternDSSR
Copy link
Contributor

@UTSouthwesternDSSR UTSouthwesternDSSR commented Oct 23, 2024

Purpose/implementation Section

This is the follow-up of previous PR (#815) on exploring the results of inferCNV on non-ETP samples, using the new B cells identified from the stringent ScType B cell strategy (i.e. passing the cutoff of 99 percentile of non-B ScType score on B cell clusters) as the normal cells in running inferCNV.

Please link to the GitHub issue that this pull request addresses.

#811

What is the goal of this pull request?

Trying to distinguish tumor cells from normal using inferCNV method

Briefly describe the general approach you took to achieve this goal.

I used the new B cells identified from the stringent ScType B cell strategy, as the normal cells in running inferCNV. I tried to annotate the tumor cells following the approach implemented in Ewing samples (04-infercnv.html) [the script is not shown in the repository], but it doesn't really work on my samples (result is shown in next section). So I am considering to look for non-malignant cells from the infercnv.png, although I am not really sure how to automate it.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/infercnv_output

What types of results does your code produce (e.g., table, figure)?

_infercnv.png and _run.final.infercnv_obj for each sample (except SCPCL000082, which does not have any new B cells)

What is your summary of the results?

I ran the inferCNV using the new B cells, following the codes implemented in Ewing:

infercnv_obj <- infercnv::run(infercnv_obj,
  cutoff = 0.1, # use 1 for smart-seq, 0.1 for 10x-genomics
  out_dir = file.path(scratch_dir,ind.lib), # save all intermediate files to scratch dir
  denoise = T, HMM = T,
  save_rds = T,
  num_threads = parallel::detectCores() - 1
)

But it gives me error in step 15 when running plot_cnv() as shown below. I found out that the above code is running with analysis_mode = "subclusters", and I manage to pick up from where it left off when I reran the code.
Screenshot 2024-10-21 at 11 39 24 AM

When I ran with analysis_mode = "samples", I did not encounter any error. I actually ran inferCNV on my own lab server, with 47 cores. It took ~10 hours (very rough estimate) to run subclusters mode vs ~2-3 hours for samples mode, and here is the results for SCPCL000076. I don't think there is much difference between these 2 different modes.
Screenshot 2024-10-22 at 10 50 29 PM

However, when I am trying to assign tumor/normal label based on inferCNV results, using both total number of CNVs and mean proportion, only the one ran with subclusters mode showed variation in the number of CNVs across observation cells. All observation cells have the same number of CNVs for samples mode. Following the similar approach, if we were to annotate tumor cells based on the total number of CNVs or mean proportion, Late Eryth will considered as tumor, but if we were to look at the infercnv.png, I believe that the Late Eryth (in blue parenthesis) are showing relatively more copy-number neutral profile than the other observations.
Screenshot 2024-10-22 at 10 20 10 PM

Here I am showing the 4 samples that were not doing well (due to lack of B cells). Using the relatively few new B cells identified, SCPCL000704 and SCPCL000077 are showing some sort of copy-number calls, as compared to SCPCL000710 and SCPCL000706.
Screenshot 2024-10-22 at 10 25 57 PM

As for SCPCL000703, indeed it is the cleanest sample, and we observe consistent predictions as from the CopyKat. I annotate the tumor/normal by visualizing on the infercnv.png and decide on the number of clusters, to tease out the non-malignant cells from the observations (in this case, it is cluster 3).
Screenshot 2024-10-24 at 9 42 40 AM

This involves some sort of manual work, but I am also not really sure how to proceed with inferCNV (samples mode), since the subclusters mode is taking too long to run, and there is no variation observed in the number of CNV for all observations when using samples mode.
Based on the biology of T-ALL, the known copy-number variations are deletion of chr13q, CDKN2A/B (chr9p), and loss of heterozygosity in chr6q (maybe also SUZ12 deletion [located on chr17q]) for some samples. Generally, I observed copy-number loss in chr6 in all samples, and SCPCL000703 shows chr5q deletion (observed in some subtypes of T-ALL). I haven't gone through the copy-number loss/gain in all samples in detail to check for their biology.

Please let me know what you think about this. Thank you!

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g., renv or conda) to run this code?
For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@UTSouthwesternDSSR
Copy link
Contributor Author

I looked into the inferCNV results a little more, and actually I believe that it gives a better results for some samples (SCPCL000081, SCPCL000706,SCPCL000080, and SCPCL000710). Based on the strategy I mentioned above, I think clusters 9 and 5 (leiden clusterID) from SCPCL000081 and SCPCL000706 respectively, are non-malignant, since there is the detection of "diseases ..." (marked by red arrows) in other clusters from the inferCNV when compared to the cluster we are interested in, using FindMarker() function and filtered by p_val_adj<0.05 in getting the enrichment.
Screenshot 2024-10-24 at 3 47 08 PM
Screenshot 2024-10-24 at 3 46 50 PM
Although SCPCL000706 is one of the samples that have few genes detected, it seems like the results is quite promising, due to clear separation of inferCNV cluster1?

Similarly, we observe the same enrichment results (for the red arrows part) for SCPCL000080, and SCPCL000710, although the clusters produced by inferCNV do not provide a clear separation on the umap.
Screenshot 2024-10-24 at 3 49 59 PM
Screenshot 2024-10-24 at 3 50 50 PM

As for the other samples, I did not observe a group that is relatively more copy-number neutral from the inferCNV.png. Please let me know what you think about this. Thank you!

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @UTSouthwesternDSSR, thanks for this contribution! We appreciate your patience while the organizing team was out of the office (#832).

My ten thousand-foot view comments are:

  • Unfortunately, I don’t know how much more you can automate. From your results, I expect some examination of individual samples to be necessary, and I think that’s what you got started in Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838 (comment).
  • The losses on chr 6 observed for every sample seem like they are likely to be an artifact (MHC genes; infercnv docs) and could be due to the use of B cells as the reference. That would make me hesitant to use the total number of CNVs alone (although that may not be possible when using analysis_mode = “samples” from your comment?).

Other comments that primarily influence the review of analyses/cell-type-nonETP-ALL-03/scripts/run_infercnv.R and analyses/cell-type-nonETP-ALL-03/scripts/00-make-gene-order-file.R:

  • I understand that analysis_mode = “subclusters” takes a long time (and possibly longer than you want to invest!), but do you think the error only occurs when samples have very few new B cells? Could we handle that error by only proceeding when the number of reference cells exceeds the kNN setting? That doesn’t solve the time problem, of course, but that could potentially be mitigated by running samples in parallel instead of sequentially.
    • I am not totally convinced that the results for SCPCL000076 are equivalent regardless of the analysis mode choice.
  • Is there a reason not to use the infercnv script to set up the gene order file?
  • Does the infercnv::run() output in scratch give us enough information to check for biology (asking about the map_metadata_from_infercnv.txt file specifically)? Does the choice of analysis mode greatly influence our ability to dig into the alterations observed in individual libraries?

I’ll also add that I would be interested in what genes are overrepresented in the disease gene sets. If I understand you correctly, the expression of these genes is lower in the cluster you expect to include non-malignant cells. This is probably a comment for a future PR.

Please let us know if there’s anything you’d like to discuss or if there is any way we can help! Thanks again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try using the gtf_to_position_file.py script that is part of infercnv? That appears to be the recommendation in the infercnv docs: https://github.com/broadinstitute/inferCNV/wiki/instructions-create-genome-position-file

Is there a reason a custom R script is required? Could you write a shell script that handles the download and runs the Python script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually got the script from the Ewing modules and followed the approach there. I guess this is not the best way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, let's see if someone more involved with the Ewing module can weigh in! Cc: @jashapiro

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there was any specific reason for the script; it should be a fairly simple transformation either way. The gtf_to_position_file.pyscript is not distributed with infercnv, which means that it would need to be downloaded and included somehow here... If we can confirm that both are doing the same thing, I am not sure we really need to worry too much about the method we use, but we might want to move this script to somewhere that multiple analyses could access it rather than making copies.

@UTSouthwesternDSSR
Copy link
Contributor Author

There are 12 diseases events participating under the "Disease" (R-HSA-1643685), as shown in the image below:
Screenshot 2024-10-28 at 11 20 03 AM
Screenshot 2024-10-28 at 11 20 19 AM
I think there are a lot of genes getting involved (not sure if you are still interested to take a look?)

I guess my concern is more like should I do this manual cutting of dendrogram to find the "non-malignant" cells? Because it seems like only SCPCL000703 is showing the strong signal (I would say not much doubt on this sample, and the results is similar to the CopyKat). I am not sure how do you feel about the results for the other 4 samples I mentioned in the first comment. Technically, the signal is not very strong. I don't mind to use this approach to annotate the 31 ETP-ALL samples, if this approach makes sense and is valid.

I am not totally convinced that the results for SCPCL000076 are equivalent regardless of the analysis mode choice.
In response to the above comment, I guess I am trying to show that even with subclusters mode, the Late Eryth is identified as tumor cells, which I don't think is true. That's why I am not even sure whether it is worthy to invest the effort in using subclusters mode. I just tried with SCPCL000703 with 772 B cells provided as normal in the subclusters mode, the infercnv::run still breaks at the same step.

@jaclyn-taroni
Copy link
Member

Thanks for your comments, @UTSouthwesternDSSR!

I would say, in general, my willingness to trust the inferCNV results for malignant vs. non-malignant calls hinges on our ability to use it to detect — at least for some samples — changes we’d expect to see in T-ALL rather than the clusters from inferCNV. I am concerned that those clusters may be influenced by the chromosome 6 losses, which may be an artifact (and might show up in all human blood samples regardless of choice of reference — I don’t have any experience to say either way!). My thoughts about the marker genes are similar. Do we see the presence of known T-ALL marker genes if such gene sets exist? (I guess another way to get at this would be to understand how the chromosome 6 losses impact the clustering results, but that seems like more work to me, not less!)

I’m not sure if digging into known/previously observed alterations is technically feasible, but it is why I am asking about our ability to look at individual samples’ biology and if our choice of analysis mode impacts that.

@UTSouthwesternDSSR
Copy link
Contributor Author

I guess it is hard to decide based on the biology, since the detected copy number alterations (deletion in chr 9, LOH in chr6 etc) do not present in all tumor samples. So when these are not detected, I cannot verify whether is the method not working, or these are not tumor cells, or these are tumor cells that don't have CNA. I tried SCEVAN for few samples, and they gives results similar to CopyKat. Generally, for those samples like SCPCL000703 with clear separation, the results is similar regardless which method (CopyKat, inferCNV, SCEVAN) we used. And I think that both pediatric cancers and leukemias are known to harbor a minimal number of genomic alterations. So inferring CNV may not be the right approach to identify tumor cells in pediatric T-ALL.

Since the due date is coming, I will just re-run the CopyKat with the stringent set of B cells as the normal, although I don't think the results will be changed significantly.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @UTSouthwesternDSSR,

A few changes need to be made to the analyses/cell-type-nonETP-ALL-03/scripts/writeout_submission.R script before we can accept this submission.

I share your skepticism about solely using a tool to look at CNAs for tumor vs. normal calls in these samples – and I haven’t seen the SCEVAN results that agree – but I would like to get this in before the eligibility deadline.

writeout <- function(ind.lib, ct.colors = ct_color, project.ID = projectID, n.row = 1){
seu <- readRDS(file.path(out_loc,"results/rds",paste0(ind.lib,".rds")))
voi <- c('newB.copykat.pred','sctype_classification')
changeName.voi <- c('tumor_cell_classification','cell_type_assignment')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the tumor_cell_classification column in the submission guidelines:

The values of this column should be either "tumor" or "normal."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the kind reminder! Is it okay, if I use "tumor", "normal", or "unknown", because there are some cells labeled as "not.defined" from CopyKat?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, including "unknown" sounds good.

final.df <- data.frame(scpca_sample_id=rep(project.ID, nrow(voi_df)), voi_df,
CL_ontology_id=gene.df$ontologyID[match(voi_df$cell_type_assignment,gene.df$cellName)])
write.table(final.df, sep = "\t", quote = F, row.names = F,
file = file.path(out_loc,"results/submission_table",paste0(ind.lib,"_metadata.tsv")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include creating results/submission_table earlier in the script. I believe this is the cause of the current CI failure.

Comment on lines 92 to 98
Rscript scripts/02-03_annotation.R
Rscript scripts/04_multipanel_plot.R
Rscript scripts/05_cluster_evaluation.R
Rscript scripts/06_sctype_exploration.R
Rscript scripts/07_run_copykat.R
Rscript scripts/markerGenes_submission.R
Rscript scripts/writeout_submission.R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @UTSouthwesternDSSR,

I was looking at your PR trying to figure out the previous CI failure. It may be due to the small number of cells in the test data, but it may also be a stochastic failure, as I was able to run the whole workflow on a separate machine with the same test data. Nonetheless, it might be helpful for future debugging to add a few info messages like the ones below to help figure out where we are in the CI process.

Since you just added a small change, I will wait to see if that passes before proceeding too far!

Suggested change
Rscript scripts/02-03_annotation.R
Rscript scripts/04_multipanel_plot.R
Rscript scripts/05_cluster_evaluation.R
Rscript scripts/06_sctype_exploration.R
Rscript scripts/07_run_copykat.R
Rscript scripts/markerGenes_submission.R
Rscript scripts/writeout_submission.R
printf "\n\nRunning 02-03_annotation.R\n"
Rscript scripts/02-03_annotation.R
printf "\n\nRunning 04_multipanel_plot.R\n"
Rscript scripts/04_multipanel_plot.R
printf "\n\nRunning 05_cluster_evaluation.R\n"
Rscript scripts/05_cluster_evaluation.R
printf "\n\nRunning 06_sctype_exploration.R\n"
Rscript scripts/06_sctype_exploration.R
printf "\n\nRunning 07_run_copykat.R\n"
Rscript scripts/07_run_copykat.R
printf "\n\nRunning markerGenes_submission.R\n"
Rscript scripts/markerGenes_submission.R
printf "\n\nRunning writeout_submission.R\n"
Rscript scripts/writeout_submission.R

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is working now, so I just merged in the main branch. Whether you want to include the changes suggested above is up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thank you! I think it is good to add some comments too. I am still doing some minor change on the script and output.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the submission tables in the results bucket, and this is passing CI. I will approve this to mark it as approved before tomorrow's deadline.

@jaclyn-taroni jaclyn-taroni merged commit 4974a67 into AlexsLemonade:main Nov 5, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants