This repository has been archived by the owner on Jan 6, 2021. It is now read-only.

whether to use deduplicated bam #29

Open
coolbubu opened this issue Jan 3, 2019 · 14 comments

@coolbubu

coolbubu commented Jan 3, 2019

When I ran MSIsensor, I found that the results differ considerably between the deduplicated BAM and the non-deduplicated BAM. Which BAM should be used: the deduplicated one or the non-deduplicated one?

not_deduplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
9739	1501	15.41

deduplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
8798	122	1.39
@liangkaiye
Contributor

liangkaiye commented Jan 3, 2019 via email

@coolbubu
Author

coolbubu commented Jan 3, 2019

It is WES; the mean coverage is 180 and the dup_ratio is 54.88%.

@Beifang
Collaborator

Beifang commented Jan 7, 2019

How did you remove duplicates? The dup ratio looks very high.

@micknudsen

I have noticed the same behavior and now routinely run msisensor on dedupped BAMs (obtained using samtools view -F 1024). The results are then much closer to the MSI status obtained by an orthogonal method.
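To illustrate what the `-F 1024` filter above selects on, here is a minimal Python sketch working on plain SAM text lines. In the SAM format, FLAG bit 0x400 (decimal 1024) marks a read as a PCR or optical duplicate, and `-F` excludes reads that have any of the given bits set. The function names and the toy records are hypothetical and only for illustration; this is not how samtools is implemented, and it assumes duplicates were already marked by an upstream tool.

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit of the SAM FLAG field


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Yield header lines and alignment lines whose duplicate bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines are passed through unchanged
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_duplicate(flag):
            yield line


# Toy records (hypothetical): read2 has FLAG 1123 = 99 + 1024, so it is a marked duplicate.
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
]
kept = list(drop_duplicates(records))
for line in kept:
    print(line.split("\t")[0])
```

Reads that were never marked as duplicates carry no 0x400 bit, so a filter like this only helps after a duplicate-marking step has been run.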

@ZhaoDanOnGitHub

You should use the deduplicated BAMs. In the end, you can get the correct results only by using the data that you think is the cleanest.

@guodudou

I am wondering whether a BAM with marked duplicates is sufficient, or whether I have to export the deduplicated reads to a separate BAM. Thanks!

@micknudsen

Marking duplicates is not sufficient. There is often a notable difference between using a BAM file with duplicates marked and with duplicates removed.

@guodudou

Thank you very much for the quick response! In addition, there is a closed issue where people suggested using coverage normalization. I find that the score changes only slightly with it, but that is enough to classify samples with scores near the 3.5% cutoff differently. Do you have any suggestions? Many thanks!

@Beifang
Collaborator

Beifang commented May 14, 2019

We suggest: MSI_H: msiscore >= 10%; MSI_L: 3.5% <= msiscore < 10%; MSS: msiscore < 3.5%.
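As a rough illustration of these suggested cutoffs, the sketch below computes the score as the percentage of somatic sites among all sites (which matches the `%` column in the first post) and maps it to a class. The function names are hypothetical, and the inputs are the site counts reported at the top of this thread.

```python
def msi_score(somatic_sites: int, total_sites: int) -> float:
    """Percentage of sites called somatic, as in the '%' column of the output."""
    return 100.0 * somatic_sites / total_sites


def classify(score_percent: float) -> str:
    """Apply the suggested cutoffs: >= 10% MSI_H, 3.5% to < 10% MSI_L, else MSS."""
    if score_percent >= 10.0:
        return "MSI_H"
    if score_percent >= 3.5:
        return "MSI_L"
    return "MSS"


# Site counts taken from the first post of this thread.
for label, somatic, total in [
    ("not deduplicated", 1501, 9739),
    ("deduplicated", 122, 8798),
]:
    score = msi_score(somatic, total)
    print(f"{label}: {score:.2f}% -> {classify(score)}")
# not deduplicated: 15.41% -> MSI_H
# deduplicated: 1.39% -> MSS
```

With these cutoffs, the same sample from the first post lands in MSI_H without deduplication and in MSS with it, which is why the choice of BAM matters.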

@guodudou

Thank you very much for the great information! Do you suggest coverage normalization for normal and tumor samples? Thanks!

@Beifang
Collaborator

Beifang commented May 15, 2019

We didn't normalize the TCGA UCEC data (msiscore: 3.5%) in the original version of MSIsensor. You can test with and without the normalization option. We suggest choosing this option when normal and tumor coverage are very different.

@guodudou

Thank you very much! Can you please specify how you implement coverage normalization and/or how normalization affects the length distribution / MSI calling? This is very important to me because my samples are classified as MSI_H with normalization and as MSI_L without it. Thanks!

@ZhaoDanOnGitHub

A difference in sequencing depth between tumor tissue and normal tissue affects the judgment of whether a site is stable. Therefore, we normalize the read distributions so that their areas are of the same magnitude. Specifically, we compare the sequencing depth of the normal and tumor tissues at the site and scale up the data from the tissue with the smaller depth, that is,
the number of supporting reads after normalization at the site = the number of supporting reads * (max / min),
where max is the total number of supporting reads at the site in the tissue with the larger depth, and min is the total number of supporting reads at the site in the tissue with the smaller depth.
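The scaling described above can be sketched in a few lines of Python. This is only an interpretation of the description in this comment, not the actual MSIsensor source: the function name, the dictionary layout (repeat length mapped to supporting read count), and the example counts are all hypothetical.

```python
def normalize_site(normal: dict, tumor: dict):
    """Scale the lower-depth distribution at one site by max / min.

    Each argument maps a repeat length to its number of supporting reads.
    Returns new (normal, tumor) dictionaries; the inputs are not modified.
    """
    n_total = sum(normal.values())
    t_total = sum(tumor.values())
    # Nothing to scale if a sample has no reads or the depths already match.
    if n_total == 0 or t_total == 0 or n_total == t_total:
        return dict(normal), dict(tumor)
    factor = max(n_total, t_total) / min(n_total, t_total)
    if n_total < t_total:
        return {k: v * factor for k, v in normal.items()}, dict(tumor)
    return dict(normal), {k: v * factor for k, v in tumor.items()}


# Hypothetical site: normal has 50 supporting reads in total, tumor has 100.
normal_counts = {10: 40, 11: 10}
tumor_counts = {10: 60, 11: 30, 12: 10}
norm_n, norm_t = normalize_site(normal_counts, tumor_counts)
print(norm_n)  # normal counts scaled by 100 / 50 = 2
print(norm_t)  # tumor counts unchanged
```

After scaling, both distributions sum to the same total, so the shapes of the length distributions can be compared without the depth difference dominating.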

@guodudou

Thank you very much, this is very clear! I plan to extract the coverages of tumor and normal samples at all possible MS loci that are qualified for MSI calling, then see whether I need to adopt "coverage normalization". Do you have a suggestion about what range of coverage difference between normal and tumor is good for using "coverage normalization"? Thanks!
