whether to use deduplicated bam #29

coolbubu · 2019-01-03T02:56:16Z

When I ran MSIsensor, I found that the results differ considerably between the deduplicated BAM and the non-deduplicated BAM. Which BAM should be used: the deduplicated one or the non-deduplicated one?

not_dedeplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
9739	1501	15.41

dedeplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
8798	122	1.39
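For reference, the % column in both tables is consistent with Number_of_Somatic_Sites divided by Total_Number_of_Sites, times 100. A minimal Python sketch that checks this against the numbers above; the function name `msi_score` is hypothetical and not part of MSIsensor:

```python
def msi_score(total_sites: int, somatic_sites: int) -> float:
    """Percentage of somatic sites among all evaluated sites, to two decimals."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    return round(100.0 * somatic_sites / total_sites, 2)


print(msi_score(9739, 1501))  # non-deduplicated BAM
print(msi_score(8798, 122))   # deduplicated BAM
```

Both values match the reported percentages (15.41 and 1.39), so the gap comes from the drop in somatic sites (1501 to 122) rather than from the change in total sites.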

liangkaiye · 2019-01-03T02:58:57Z

What is the data coverage? Is it WGS, WES, or targeted sequencing?

coolbubu · 2019-01-03T03:14:15Z

It is WES; the mean coverage is 180 and the dup_ratio is 54.88%.

Beifang · 2019-01-07T08:58:08Z

How did you remove duplicates? The dup ratio looks very high.

micknudsen · 2019-02-04T09:18:24Z

I have noticed the same behavior and now routinely run msisensor on dedupped BAMs (obtained using samtools view -F 1024). The results are then much closer to the MSI status obtained by an orthogonal method.
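To make the `-F 1024` filter concrete: 1024 (0x400) is the SAM FLAG bit for "PCR or optical duplicate", and `-F` excludes reads that have any of the given bits set. Below is a rough Python sketch of the same test applied to SAM text lines. The helper names and the two example records are made up for illustration, and a real pipeline would use samtools itself rather than this.

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit in the SAM FLAG field


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Yield header lines and alignment lines whose duplicate bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_duplicate(flag):
            yield line


# Hypothetical records: the second read has the duplicate bit set (99 + 1024).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
]
kept = list(drop_duplicates(example))
print(len(kept))
```

Only the header and `read1` survive. A BAM with duplicates merely marked still contains `read2`, so a tool that does not check the flag will count it.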

ZhaoDanOnGitHub · 2019-03-01T15:02:37Z

You should use the deduplicated BAMs. In the end, you can get the correct results only by using the data that you think is the cleanest.

guodudou · 2019-04-11T14:51:12Z

I am wondering whether a BAM with marked duplicates is sufficient, or whether I have to export the deduplicated reads to a separate BAM. Thanks!

micknudsen · 2019-04-12T06:10:33Z

Marking duplicates is not sufficient. There is often a notable difference between using a BAM file with duplicates marked and with duplicates removed.

guodudou · 2019-05-13T15:30:01Z

Thank you very much for the quick response! In addition, there is a closed issue where people suggested using coverage normalization. I find that the score changes only slightly, but this is enough to classify samples with scores near the 3.5% cutoff differently. Do you have any suggestions? Many thanks!

Beifang · 2019-05-14T03:37:55Z

We suggest: MSI_H: msiscore >= 10%; MSI_L: 3.5% <= msiscore < 10%; MSS: msiscore < 3.5%
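The suggested cutoffs can be written as a small classifier. This is only a sketch of the thresholds stated above; the function name `classify_msi` is hypothetical and the score is assumed to be given in percent.

```python
def classify_msi(msiscore: float) -> str:
    """Map an msiscore (in percent) to MSI_H, MSI_L, or MSS."""
    if msiscore >= 10.0:
        return "MSI_H"
    if msiscore >= 3.5:
        return "MSI_L"
    return "MSS"


# The two scores reported at the top of this thread.
print(classify_msi(15.41))
print(classify_msi(1.39))
```

With these cutoffs the non-deduplicated result (15.41%) falls in MSI_H and the deduplicated result (1.39%) falls in MSS, which is why the choice of BAM matters here.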

guodudou · 2019-05-14T15:49:27Z

Thank you very much for the great information! Do you suggest using coverage normalization for normal and tumor samples? Thanks!

Beifang · 2019-05-15T03:45:23Z

We didn't normalize the TCGA UCEC data (msiscore: 3.5%) in the original version of MSIsensor. You can test with and without the normalization option. We suggest that you choose this option when the normal and tumor coverage are very different.

guodudou · 2019-09-13T18:26:04Z

Thank you very much! Can you please specify how you implement coverage normalization and/or how normalization affects the length distribution and MSI calling? This is very important to me because my samples are classified as MSI_H with normalization and MSI_L without it. Thanks!

ZhaoDanOnGitHub · 2019-09-15T11:10:01Z

A difference in sequencing depth between tumor tissue and normal tissue affects the judgment of whether a site is stable. Therefore, we normalize the read distributions so that their areas are of the same magnitude. Specifically, we compare the sequencing depth of the normal and tumor tissue at the site and correct the data with the smaller depth:
number of supporting reads after normalization = number of supporting reads * (max / min)
where max is the total number of supporting reads at the site in the tissue with the larger depth, and min is the total number of supporting reads at the site in the tissue with the smaller depth.
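The formula above could be sketched in Python as follows. This is one reading of the description, not MSIsensor's actual code: the function name, the list-per-repeat-length representation, and the example counts are all assumptions made for illustration.

```python
def normalize_site(normal_counts, tumor_counts):
    """Scale the shallower distribution at one site by max/min of total reads.

    Each argument lists supporting-read counts per repeat length.
    Returns (normal, tumor) as lists of floats with equal totals.
    """
    normal_total = sum(normal_counts)
    tumor_total = sum(tumor_counts)
    if normal_total == 0 or tumor_total == 0:
        raise ValueError("both samples need at least one supporting read")

    factor = max(normal_total, tumor_total) / min(normal_total, tumor_total)
    normal = [float(c) for c in normal_counts]
    tumor = [float(c) for c in tumor_counts]
    if normal_total < tumor_total:
        normal = [c * factor for c in normal]  # normal is the shallower sample
    elif tumor_total < normal_total:
        tumor = [c * factor for c in tumor]    # tumor is the shallower sample
    return normal, tumor


# Made-up site: 20 supporting reads in normal, 80 in tumor, so factor = 4.
normal, tumor = normalize_site([0, 5, 10, 5], [10, 20, 40, 10])
print(normal, tumor)
```

After scaling, both distributions have the same total, so they can be compared by shape rather than by depth.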

guodudou · 2019-09-16T15:38:05Z

Thank you very much, this is very clear! I plan to extract the coverage of the tumor and normal samples at all MS loci that qualify for MSI calling, and then see whether I need to adopt "coverage normalization". Can you suggest how large the coverage difference between normal and tumor should be before "coverage normalization" is worth using? Thanks!

This was referenced Nov 9, 2020

getting very different results for RunMsiSensor using duplicate-marked vs deduped bams mskcc/tempo#856

Closed

feature request: ignore duplicate-marked reads #57

Open

windtalker6 mentioned this issue Feb 12, 2022

inconsistent output between msisensor and msisensor2 niu-lab/msisensor2#28

Open
