doppelmark is a high-performance tool for marking PCR and optical (pad-hopping) duplicate sequencing reads. It is functionally equivalent to the Picard and sambamba duplicate marking tools, but runs much more efficiently and takes advantage of multi-core hardware. For some workloads and hardware, doppelmark is 100x faster than Picard and 7x faster than sambamba.
doppelmark achieves its speedup by dividing the input into shards and processing the shards in parallel. Processing a shard covers decompressing its input, marking duplicates, and compressing the resulting output data. Duplicates are detected without sorting all records. For a detailed description of the algorithm and design, see doc.go.
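The shard-and-parallelize idea can be illustrated with a small Go sketch. This is not doppelmark's actual code: the `shard` type and the `splitShards`, `processShard`, and `runParallel` functions are hypothetical names invented for illustration, records are plain strings rather than BAM records, and the per-shard step only flags repeated strings in place of the real decompress, mark, and compress pipeline. It also ignores duplicates that span shard boundaries, which a real tool has to handle; see doc.go for how doppelmark does it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// shard is a hypothetical unit of work: a contiguous run of records.
type shard struct {
	index   int
	records []string
}

// splitShards divides records into shards of at most size records each.
func splitShards(records []string, size int) []shard {
	if size < 1 {
		size = 1
	}
	var shards []shard
	for start := 0; start < len(records); start += size {
		end := start + size
		if end > len(records) {
			end = len(records)
		}
		shards = append(shards, shard{index: len(shards), records: records[start:end]})
	}
	return shards
}

// processShard stands in for the per-shard pipeline. In this sketch a record
// is flagged as a duplicate if an identical record appeared earlier in the
// same shard.
func processShard(s shard) []bool {
	seen := make(map[string]bool)
	flags := make([]bool, len(s.records))
	for i, r := range s.records {
		flags[i] = seen[r]
		seen[r] = true
	}
	return flags
}

// runParallel processes shards on a fixed pool of workers and returns the
// per-shard results in shard order.
func runParallel(shards []shard, workers int) [][]bool {
	results := make([][]bool, len(shards))
	jobs := make(chan shard)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range jobs {
				// Each worker writes to a distinct slot, so no lock is needed.
				results[s.index] = processShard(s)
			}
		}()
	}
	for _, s := range shards {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	records := []string{"a", "b", "a", "c", "c", "d"}
	shards := splitShards(records, 3)
	results := runParallel(shards, runtime.NumCPU())
	fmt.Println(results) // [[false false true] [false true false]]
}
```

Indexing results by shard number keeps the output order deterministic regardless of which worker finishes first, which matters when the shards are later concatenated into a single output.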