Contig Assembly
Contig assembly is done with the mccortex31 contigs command, followed by the mccortex31 rmsubstr command to remove contigs that are substrings of other contigs.
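To make the substring-removal step concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the rmsubstr implementation: the function name `remove_substring_contigs` is hypothetical, and the sketch compares forward-strand strings only, ignoring reverse complements.

```python
def remove_substring_contigs(contigs):
    """Return the contigs that are not contained in any other contig.

    Illustrative sketch only: exact duplicates are collapsed, and only
    forward-strand containment is checked (no reverse complements).
    """
    # Longest first, so a contig only needs to be checked against the
    # contigs already kept (anything containing it is at least as long).
    unique = sorted(set(contigs), key=lambda s: (-len(s), s))
    kept = []
    for contig in unique:
        if not any(contig in longer for longer in kept):
            kept.append(contig)
    return kept


if __name__ == "__main__":
    example = ["ACGTACGT", "GTAC", "TTTT", "ACGTACGT"]
    print(remove_substring_contigs(example))
```

Sorting by length first keeps the check simple, at the cost of a quadratic number of substring tests; this is fine for a small example but would be too slow for a real assembly.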
Contig scaffolding is left as an exercise for the reader.
Contigs are given a confidence score between 0 and 1. This is one minus the probability of missing a read that would contradict the assembled contig. The score is conservative: it is based on the probability of missing a read during sequencing, under the assumption that the contig is incorrect and that the missing read would have revealed this. We get a confidence score for each side of a contig.
Genome size, along with mean read size (read from the .ctx/.ctp files), is used to calculate contig confidence. Genome size can be estimated from the number of kmers in the graph if most of your genome has copy number 1 (e.g. humans). If this is not the case, you can pass the genome size with the -G, --genome <G> argument.
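The sketch below illustrates why the kmer count is only a good proxy for genome size when most of the genome has copy number 1. It is a toy model under stated assumptions: the function name `count_distinct_kmers` is hypothetical, kmers are taken from the forward strand of plain strings, and nothing here reads .ctx files or reproduces how McCortex counts kmers.

```python
def count_distinct_kmers(sequences, k):
    """Count distinct forward-strand kmers across the given sequences.

    Toy stand-in for "number of kmers in the graph"; a real graph is
    built from reads, not from the finished genome sequence.
    """
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return len(kmers)


if __name__ == "__main__":
    unique_genome = "ACGTTGCA"           # every 3-mer occurs once
    repeated_genome = unique_genome * 2  # same content at copy number 2

    for genome in (unique_genome, repeated_genome):
        estimate = count_distinct_kmers([genome], k=3)
        print(len(genome), estimate)
```

For the single-copy sequence the kmer count is close to the true length, while for the duplicated sequence it is roughly half of it. This is the situation in which passing the genome size explicitly is the safer choice.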
You can use the -C, --confid-cumul <C> or -T, --confid-step <T> arguments to halt assembly early based on these contig confidence measures. We do not recommend using these arguments, as McCortex assembly is quite conservative, and contig mistakes are better identified in downstream steps such as mapping reads back onto your contigs.
In addition to a confidence score, we also get the reason the contig assembly stopped. This is one of:
- FailNoCovg - no coverage
- FailNoColCovg - coverage in population but graph forks and sample has no coverage in any nodes
- FailNoPaths - fork in sample and no paths
- FailSplitPaths - oldest paths split at fork
- FailMissingPaths - a fork where one node has no path information
- HitRepeat - we detected that we were stuck in an infinite loop. Infinite loop detection is probabilistic to improve speed and memory usage, so may be incorrect.
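When reviewing an assembly it can help to tally how often each stop reason occurs. The following Python sketch does this for a list of reason names. It assumes you have already extracted those names from the contig output; how they are encoded there is not described on this page, so no parsing is shown, and the names `STOP_REASONS` and `summarise_stop_reasons` are hypothetical.

```python
from collections import Counter

# Stop reasons and descriptions as listed above.
STOP_REASONS = {
    "FailNoCovg": "no coverage",
    "FailNoColCovg": "coverage in population but graph forks and "
                     "sample has no coverage in any nodes",
    "FailNoPaths": "fork in sample and no paths",
    "FailSplitPaths": "oldest paths split at fork",
    "FailMissingPaths": "a fork where one node has no path information",
    "HitRepeat": "stuck in an infinite loop (probabilistic detection)",
}


def summarise_stop_reasons(reasons):
    """Return (name, count, description) for each stop reason seen.

    `reasons` is an iterable of reason names, one per contig end.
    """
    counts = Counter(reasons)
    unknown = set(counts) - set(STOP_REASONS)
    if unknown:
        raise ValueError(f"unrecognised stop reasons: {sorted(unknown)}")
    return [(name, counts[name], STOP_REASONS[name])
            for name in STOP_REASONS if counts[name]]


if __name__ == "__main__":
    observed = ["HitRepeat", "FailNoCovg", "HitRepeat"]
    for name, count, description in summarise_stop_reasons(observed):
        print(f"{name}\t{count}\t{description}")
```

A high share of HitRepeat ends, for example, would suggest that repeats rather than missing coverage are limiting contig length.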