Exploring pandora extra files
For every sample that pandora maps reads to, it creates all of these files in the sample directory. Together they should provide enough information to understand exactly how pandora maps reads to PRGs. Each file is described below.
SAM is a standard file format with a published specification. Each extra SAM field is explained in the header, so those explanations are omitted here. Some particularities that pandora assumes when creating the SAM file:
- The reference length (in @SQ header lines) and the POS field refer to the string representation of the PRGs. POS is 1-based;
- The only flags pandora might set are: 1) 0x4: segment unmapped; 2) 0x10: SEQ being reverse complemented. If 0x10 is set, then not only is SEQ reverse complemented, but the left and right flanks (LF and RF fields) are also reverse complemented and swapped;
- The CIGAR string is composed only of hard-clipping (H), sequence match (=) or sequence mismatch (X) operations. All read bases before the first hit and after the last hit in the mapping are hard-clipped. Read bases covered by a hit are assigned =, otherwise X.
Below is an excerpt of a pandora SAM file:
@SQ SN:GC00006032 LN:312
@SQ SN:GC00010897 LN:491
@PG ID:pandora PN:pandora VN:0.9.2 CL: ../cmake-build-debug-coverage/pandora compare --debugging-files --threads 1 --genotype -o out/output_toy_example_no_denovo out/prgs/pangenome.prg.fa reads/read_index.tsv
@CO The reference length (in @SQ header lines) and the POS field refer to the string representation of the PRGs
@CO LF: left flank sequence, the sequence before the first mapped kmer, soft-clipped, max 30 bps
@CO RF: right flank sequence, the sequence after the last mapped kmer, soft-clipped, max 30 bps
@CO MP: number of minimizer matches on the plus strand
@CO MM: number of minimizer matches on the minus strand
@CO PP: Prg Paths of the cluster of hits: the PRG path of each hit in considered cluster of hits
@CO NM: Total number of mismatches in the quasi-alignment
@CO AS: Alignment score (number of matches)
@CO nn: Number of ambiguous bases in the quasi-alignment
@CO cm: Number of minimizers in the quasi-alignment
simulated_read_0 0 GC00010897 343 255 5H92=3H * 0 0 TGGCACGGCATGGGGGAGGTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGC * LF:Z:CGATC RF:Z:TGA MP:i:12 MM:i:0 PP:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}-> NM:i:0 AS:i:92 nn:i:92 cm:i:12
simulated_read_1 0 GC00006032 161 255 11H86=3H * 0 0 TGGCTAATCACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCACCAAAATGAAACACTGCGAC * LF:Z:TTCCGCCTCCC RF:Z:ATT MP:i:10 MM:i:0 PP:Z:3{[161, 169)[172, 173)[180, 186)}->2{[172, 173)[180, 194)}->1{[186, 201)}->1{[200, 215)}->1{[214, 229)}->1{[218, 233)}->1{[222, 237)}->3{[229, 237)[240, 241)[249, 255)}->1{[250, 265)}->2{[253, 267)[271, 272)}-> NM:i:0 AS:i:86 nn:i:86 cm:i:10
simulated_read_2 16 GC00006032 93 255 8H79=13H * 0 0 AAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCT * LF:Z:CCCGCGCTTATAT RF:Z:GTTTTTTA MP:i:0 MM:i:11 PP:Z:1{[93, 108)}->1{[86, 101)}->1{[82, 97)}->1{[79, 94)}->1{[67, 82)}->1{[66, 81)}->1{[61, 76)}->1{[58, 73)}->1{[52, 67)}->1{[43, 58)}->1{[29, 44)}-> NM:i:0 AS:i:79 nn:i:79 cm:i:11
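The conventions above can be checked programmatically. The following is a minimal Python sketch, assuming tab-separated fields as in the SAM specification. The helper names `parse_cigar` and `describe_record`, and the shortened record `EXAMPLE`, are hypothetical and not part of pandora.

```python
import re

# pandora only emits hard clips (H), matches (=) and mismatches (X).
CIGAR_OP = re.compile(r"(\d+)([H=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs, rejecting other operations."""
    if cigar == "*":  # SAM placeholder for "no CIGAR available"
        return []
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unexpected CIGAR operation in {cigar!r}")
    return ops


def describe_record(line):
    """Summarise one tab-separated SAM alignment line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    ops = parse_cigar(fields[5])
    tags = {}
    for tag in fields[11:]:  # optional fields look like NAME:TYPE:VALUE
        name, kind, value = tag.split(":", 2)
        tags[name] = int(value) if kind == "i" else value
    return {
        "read": fields[0],
        "prg": fields[2],
        "pos": int(fields[3]),  # 1-based, on the PRG string representation
        "unmapped": bool(flag & 0x4),
        "reverse": bool(flag & 0x10),
        "matches": sum(n for n, op in ops if op == "="),
        "mismatches": sum(n for n, op in ops if op == "X"),
        "hard_clipped": sum(n for n, op in ops if op == "H"),
        "seq": fields[9],
        "tags": tags,
    }


# Hypothetical record, shortened for readability (not taken from real output).
EXAMPLE = "\t".join([
    "read_x", "16", "GENE_A", "10", "255", "2H4=1X3=1H",
    "*", "0", "0", "ACGTACGT", "*", "LF:Z:GG", "RF:Z:T", "NM:i:1",
])
```

Calling `describe_record(EXAMPLE)` reports a reverse-strand mapping whose match and mismatch counts add up to the length of SEQ, since hard-clipped bases are not part of SEQ.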
The .minimatches file is only produced with the --debugging-files option. It describes all minimizer matches pandora found between reads and PRGs. Below is an excerpt of a pandora .minimatches file. The columns are self-explanatory. read_start and read_end are 0-based:
kmer read read_start read_end read_strand prg prg_path prg_strand
TGGCACGGCATGGGG simulated_read_0 5 20 + GC00010897 1{[343, 358)} +
ACGGCATGGGGGAGG simulated_read_0 9 24 + GC00010897 3{[347, 360)[369, 370)[374, 375)} +
TGCCGACCTCCCCCA simulated_read_0 15 30 - GC00010897 3{[353, 360)[369, 370)[374, 381)} -
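The prg_path column uses a compact notation that is easier to work with once parsed. The Python sketch below rests on two assumptions inferred from the excerpt rather than from pandora's documentation: the leading number is the count of intervals, and each interval is half-open, as the `[start, end)` brackets suggest. The function names are hypothetical.

```python
import re

# Whole path, e.g. "3{[347, 360)[369, 370)[374, 375)}"
PATH_RE = re.compile(r"^(\d+)\{((?:\[\d+,\s*\d+\))+)\}$")
# One interval inside the braces, e.g. "[347, 360)"
INTERVAL_RE = re.compile(r"\[(\d+),\s*(\d+)\)")


def parse_prg_path(text):
    """Return the list of (start, end) intervals of a prg_path string."""
    match = PATH_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a prg_path: {text!r}")
    declared = int(match.group(1))  # assumed to be the interval count
    intervals = [(int(s), int(e)) for s, e in INTERVAL_RE.findall(match.group(2))]
    if declared != len(intervals):
        raise ValueError(f"expected {declared} intervals, found {len(intervals)}")
    return intervals


def path_length(intervals):
    """Number of PRG characters covered, treating intervals as half-open."""
    return sum(end - start for start, end in intervals)
```

Under these assumptions, each path in the excerpt covers 15 positions, which matches the length of the k-mers in the first column.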
The .clusters_def_report file is only produced with the --debugging-files option. It describes the clusters pandora defined from the minimizer matches listed in the minimatches file. Below is an excerpt of a pandora .clusters_def_report file:
read prg status cluster_size nb_of_repeated_mini nb_of_unique_mini length_based_threshold min_cluster_size distances_between_hits
simulated_read_0 GC00010897 accepted 12 0 12 1 10 4,6,10,3,12,8,8,3,5,11,7,
simulated_read_1 GC00006032 accepted 10 0 10 1 10 8,7,14,14,4,4,7,10,3,
simulated_read_2 GC00006032 accepted 11 0 11 1 10 7,4,3,12,1,5,3,6,9,14,
Non-self-explanatory column description:
- status: if the cluster was accepted or rejected. A cluster is rejected if nb_of_unique_mini < max(length_based_threshold, min_cluster_size);
- nb_of_repeated_mini and nb_of_unique_mini: number of repeated and unique minimizers, respectively;
- length_based_threshold: a threshold pandora calculates based on the PRG, read length, error rate, etc.;
- min_cluster_size: the parameter --min-cluster-size;
- distances_between_hits: distance between minimizer hits in the read;
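The rejection rule for the status column can be written out as a short Python sketch. It only restates the inequality given above and is not pandora's actual implementation. The function names are hypothetical, and the trailing comma in distances_between_hits is handled as it appears in the excerpt.

```python
def cluster_status(nb_of_unique_mini, length_based_threshold, min_cluster_size):
    """Apply the documented rule: reject when unique minimizers fall below the larger threshold."""
    threshold = max(length_based_threshold, min_cluster_size)
    return "rejected" if nb_of_unique_mini < threshold else "accepted"


def parse_distances(field):
    """Parse a distances_between_hits value such as '8,7,14,' (trailing comma allowed)."""
    return [int(value) for value in field.split(",") if value]
```

For the first row of the excerpt, `cluster_status(12, 1, 10)` gives `"accepted"`, in agreement with the status column. A cluster with exactly 10 unique minimizers is also accepted, because the comparison is strict.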
The .clusters_filter_report file is only produced with the --debugging-files option. It describes which clusters pandora kept or filtered out. This is a filter on the clusters described in the clusters definition report. Below is an excerpt of a pandora .clusters_filter_report file. The columns are self-explanatory:
read prg cluster_size status
simulated_read_0 GC00010897 12 kept
simulated_read_1 GC00006032 10 kept
simulated_read_2 GC00006032 11 kept
The .minipaf file is only produced with the --debugging-files option. It describes the minimizer alignment between a read and a PRG. In base space, PAF is a standard file format with a published specification. The pandora .minipaf file is similar to the standard PAF, but it works in minimizer space and has fewer columns. Below is an excerpt of a pandora .minipaf file:
qname qmlen qmstart qmend qmstrands prg nmatch alen mapq
simulated_read_0 12 0 11 + GC00010897 12 12 255
simulated_read_1 10 0 9 + GC00006032 10 10 255
simulated_read_2 11 0 10 - GC00006032 11 11 255
Non-self-explanatory column description:
- qmlen: number of minimizers in the query sequence;
- qmstart: query minimizer start coordinate (0-based);
- qmend: query minimizer end coordinate (0-based);
- qmstrands: + if query/target on the same strand; - if opposite; ? if it can't be inferred;
- nmatch: number of matching minimizers between query and prg in the alignment;
- alen: total number of minimizers between query and prg in the alignment;
- mapq: not available (always 255);
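A .minipaf row can be loaded into a small record type for further analysis. The Python sketch below assumes whitespace-separated columns in the order shown in the excerpt. It also assumes that qmend is inclusive, which is inferred from the excerpt (qmlen 12 with qmstart 0 and qmend 11) rather than stated. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MiniPafRecord:
    qname: str
    qmlen: int
    qmstart: int
    qmend: int
    qmstrands: str
    prg: str
    nmatch: int
    alen: int
    mapq: int

    @property
    def query_span(self):
        """Minimizers spanned on the query, assuming qmend is inclusive."""
        return self.qmend - self.qmstart + 1

    @property
    def match_fraction(self):
        """Fraction of aligned minimizers that match (0.0 for an empty alignment)."""
        return self.nmatch / self.alen if self.alen else 0.0


def parse_minipaf_line(line):
    """Parse one data row of a .minipaf file (not the header row)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    return MiniPafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4], fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
    )
```

For example, `parse_minipaf_line("simulated_read_2 11 0 10 - GC00006032 11 11 255")` yields a record on the opposite strand whose `match_fraction` is 1.0.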
The de novo debugging files are only produced with the --debugging-files option. They are written to the denovo directory inside the sample directory. They comprise the input files to minimap2 and racon, and the racon output files:
- <gene>.consensus.fa: the pandora consensus for the specified <gene>. This is the sequence to be corrected by racon;
- <gene>.reads.fa: the segments of the reads, flanked by 2*k bases, that pandora inferred that map to the specified <gene>;
- <gene>.minimap2.out.paf: the result of mapping <gene>.reads.fa to <gene>.consensus.fa using minimap2;
- <gene>.consensus.racon.<X>_rounds.fa: the resulting polished sequence by inputting <gene>.consensus.fa and <gene>.minimap2.out.paf to racon. <X> denotes the number of rounds racon was run;
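When inspecting these files for a particular gene, it can help to build the expected paths from the naming patterns above. The following Python sketch does only that. The function `denovo_debug_files` and its parameters are hypothetical, and the sample directory in the usage note is a placeholder.

```python
from pathlib import Path


def denovo_debug_files(sample_dir, gene, racon_rounds):
    """Map each de novo debugging file to its expected path for one gene."""
    denovo_dir = Path(sample_dir) / "denovo"
    names = {
        "consensus": f"{gene}.consensus.fa",
        "reads": f"{gene}.reads.fa",
        "paf": f"{gene}.minimap2.out.paf",
        "polished": f"{gene}.consensus.racon.{racon_rounds}_rounds.fa",
    }
    return {label: denovo_dir / name for label, name in names.items()}
```

For instance, `denovo_debug_files("out/sample", "GC00006032", 2)["polished"].name` evaluates to `GC00006032.consensus.racon.2_rounds.fa`. Whether such a file exists depends on the run, so check with `Path.exists()` before reading it.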