Exploring pandora extra files

leoisl edited this page Oct 24, 2022 · 5 revisions

Mapping

For every sample that pandora maps reads to, it creates the files below in the sample directory. Together they should give enough information to understand exactly how pandora maps reads to PRGs. Each file is described here.

SAM (<sample>.filtered.sam)

This is a standard file format, and its specification can be found here. Each extra SAM field is explained in the header, so the explanation is omitted here. Some particularities that pandora assumes when creating the SAM file:

  1. The reference length (in @SQ header lines) and the POS field refer to the string representation of the PRGs. POS is 1-based;
  2. The only flags pandora might set are: 1) 0x4: segment unmapped; 2) 0x10: SEQ being reverse complemented. If 0x10 is set, not only is SEQ reverse complemented, but the left and right flanks (LF and RF fields) are also reverse complemented and swapped;
  3. The CIGAR string is composed only of hard-clipping (H), sequence match (=) and sequence mismatch (X) operations. All read bases before the first hit and after the last hit in the mapping are hard-clipped. Read bases covered by a hit are assigned =, otherwise X.
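
The conventions above can be checked with a small script. The following is a minimal sketch, not part of pandora: the function names (parse_cigar, summarise_cigar, decode_flag) are hypothetical, and it assumes the CIGAR contains nothing other than H, = and X operations, as stated above.

```python
import re

# Standard SAM flag bits; per the text, these are the only two pandora sets.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10

# Only H, = and X are expected in a pandora CIGAR.
CIGAR_OP = re.compile(r"(\d+)([H=X])")


def parse_cigar(cigar):
    """Return the CIGAR as a list of (length, operation) tuples."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to detect any operation other than H, = or X
    # (this also rejects "*", the SAM placeholder for a missing CIGAR).
    if not ops or "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unexpected CIGAR string: {cigar!r}")
    return ops


def summarise_cigar(cigar):
    """Count clipped, matching and mismatching read bases."""
    ops = parse_cigar(cigar)
    totals = {"H": 0, "=": 0, "X": 0}
    for length, op in ops:
        totals[op] += length
    return {
        "left_clip": ops[0][0] if ops[0][1] == "H" else 0,
        "right_clip": ops[-1][0] if len(ops) > 1 and ops[-1][1] == "H" else 0,
        "matches": totals["="],
        "mismatches": totals["X"],
    }


def decode_flag(flag):
    """Report the two flag bits that pandora might set."""
    return {
        "unmapped": bool(flag & FLAG_UNMAPPED),
        "reverse": bool(flag & FLAG_REVERSE),
    }


print(summarise_cigar("5H92=3H"))
print(decode_flag(16))
```

For a CIGAR such as 5H92=3H, this reports 5 hard-clipped bases on the left, 92 matching bases and 3 hard-clipped bases on the right.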

Below is an excerpt of a pandora SAM file:

@SQ	SN:GC00006032	LN:312
@SQ	SN:GC00010897	LN:491
@PG	ID:pandora	PN:pandora	VN:0.9.2	CL: ../cmake-build-debug-coverage/pandora compare --debugging-files --threads 1 --genotype -o out/output_toy_example_no_denovo out/prgs/pangenome.prg.fa reads/read_index.tsv 
@CO	The reference length (in @SQ header lines) and the POS field refer to the string representation of the PRGs
@CO	LF: left flank sequence, the sequence before the first mapped kmer, soft-clipped, max 30 bps
@CO	RF: right flank sequence, the sequence after the last mapped kmer, soft-clipped, max 30 bps
@CO	MP: number of minimizer matches on the plus strand
@CO	MM: number of minimizer matches on the minus strand
@CO	PP: Prg Paths of the cluster of hits: the PRG path of each hit in considered cluster of hits
@CO	NM: Total number of mismatches in the quasi-alignment
@CO	AS: Alignment score (number of matches)
@CO	nn: Number of ambiguous bases in the quasi-alignment
@CO	cm: Number of minimizers in the quasi-alignment
simulated_read_0	0	GC00010897	343	255	5H92=3H	*	0	0	TGGCACGGCATGGGGGAGGTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGC	*	LF:Z:CGATC	RF:Z:TGA	MP:i:12	MM:i:0	PP:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}->	NM:i:0	AS:i:92	nn:i:92	cm:i:12
simulated_read_1	0	GC00006032	161	255	11H86=3H	*	0	0	TGGCTAATCACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCACCAAAATGAAACACTGCGAC	*	LF:Z:TTCCGCCTCCC	RF:Z:ATT	MP:i:10	MM:i:0	PP:Z:3{[161, 169)[172, 173)[180, 186)}->2{[172, 173)[180, 194)}->1{[186, 201)}->1{[200, 215)}->1{[214, 229)}->1{[218, 233)}->1{[222, 237)}->3{[229, 237)[240, 241)[249, 255)}->1{[250, 265)}->2{[253, 267)[271, 272)}->	NM:i:0	AS:i:86	nn:i:86	cm:i:10
simulated_read_2	16	GC00006032	93	255	8H79=13H	*	0	0	AAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCT	*	LF:Z:CCCGCGCTTATAT	RF:Z:GTTTTTTA	MP:i:0	MM:i:11	PP:Z:1{[93, 108)}->1{[86, 101)}->1{[82, 97)}->1{[79, 94)}->1{[67, 82)}->1{[66, 81)}->1{[61, 76)}->1{[58, 73)}->1{[52, 67)}->1{[43, 58)}->1{[29, 44)}->	NM:i:0	AS:i:79	nn:i:79	cm:i:11
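
The PP field in the excerpt above can be split into one PRG path per hit. This is a minimal sketch with hypothetical helper names (parse_prg_paths, path_length). It assumes the format seen in the excerpt: paths separated by "->", each written as a leading integer followed by half-open intervals in braces. Judging from the excerpt only, the leading integer appears to be the number of intervals in the path.

```python
import re

# One path: a leading integer, then one or more "[start, end)" intervals in braces.
PATH = re.compile(r"(\d+)\{((?:\[\d+, \d+\))+)\}")
INTERVAL = re.compile(r"\[(\d+), (\d+)\)")


def parse_prg_paths(pp_value):
    """Return a list of paths, each a list of (start, end) half-open intervals."""
    paths = []
    for _count, body in PATH.findall(pp_value):
        paths.append([(int(s), int(e)) for s, e in INTERVAL.findall(body)])
    return paths


def path_length(intervals):
    """Number of PRG positions covered by one path."""
    return sum(end - start for start, end in intervals)


pp = "1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->"
for path in parse_prg_paths(pp):
    print(path, path_length(path))
```

In the excerpt, every path covers 15 positions in total, whether it is a single interval or split across several.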

Minimatches file (<sample>.minimatches)

This file is only produced with the --debugging-files option. It describes all minimizer matches pandora found between reads and PRGs. Below is an excerpt of a pandora .minimatches file. The columns are self-explanatory. read_start and read_end are 0-based:

kmer	read	read_start	read_end	read_strand	prg	prg_path	prg_strand
TGGCACGGCATGGGG	simulated_read_0	5	20	+	GC00010897	1{[343, 358)}	+
ACGGCATGGGGGAGG	simulated_read_0	9	24	+	GC00010897	3{[347, 360)[369, 370)[374, 375)}	+
TGCCGACCTCCCCCA	simulated_read_0	15	30	-	GC00010897	3{[353, 360)[369, 370)[374, 381)}	-
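
A .minimatches file can be loaded as a tab-separated table. The sketch below is illustrative only: read_minimatches and span_matches_kmer are hypothetical names, and the file is assumed to be tab-separated with the header line shown above. In the excerpt, read_end - read_start equals the kmer length, which suggests that read_end is exclusive. Treat that as an observation from the excerpt rather than a documented guarantee.

```python
import csv
import io

# The excerpt above, embedded so that the example is self-contained.
EXCERPT = (
    "kmer\tread\tread_start\tread_end\tread_strand\tprg\tprg_path\tprg_strand\n"
    "TGGCACGGCATGGGG\tsimulated_read_0\t5\t20\t+\tGC00010897\t1{[343, 358)}\t+\n"
    "ACGGCATGGGGGAGG\tsimulated_read_0\t9\t24\t+\tGC00010897\t3{[347, 360)[369, 370)[374, 375)}\t+\n"
    "TGCCGACCTCCCCCA\tsimulated_read_0\t15\t30\t-\tGC00010897\t3{[353, 360)[369, 370)[374, 381)}\t-\n"
)


def read_minimatches(handle):
    """Read a .minimatches table into a list of dicts with integer coordinates."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["read_start"] = int(row["read_start"])
        row["read_end"] = int(row["read_end"])
        rows.append(row)
    return rows


def span_matches_kmer(row):
    """True if the read interval is exactly as long as the kmer."""
    return row["read_end"] - row["read_start"] == len(row["kmer"])


rows = read_minimatches(io.StringIO(EXCERPT))
for row in rows:
    print(row["read"], row["read_start"], row["read_end"], span_matches_kmer(row))
```

To read a real file, pass an open file handle instead of the io.StringIO wrapper.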

Clusters definition report (<sample>.clusters_def_report)

This file is only produced with the --debugging-files option. This file describes which clusters pandora defined with respect to the minimizers it found, described in the minimatches file. Below is an excerpt of a pandora .clusters_def_report file:

read	prg	status	cluster_size	nb_of_repeated_mini	nb_of_unique_mini	length_based_threshold	min_cluster_size	distances_between_hits
simulated_read_0	GC00010897	accepted	12	0	12	1	10	4,6,10,3,12,8,8,3,5,11,7,
simulated_read_1	GC00006032	accepted	10	0	10	1	10	8,7,14,14,4,4,7,10,3,
simulated_read_2	GC00006032	accepted	11	0	11	1	10	7,4,3,12,1,5,3,6,9,14,

Non-self-explanatory column description:

  • status: whether the cluster was accepted or rejected. A cluster is rejected if nb_of_unique_mini < max(length_based_threshold, min_cluster_size);
  • nb_of_repeated_mini and nb_of_unique_mini: number of repeated and unique minimisers, respectively;
  • length_based_threshold: a threshold pandora calculates based on the PRG, read length, error rate, etc;
  • min_cluster_size: the parameter --min-cluster-size;
  • distances_between_hits: distance between minimizer hits in the read;
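
The rejection rule given for the status column can be written as a short function. This is a minimal sketch of the rule exactly as described above, not pandora's own code. The names cluster_status and parse_distances are hypothetical, and the distances column is assumed to be a comma-separated list with a trailing comma, as in the excerpt.

```python
def cluster_status(nb_of_unique_mini, length_based_threshold, min_cluster_size):
    """Apply the documented rule: reject if unique minimisers fall below the threshold."""
    threshold = max(length_based_threshold, min_cluster_size)
    return "rejected" if nb_of_unique_mini < threshold else "accepted"


def parse_distances(field):
    """Turn '4,6,10,' into [4, 6, 10], ignoring the trailing comma."""
    return [int(value) for value in field.split(",") if value]


# Values taken from the first row of the excerpt.
print(cluster_status(12, 1, 10))
print(parse_distances("4,6,10,3,12,8,8,3,5,11,7,"))
```

In the excerpt, each row has one fewer distance than cluster_size, which is what you would expect for distances between consecutive hits.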

Clusters filter report (<sample>.clusters_filter_report)

This file is only produced with the --debugging-files option. This file describes which clusters pandora kept or filtered out. This is a filter on the clusters described in the clusters definition report. Below is an excerpt of a pandora .clusters_filter_report file. The columns are self-explanatory:

read	prg	cluster_size	status
simulated_read_0	GC00010897	12	kept
simulated_read_1	GC00006032	10	kept
simulated_read_2	GC00006032	11	kept

Minimizer PAF (<sample>.minipaf)

This file is only produced with the --debugging-files option. It describes the minimizer alignment between a read and a PRG. PAF is a standard file format for alignments in base space, and its specification can be found here. The pandora .minipaf file is similar to the standard PAF, but it works in minimiser space and has fewer columns. Below is an excerpt of a pandora .minipaf file:

qname	qmlen	qmstart	qmend	qmstrands	prg	nmatch	alen	mapq
simulated_read_0	12	0	11	+	GC00010897	12	12	255
simulated_read_1	10	0	9	+	GC00006032	10	10	255
simulated_read_2	11	0	10	-	GC00006032	11	11	255

Non-self-explanatory column description:

  • qmlen: number of minimisers in the query sequence;
  • qmstart: query minimiser start coordinate (0-based);
  • qmend: query minimiser end coordinate (0-based);
  • qmstrands: + if query/target on the same strand; - if opposite; ? if can't be inferred;
  • nmatch: number of matching minimisers between query and prg in the alignment;
  • alen: total number of minimisers between query and prg in the alignment;
  • mapq: not available (always 255);
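
The columns described above can be combined into a few derived values. This is a sketch under assumptions: parse_minipaf_line, minimiser_span and match_fraction are hypothetical names, and the line is assumed to be tab-separated in the column order of the header shown above. In the excerpt, qmend - qmstart + 1 equals qmlen, which suggests that qmend is inclusive. That is an observation from the excerpt only.

```python
COLUMNS = ["qname", "qmlen", "qmstart", "qmend", "qmstrands",
           "prg", "nmatch", "alen", "mapq"]
INT_COLUMNS = {"qmlen", "qmstart", "qmend", "nmatch", "alen", "mapq"}


def parse_minipaf_line(line):
    """Parse one data line of a .minipaf file into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return {
        name: int(value) if name in INT_COLUMNS else value
        for name, value in zip(COLUMNS, fields)
    }


def minimiser_span(record):
    """Number of query minimisers covered, assuming qmend is inclusive."""
    return record["qmend"] - record["qmstart"] + 1


def match_fraction(record):
    """Fraction of minimisers in the alignment that match (nmatch / alen)."""
    return record["nmatch"] / record["alen"] if record["alen"] else 0.0


record = parse_minipaf_line("simulated_read_0\t12\t0\t11\t+\tGC00010897\t12\t12\t255")
print(minimiser_span(record), match_fraction(record))
```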

Discover

Racon input and output files

These files are only produced with the --debugging-files option. They are written to the denovo directory inside the sample directory. They are the input files to minimap2 and racon, and the racon output files:

  • <gene>.consensus.fa: the pandora consensus for the specified <gene>. This is the sequence to be corrected by racon;
  • <gene>.reads.fa: the segments of the reads, flanked by 2*k bases, that pandora inferred to map to the specified <gene>;
  • <gene>.minimap2.out.paf: the result of mapping <gene>.reads.fa to <gene>.consensus.fa using minimap2;
  • <gene>.consensus.racon.<X>_rounds.fa: the polished sequence obtained by giving <gene>.consensus.fa and <gene>.minimap2.out.paf to racon. <X> denotes the number of rounds racon was run;
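
When inspecting these files for one gene, it can help to build their expected paths from the naming patterns listed above. The helper below is a hypothetical convenience, not part of pandora. It assumes that <X> is written as a plain integer, and the directory and gene name in the example are illustrative.

```python
from pathlib import Path


def expected_racon_files(denovo_dir, gene, rounds):
    """Build the expected debugging file paths for one gene.

    The file name patterns come from the list above; `rounds` stands in
    for <X> and is assumed to be a plain integer.
    """
    base = Path(denovo_dir)
    return {
        "consensus": base / f"{gene}.consensus.fa",
        "reads": base / f"{gene}.reads.fa",
        "paf": base / f"{gene}.minimap2.out.paf",
        "polished": base / f"{gene}.consensus.racon.{rounds}_rounds.fa",
    }


# Illustrative values only: adjust the directory, gene and rounds to your run.
for label, path in expected_racon_files("sample/denovo", "GC00006032", 1).items():
    print(label, path, path.exists())
```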