This tutorial assumes a GNU/Linux system and that tatajuba is already installed. For installation instructions please refer to the README file in the entry page of the github repository.
We also assume that some tools are installed or that you can handle installing missing ones.
If you download the example_tutorial.txz file and expand it with
tar Jxvf example_tutorial.txz
it will generate a directory structure similar to the one below:
example_tutorial
├── data
│   ├── borde
│   │   ├── genes.gff
│   │   ├── sequences.fa
│   │   └── snpEffectPredictor.bin
│   └── campy
│       ├── genes.gff
│       ├── sequences.fa
│       └── snpEffectPredictor.bin
├── GCF_000148705.1_ASM14870v1_genomic.fna -> data/campy/sequences.fa
├── GCF_000148705.1_ASM14870v1_genomic.fna.amb
├── GCF_000148705.1_ASM14870v1_genomic.fna.ann
├── GCF_000148705.1_ASM14870v1_genomic.fna.bwt
├── GCF_000148705.1_ASM14870v1_genomic.fna.pac
├── GCF_000148705.1_ASM14870v1_genomic.fna.sa
├── GCF_000148705.1_ASM14870v1_genomic.gff -> data/campy/genes.gff
├── nohup.out
├── output_dir
│   ├── 1.vcf.gz
│   ├── 2.vcf.gz
│   ├── 4.vcf.gz
│   ├── 5.vcf.gz
│   ├── 7.vcf.gz
│   ├── per_sample_average_length.tsv
│   ├── per_sample_modal_frequency.tsv
│   ├── per_sample_proportional_coverage.tsv
│   ├── selected_tracts_annotated.tsv
│   ├── selected_tracts_unknown.tsv
│   ├── tract_list.tsv
│   └── variable_tracts.bed
├── reads
│   ├── ERR1701019.fastq
│   ├── ERR1701029.fastq
│   ├── ERR1701049.fastq
│   ├── ERR1701059.fastq
│   └── ERR1701079.fastq
├── snpEff.config
└── tutorial.md
Some files may be missing, such as those generated by BWA (GCF_000148705.1_ASM14870v1_genomic.fna.*) and the tatajuba output, but they can all be reconstructed. The file nohup.out is the screen output (stderr + stdout) of the program run. The data directory and the snpEff.config file are for snpEff. The reads folder contains (severely subsampled) read files to be used in this tutorial. The GCF_000148705.1_ASM14870v1_genomic.* files are the GFF and FASTA files used as the reference genome.
This data set is a small subset of the Campylobacter data used in the manuscript. The GFF3 and FASTA files are symbolic links to files in the subdirectories (which are used by snpEff).
Tatajuba incorporates the BWA library for reference mapping, which relies on index files generated from the FASTA file whenever they are missing. That is, tatajuba does not overwrite the .amb, .ann, .bwt etc. files every time it runs; therefore if you modify the FASTA file please delete these index files so that tatajuba can reconstruct them. These index files may be incompatible with the official BWA software, so please let tatajuba recreate them.
By the way, these index files are generated in the same directory as the FASTA file, so please make sure that the reference files are in a directory where you have write permission. The FASTA file is not needed if, and only if, the sequences are embedded within the GFF file, as is the case for prokka's output. Tatajuba will then generate a FASTA file from it (together with its index files).
For this tutorial, I have downloaded the references from the Assembly database on NCBI, since I need both the genomic GFF and FASTA files with corresponding names. The RefSeq database also provides this, as does, obviously, prokka output.
The GFF format talks about contigs or chromosomes, which correspond to the genomic FASTA sequences (genome or plasmid); thus tatajuba (and its documentation) uses these words interchangeably.
If you want to use more than one reference genome, you can simply concatenate the FASTA files into one. For the GFF files you may need to remove the first row (the header) and the last row (the ###) when concatenating the files, as sketched below. A ##sequence-region pragma is mandatory before each new contig/chromosome, as output for instance by prokka (e.g. ##sequence-region CP012149.1 1 1616662).
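A minimal sketch of this concatenation, assuming two hypothetical references named genome1 and genome2 (adjust the names to your own files):
# FASTA files can simply be concatenated
cat genome1.fna genome2.fna > combined.fna
# for the GFF files, drop the terminating '###' row of the first file and the
# '##gff-version' header row of the second one before joining them
{ sed '/^###$/d' genome1.gff; grep -v '^##gff-version' genome2.gff; } > combined.gff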
Our suggestion is to use sufficiently distinct genomes as references, for instance from distinct species. If the references are too similar, the diversity signal from their HTs is lost, since each homologue of the HT will be treated as a distinct HT.
As described in the installation instructions you may have installed tatajuba with conda or from the source code. You may even have both, in which case you need to find which one you are using:
ubuntu@local$ locate bin/tatajuba # show possible locations of the executable
/home/ubuntu/local/bin/tatajuba
/home/ubuntu/miniconda3/bin/tatajuba
ubuntu@local$ # show which executable is default (i.e. which one is called if you just type "tatajuba")
ubuntu@local$ which tatajuba
/home/ubuntu/miniconda3/bin/tatajuba
If you want to choose the executable explicitly, you only need to write its full path:
ubuntu@local$ /home/ubuntu/bin/tatajuba
Error when reading arguments from command line:
tatajuba: missing option -g|--gff=<genome.gff3|genome.gff3.gz>
tatajuba: missing option <fastq files>
tatajuba 1.0.4
Compare histograms of homopolymeric tract lengths, within context.
The complete syntax is:
tatajuba [-h|--help] [-v|--version] [-p|--paired] [-b|--keep_bias] [-V|--vcf] [-k|--kmer={2,...,32}] [-m|--minsize={1,...,32}] [-i|--minreads=<int>] [-d|--maxdist=<int>] [-l|--leven=<int>] [-t|--nthreads=<int>] -g|--gff=<genome.gff3|genome.gff3.gz> [-f|--fasta=<genome.fna|genome.fna.gz>] <fastq files> [<fastq files>]... [-o|--outdir=<file>]
-h, --help print a longer help and exit
-v, --version print version and exit
-p, --paired paired end (pairs of) files
-b, --keep_bias keep biased tracts, i.e. present only in reverse or only in forward strains (default=remove)
-V, --vcf generate VCF files for each sample, around the HT regions (EXPERIMENTAL) (default=not to save)
-k, --kmer={2,...,32} kmer size flanking each side of homopolymer (default=25)
-m, --minsize={1,...,32} minimum homopolymer tract length to be compared (default=4)
-i, --minreads=<int> minimum number of reads for tract+context to be considered (default=5)
-d, --maxdist=<int> maximum distance between kmers of a flanking region to merge them into one context (default=1)
-l, --leven=<int> levenshtein distance between flanking regions to merge them into one context (after ref genome mapping)
-t, --nthreads=<int> suggested number of threads (default is to let system decide; I may not honour your suggestion btw)
-g, --gff=<genome.gff3|genome.gff3.gz> reference genome file in GFF3, preferencially with sequence
-f, --fasta=<genome.fna|genome.fna.gz> reference genome file in fasta format, if absent from GFF3
<fastq files> fastq file with reads (weirdly, fasta also possible as long as contains all reads and not only contigs)
-o, --outdir=<file> output directory, or 'random' for generating random dir name (default=current dir '.')
In the example above it is complaining that you need to provide at least a GFF3 file and the fastq files with the samples' reads. So if we provide the GFF3 file and the reads, it will now complain that the sequences are missing from this particular GFF3 file:
ubuntu@local$ /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff reads/ERR17010*
[ error ] No fasta provided and GFF3 file doesn't contain sequences
You must provide a fasta file with reference genome sequence(s) that match the GFF3 features, or you should find a GFF3 file with a '##FASTA' section at the end.
[note to developers] If you want to debug me, set a breakpoint on function biomcmc_error()
We know this is a common situation. Luckily we also have the FASTA file (GCF_000148705.1_ASM14870v1_genomic.fna).
The reads used in this tutorial are single reads (and thus the -p option is not needed), and since they are a small subsample of the original files we will decrease the minimum HT coverage to 3 reads (-i 3). We also define the output directory to be output_dir and we ask tatajuba to generate VCF files (-V).
The full command would be:
ubuntu@local$ /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*
[warning] File 'output_dir/' already exists; I'll assume it's a directory. Contents will be overwritten
[warning] Decreasing number of threads to match number of samples
tatajuba 1.0.4
Reference genome fasta file: GCF_000148705.1_ASM14870v1_genomic.fna
Reference GFF3 file prefix: GCF_000148705.1_ASM14870v1_genomic
Output directory: output_dir/
Number of samples: 5 (single-end)
Max distance per flanking k-mer: 1
Levenshtein distance for merging: 2
Flanking k-mer size (context): 25
Min tract length to consider: 4
Min depth of tract lengths: 3
Remove biased tracts: yes
Number of threads (requested or optimised): 5
Assuming single-end samples
Read GFF3 reference genome in 0.088360 secs
processing file reads/ERR1701019.fastq
processing file reads/ERR1701059.fastq
processing file reads/ERR1701079.fastq
processing file reads/ERR1701049.fastq
processing file reads/ERR1701029.fastq
[bwa_aln_from_vector] 24.46399 sec
reads/ERR1701029.fastq: 21402 found and 7571 context+tracts were not found in reference
[bwa_aln_from_vector] 42.52276 sec
reads/ERR1701019.fastq: 47628 found and 3250 context+tracts were not found in reference
[bwa_aln_from_vector] 48.69241 sec
[bwa_aln_from_vector] 48.57682 sec
reads/ERR1701079.fastq: 51177 found and 2558 context+tracts were not found in reference
reads/ERR1701059.fastq: 54123 found and 1953 context+tracts were not found in reference
[bwa_aln_from_vector] 54.10467 sec
reads/ERR1701049.fastq: 52852 found and 2257 context+tracts were not found in reference
Modifying sample names: Removing prefix 'reads/ERR17010' from sample names.
Modifying sample names: Removing suffix '9.fastq' from sample names.
From 63473 tracts, 46969 interesting ones are annotated and 4058 interesting ones are not annotated
Will save VCF files with gzipped compression
Internal (threaded) timer:: 2.054656 secs to read and generate initial histograms
Internal (threaded) timer:: 59.129513 secs to merge and map histograms
Internal (threaded) timer:: 0.079142 secs to compare across sample genomes
Internal (threaded) timer:: 0.569195 secs to compare with reference
Non-threaded timing :: 16.337782 secs
In real analyses, it is suggested that the minimum read count and the k-mer (flanking region) size are increased, as in -i 8 -k 28 for instance.
You may furthermore want to run the software "detached" from the terminal with the command nohup, such that you can disconnect etc. while making sure the program will finish:
ubuntu@local$ nohup /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*
All output will then go to the file nohup.out instead of the terminal screen.
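You can then follow the run with standard tools, e.g.:
ubuntu@local$ tail -f nohup.out  # follow the program output as it is being written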
If you downloaded ("pulled") a docker container, you can run tatajuba with
ubuntu@local$ docker run -v `pwd`:`pwd` -w `pwd` quay.io/biocontainers/tatajuba:1.0.4--h5bf99c6_0 tatajuba \
-g GCF_000148705.1_ASM14870v1_genomic.gff -V -f GCF_000148705.1_ASM14870v1_genomic.fna \
-o output_dir -i 3 reads/ERR17010*
# replace "1.0.4--h5bf99c6_0" with the appropriate version
and equivalently if you have a singularity container you can run it with:
ubuntu@local$ singularity exec ~/Downloads/tatajuba1.0.4.sif tatajuba \
-g GCF_000148705.1_ASM14870v1_genomic.gff -V -f GCF_000148705.1_ASM14870v1_genomic.fna \
-o output_dir -i 3 reads/ERR17010*
Notice that, unlike docker, I'm providing the path to the container file I've built or downloaded, in this case ~/Downloads/tatajuba1.0.4.sif.
[warning] File 'output_dir/' already exists; I'll assume it's a directory. Contents will be overwritten
[warning] Decreasing number of threads to match number of samples
We are first warned that the output directory already exists and therefore its contents may be overwritten. Since we only have five samples and this machine has more than 5 cores, it adjusted the number of threads. This is followed by a description of the parameters used, after which it starts reading the read files in parallel (i.e. using several threads).
processing file reads/ERR1701029.fastq
[bwa_aln_from_vector] 24.46399 sec
reads/ERR1701029.fastq: 21402 found and 7571 context+tracts were not found in reference
Each sample (single or paired read files) will generate something like the 3 lines above: the first tells us it is reading the file(s), the second that it ran BWA-aln (this output is legacy from the original code), and the third how many HTs were found in the reference genomes and how many were not found (and thus will be discarded). If fewer than half of the detected HTs were mapped to the reference genomes, a warning is given:
[warning] 123423 out of 215558 (more than half) context+tracts were not found in reference for sample reads/ERR999999.fastq
In that case you may want to check whether you can add another reference for that sample, or whether you want to remove it (remembering that a tract not present in all samples is automatically considered "variable"). If you have paired reads, the descriptive line above would look something like
processing paired files ERR1701022_1.fastq.gz and ERR1701022_2.fastq.gz
This lets you verify that tatajuba is using the files in the desired order (a paired-end invocation is sketched below).
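For reference, a paired-end run would look something like the sketch below; the read file names are hypothetical, and we assume the files are taken two by two from the command line, so the glob should list each _1/_2 pair consecutively:
ubuntu@local$ /home/ubuntu/bin/tatajuba -p -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*_[12].fastq.gz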
Modifying sample names: Removing prefix 'reads/ERR17010' from sample names.
Modifying sample names: Removing suffix '9.fastq' from sample names.
From 63473 tracts, 46969 interesting ones are annotated and 4058 interesting ones are not annotated
Tatajuba recognises that all files share a common prefix and suffix, and thus removes them. The final samples are thus labelled 1, 2, 4, 5, and 7, which are the distinct parts of the names.
It also reports the total number of HTs, as well as the variable ones (using debug functions).
And finally it outputs the timings accumulated over the threads (i.e. how long it would take using a single thread) and the real, wall-clock timing.
The first thing we notice is that the BWA library generated a few index files based on the fasta file:
ubuntu@local$ ls -l GCF_000148705.1_ASM14870v1_genomic.fna*
lrwxrwxrwx 1 ubuntu ubuntu 23 Nov 8 09:47 GCF_000148705.1_ASM14870v1_genomic.fna -> data/campy/sequences.fa
-rw-r--r-- 1 ubuntu ubuntu 12 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.amb
-rw-r--r-- 1 ubuntu ubuntu 96 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.ann
-rw-r--r-- 1 ubuntu ubuntu 1616748 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.bwt
-rw-r--r-- 1 ubuntu ubuntu 404164 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.pac
-rw-r--r-- 1 ubuntu ubuntu 808376 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.sa
These files can be safely ignored; just keep in mind that although they are regenerated whenever they are missing, there are no clever checks:
- if you modify the FASTA file, delete all the index files (see the example after this list);
- if you install a new version of tatajuba it is safe to delete all index files so that the new version can recreate them;
- if you delete only some index files, please delete all of them;
- if you cannot write new files in the directory where the fasta file is (for example a system-wide database folder), then move the fasta file to another location where you can.
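For instance, to delete all the index files for the reference used in this tutorial (tatajuba will regenerate them on the next run):
ubuntu@local$ rm -f GCF_000148705.1_ASM14870v1_genomic.fna.{amb,ann,bwt,pac,sa}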
The interesting output will be in the directory defined by --outdir:
ubuntu@local$ ls -l output_dir/
total 25876
-rw-r--r-- 1 ubuntu ubuntu 553 Nov 17 12:22 1.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 363 Nov 17 12:22 2.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 611 Nov 17 12:22 4.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 644 Nov 17 12:22 5.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 590 Nov 17 12:22 7.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 2889267 Nov 17 12:22 per_sample_average_length.tsv
-rw-r--r-- 1 ubuntu ubuntu 2889264 Nov 17 12:22 per_sample_modal_frequency.tsv
-rw-r--r-- 1 ubuntu ubuntu 3333041 Nov 17 12:22 per_sample_proportional_coverage.tsv
-rw-r--r-- 1 ubuntu ubuntu 4646987 Nov 17 12:22 selected_tracts_annotated.tsv
-rw-r--r-- 1 ubuntu ubuntu 324805 Nov 17 12:22 selected_tracts_unknown.tsv
-rw-r--r-- 1 ubuntu ubuntu 10453789 Nov 17 12:22 tract_list.tsv
-rw-r--r-- 1 ubuntu ubuntu 1921778 Nov 17 12:22 variable_tracts.bed
First we have the VCF files, one per sample, in standard gzip format (not the BGZF variant needed by bcftools).
ubuntu@local$ zcat output_dir/1.vcf.gz | head
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=TID,Number=A,Type=String,Description="tract ID">
##contig=<ID=NC_017280.1,length=1616648>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
NC_017280.1 74439 . T TT . . TID=tid_002700 GT 1
NC_017280.1 78446 . T TT . . TID=tid_002873 GT 1
NC_017280.1 87748 . T TT . . TID=tid_003213 GT 1
NC_017280.1 243121 . A AA . . TID=tid_009324 GT 1
NC_017280.1 254707 . A AA . . TID=tid_009755 GT 1
These VCF files are restricted to the homopolymeric tracts (not all variants), and the VCF rows only report locations where the sample differs from the reference in the HT region. In general, therefore, the VCF files contain far fewer HTs than the total number of variable ones. In this example data set, with its low-coverage fastq files, the VCF files will be quite short, since only a fraction of the true HTs are recovered and most of them have the same length as the reference.
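A quick way of seeing how short they are, using only standard shell tools, is to count the non-header rows in each VCF file:
ubuntu@local$ for i in output_dir/*.vcf.gz; do echo -n "$i: "; zcat "$i" | grep -vc '^#'; done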
In an extreme case, even if there is no HT length variability at all and all samples' HTs have the same length as the reference, we can still have (1) a high number of reported variable HTs, and (2) a large number of HTs overall:
- an HT that is present in some samples and absent from others is still a variable HT: this list can be found in the per_sample_* files and in the variable_tracts.bed file;
- the full list of HTs (file tract_list.tsv) describes HTs that are found in at least one sample; however, no VCF file will describe them.
This is confirmed by looking at how many tracts were found in total, including or not invariant ones:
ubuntu@local$ wc -l output_dir/variable_tracts.bed output_dir/tract_list.tsv
51031 output_dir/variable_tracts.bed
63474 output_dir/tract_list.tsv
114505 total
The file variable_tracts.bed contains the list of variable tracts in BED format (zero-based locations of the first and last bases in the reference genomes spanned by each HT), and the tract_list.tsv file contains all HTs, variable or not. Thus we have 63k tracts in total, with most of them (51k) variable.
In our example we know that the missing HTs are due to low coverage, but tatajuba cannot distinguish HTs missing due to low coverage or failed QC from truly missing tracts (as in monomers).
The per_sample_* files give us feature matrices with individual statistics per HT for each sample. For a complete description of their columns please refer to the usage page.
If we peek into one of them, with the average tract length per sample, we will see that their occupancy is low (i.e. most tracts are missing from at least one sample):
ubuntu@local$ head output_dir/per_sample_average_length.tsv
tract_id location feature reference 1 2 4 5 7
tid_000000 29 cds-WP_014516875.1 7 7.00 7.00 7.00
tid_000001 46 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000002 58 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000003 77 cds-WP_014516875.1 4 4.00 4.00
tid_000004 88 cds-WP_014516875.1 5 5.00 5.00 5.00 5.00
tid_000005 109 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000006 115 cds-WP_014516875.1 5 5.00 5.00 5.00 5.00
tid_000007 158 cds-WP_014516875.1 4 4.00 4.00 4.00 4.00
tid_000008 168 cds-WP_014516875.1 7 7.00 7.00 7.00 7.00
To see how these files can be further analysed, please take a look at this Jupyter notebook in R.
When we looked at the VCF file for sample 1 (originally the file ERR1701019.fastq), the first HT was
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
NC_017280.1 74439 . T TT . . TID=tid_002700 GT 1
This means that the HT tid_002700 describes a variant from T in the reference to TT in the sample; in other words, an insertion. By the way, the label tid_002700 makes sense only within this run.
If we want to see what this HT looks like in other samples:
ubuntu@local$ head -1 output_dir/per_sample_average_length.tsv && grep tid_002700 output_dir/per_sample_average_length.tsv
tract_id location feature reference 1 2 4 5 7
tid_002700 74435 unannotated 4 5.00 5.00 5.00
This means that the tract has length 4 in the reference, and length 5 in samples 1, 4, and 7. It is missing from samples 2 and 5.
Notice that the HT starts at position 74435 but the VCF reports the modification starting at site 74438 (74439 minus one, if we make it zero-based), which makes sense since the first 4 bases of the homopolymer match.
And what does it look like, in context?
ubuntu@local$ head -1 output_dir/tract_list.tsv && grep tid_002700 output_dir/tract_list.tsv
tract_id contig_name feature_type feature location_in_contig max_tract_length ref_tract_length tract ref_tract
tid_002700 NC_017280.1 nc unannotated 74435 5 4 GTATTTTTAGAGAATAAAATAATAG.T.ATAAATAAATTTTAAAAAATGTATA GTATTTTTAGAGAATAAAATAATAG.T.ATAAATAAATTTTAAAAAATGTATA
It is outside a gene (unannotated), and the most similar HT (with flanking regions) has the same DNA sequence as the reference (last two columns). The longest length amongst samples is 5, while the reference has length 4.
If we want to find HTs whose flanking regions differ from the reference, we can:
ubuntu@local$ gawk '{if ($8!=$9){print $0}}' output_dir/tract_list.tsv | head
tract_id contig_name feature_type feature location_in_contig max_tract_length ref_tract_length tract ref_tract
tid_000229 NC_017280.1 nc unannotated 6573 5 5 ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAAATTAATTATTTTTAT ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAATTTAATTATTTTTAT
tid_000230 NC_017280.1 nc unannotated 6582 7 6 ATTGATTATATTAATCTTTTTGATT.A.TTAATTATTTTTATAATTTTGCATA ATTGATTATATTAATCTTTTTGATT.A.TTTAATTATTTTTATAATTTTGCAT
tid_000231 NC_017280.1 nc unannotated 6596 5 5 TCTTTTTGATTAAAAAAATTAATTA.T.ATAATTTTGCATATTAAAATAAAAT TCTTTTTGATTAAAAAATTTAATTA.T.ATAATTTTGCATATTAAAATAAAAT
tid_000232 NC_017280.1 nc unannotated 6605 4 4 TTAAAAAAATTAATTATTTTTATAA.T.GCATATTAAAATAAAATCATTATAT TTAAAAAATTTAATTATTTTTATAA.T.GCATATTAAAATAAAATCATTATAT
tid_002152 NC_017280.1 CDS cds-CJM1_RS00215 60683 4 4 CAACCACAAAAGAAATAAGCCCTGC.A.CCAAATTGCACCGTTCCAAGCACCG CAACCACAAAAGAAATAAGCCCTGC.A.CCAAATTGCACCGTTCAAAGCACCG
tid_002469 NC_017280.1 CDS cds-CJM1_RS00245 68265 4 4 AAAATTTAACAGAAGAGTTAAATAT.A.GATAAAAAAAACCAAGATAGTTTAA AAAATTTAACAGAAGAGTTAAATAT.A.GATAAAAAAACCAAGATAGTTTAAA
tid_002698 NC_017280.1 nc unannotated 74413 5 5 CGATATAAATTATTTTTATATCGTA.T.AGAGAATAAAATAATAGTTTTTATA CGATATAAATTATTTTTATATCGTA.T.AGAGAATAAAATAATAGTTTTATAA
tid_002699 NC_017280.1 nc unannotated 74425 4 4 TTTTTATATCGTATTTTTAGAGAAT.A.TAATAGTTTTTATAAATAAATTTTA TTTTTATATCGTATTTTTAGAGAAT.A.TAATAGTTTTATAAATAAATTTTAA
tid_002701 NC_017280.1 nc unannotated 74448 4 4 TAAAATAATAGTTTTTATAAATAAA.T.AAAAAATGTATAAAATTTTAACCAA ATAAAATAATAGTTTTATAAATAAA.T.AAAAAATGTATAAAATTTTAACCAA
It is hard to visualise from these rows, but if we align the sample and reference for the first HT described (tid_000229), we have
ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAAATTAATTATTTTTAT (sample)
ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAATTTAATTATTTTTAT (reference)
and for the second (tid_000230):
ATTGATTATATTAATCTTTTTGATT.A. TTAATTATTTTTATAATTTTGCATA (sample)
ATTGATTATATTAATCTTTTTGATT.A.TTTAATTATTTTTATAATTTTGCAT (reference)
Here we can see that they overlap: the right context of tid_000229 contains the HT tid_000230, and the substitution can be seen there. Equivalently, the length change in tid_000230 (from 7 in a sample to 6 in the reference) can be described by the substitution from T to A we've seen in the previous HT.
As we mention in the README file, there are a few tools available for exploring the functional effect of mutations: bcftools consequence calling, vcf-annotator, the Ensembl Variant Effect Predictor (VEP), and snpEff.
Most rely on the VCF file for the samples, together with the GFF3 and FASTA files for the reference genome.
We recommend snpEff or VEP, since bcftools csq supports only ENSEMBL GFF3 files. vcf-annotator requires the GenBank file for the reference (instead of the GFF3) for annotations, but is an interesting alternative.
Ideally the VCF files should be generated from the BAM/SAM files, i.e. using reference-based assembly programs like minimap2 or bwa. Snippy not only generates the BAM alignment and its corresponding snps.filt.vcf file, but also annotates the predicted effects with snpEff into the snps.vcf files. However, keep in mind that the "core SNPs" and further analyses from snippy-multi will by design exclude most information from the HTs (which are indels).
Tatajuba generates a BED file which can be used to filter the regions harbouring homopolymeric tracts. It also, experimentally, outputs individual VCF files with minimal information (i.e. no quality information etc.). These VCF files work well with snpEff and bcftools, but remember that the gzip format output by tatajuba is not compatible with bcftools. Conversion is easy though, and if in doubt simply uncompress everything with gunzip *.gz; the VCF files are not big.
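A minimal sketch of the conversion, assuming bgzip (from htslib) and bcftools are available:
# plain gzip out, BGZF (the blocked variant required by bcftools) back in
gunzip output_dir/1.vcf.gz
bgzip output_dir/1.vcf
bcftools index output_dir/1.vcf.gz
# the BED file can then restrict any indexed, BGZF-compressed VCF to the HT regions;
# 'snps.vcf.gz' below is a hypothetical file, e.g. from snippy
bcftools view -R output_dir/variable_tracts.bed snps.vcf.gz > ht_only.vcf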
In this example we will describe how to use snpEff. For instructions on how to use your own GFF3 file, which is almost always the case, see this discussion and also the official documentation.
In a nutshell, you'll need a configuration file, which we call snpEff.config, with the locations of the databases. In our case the relevant lines look like
data.dir = ./data/
campy.genome: campy
borde.genome: borde
This means that there are two genome databases, called "campy" and "borde", which are in fact subdirectories below ./data/. As usual, we learned this by being one with the Torstenverse.
Within each of these directories you'll need one GFF file, named genes.gff, together with its corresponding FASTA file, called sequences.fa. If you have been playing with tatajuba you should already have these files 😉.
Then you tell snpEff to generate the database from these files:
zcat ../myfiles/GCF_000148705.1_ASM14870v1_genomic.gff.gz > data/campy/genes.gff
zcat ../myfiles/GCF_000148705.1_ASM14870v1_genomic.fna.gz > data/campy/sequences.fa
snpEff build -c snpEff.config -gff3 campy
In the example above I copy them from two zipped files in another directory, but if you downloaded the example data set then the files should already be there, with links in the current directory.
To actually run snpEff you need to tell it where the configuration file is, and which database you want to use (in our case, the freshly created "campy"):
snpEff ann -c snpEff.config campy concatenated.vcf > concatenated.annotated.vcf
In this example we are annotating all variants across all samples, since we are using the concatenated.vcf file (generated as described below).
In this example we assume you want to have a summary of all variants across all samples (to get all possible mutational effects, for instance). One alternative is to merge all samples into a multi-sample VCF file. However a simpler alternative is to concatenate all VCF samples by hand, removing the duplicate events:
zcat somefile.vcf.gz | head -n 5 > concatenated.vcf
for i in *.vcf.gz; do zcat "$i" | tail -n +6; done | sort -nk 2 | uniq >> concatenated.vcf
The numbers "5" and "6" above relate to the number of header lines (you nay need to check one file by eye,
or replace the head
by a grep
command).
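The grep-based alternative relies on all VCF header lines starting with '#', and would look something like:
zcat somefile.vcf.gz | grep '^#' > concatenated.vcf
for i in *.vcf.gz; do zcat "$i" | grep -v '^#'; done | sort -nk 2 | uniq >> concatenated.vcf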
You can also normalise the VCF files before concatenating them by hand.
In a VCF file, after the special keyword headers (starting with ##), there must be a line with the column descriptions. If you want to rename the sample, to highlight the fact that this is a dummy sample concatenating all samples, you can modify this line just before the body of the file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT all_samples
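A hypothetical sed one-liner for this renaming, which replaces the last field of the '#CHROM' row with 'all_samples':
sed -i '/^#CHROM/ s/[^[:space:]]*$/all_samples/' concatenated.vcf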
Tatajuba outputs one VCF file per sample, but sometimes we are interested in having one file with all variants from all samples together. This can be done with bcftools merge, which generates one multi-sample file from several individual ones. In the example below we assume that there are several vcf.gz files, together with the reference genome GCF_000148705.1_ASM14870v1_genomic.fna in FASTA format, in the same directory:
# uncompress the files since the standard zlib is not compatible with the expected BGZF format
gunzip *.vcf.gz
# normalise the files to make sure modifications are comparable
for i in *.vcf; do bcftools norm -f GCF_000148705.1_ASM14870v1_genomic.fna "$i" > "norm.$i"; done
# compress the files using the required BGZF format
for i in norm.*.vcf; do bgzip "$i"; done
# create an index for each file
for i in norm.*; do bcftools index "$i"; done
# merge the normalised files into one
bcftools merge -o bcf_merged.vcf norm.*.gz
In this example we also use bcftools norm to "fix" each VCF file (by left-aligning indels etc.).
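The merged file can then be annotated exactly as before, e.g.:
snpEff ann -c snpEff.config campy bcf_merged.vcf > bcf_merged.annotated.vcf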