This tutorial assumes a GNU/Linux system and that tatajuba is already installed. For installation instructions please refer to the README file in the entry page of the github repository.
We also assume that some tools are installed or that you can handle installing missing ones.
If you download the example_tutorial.txz file and expand it with
tar Jxvf example_tutorial.txz
it will generate a directory structure similar to the one below:
example_tutorial
├── data
│   ├── borde
│   │   ├── genes.gff
│   │   ├── sequences.fa
│   │   └── snpEffectPredictor.bin
│   └── campy
│       ├── genes.gff
│       ├── sequences.fa
│       └── snpEffectPredictor.bin
├── GCF_000148705.1_ASM14870v1_genomic.fna -> data/campy/sequences.fa
├── GCF_000148705.1_ASM14870v1_genomic.fna.amb
├── GCF_000148705.1_ASM14870v1_genomic.fna.ann
├── GCF_000148705.1_ASM14870v1_genomic.fna.bwt
├── GCF_000148705.1_ASM14870v1_genomic.fna.pac
├── GCF_000148705.1_ASM14870v1_genomic.fna.sa
├── GCF_000148705.1_ASM14870v1_genomic.gff -> data/campy/genes.gff
├── nohup.out
├── output_dir
│   ├── 1.vcf.gz
│   ├── 2.vcf.gz
│   ├── 4.vcf.gz
│   ├── 5.vcf.gz
│   ├── 7.vcf.gz
│   ├── per_sample_average_length.tsv
│   ├── per_sample_modal_frequency.tsv
│   ├── per_sample_proportional_coverage.tsv
│   ├── selected_tracts_annotated.tsv
│   ├── selected_tracts_unknown.tsv
│   ├── tract_list.tsv
│   └── variable_tracts.bed
├── reads
│   ├── ERR1701019.fastq
│   ├── ERR1701029.fastq
│   ├── ERR1701049.fastq
│   ├── ERR1701059.fastq
│   └── ERR1701079.fastq
├── snpEff.config
└── tutorial.md
Some files may be missing, such as those generated by BWA (GCF_000148705.1_ASM14870v1_genomic.fna.*) and the tatajuba output, but they can all be reconstructed. The file nohup.out is the screen output (stderr + stdout) of the program run. The data directory and the snpEff.config file are for snpEff. The reads folder contains (severely subsampled) read files to be used in this tutorial. The GCF_000148705.1_ASM14870v1_genomic.* files are the GFF and FASTA files used as the reference genome.
This data set is a small subset of the Campylobacter data used in the manuscript. The GFF3 and FASTA files are symbolic links to files in the subdirectories (which are used by snpEff).
Tatajuba incorporates the BWA library for reference mapping, which relies on index files generated from the FASTA file whenever they are missing. That is, tatajuba does not overwrite the .amb, .ann, .bwt etc. files every time it runs; therefore if you modify the FASTA file please delete these index files so that tatajuba can reconstruct them. These index files may be incompatible with the official BWA software, so please let tatajuba recreate them.
By the way, these index files are generated in the same directory as the FASTA file, so please make sure that the reference files are in a directory where you have write permission. The FASTA file is not needed if, and only if, the sequences are embedded within the GFF file, as is the case for prokka's output. Tatajuba will then generate a FASTA file from it (together with its index files).
For this tutorial, I have downloaded the references from the Assembly database on NCBI, since I need both the genomic GFF and FASTA files with corresponding names. The RefSeq database also provides this, as does, obviously, prokka output.
The GFF format talks about contigs or chromosomes, which correspond to the genomic FASTA sequences (genome or plasmid); thus tatajuba (and its documentation) uses these words interchangeably.
If you want to use more than one reference genome, you can simply concatenate the FASTA files into one. For the GFF files you may need to remove the first row (the header) and the last row (the ###) when concatenating the files, as sketched below. A ##sequence-region pragma is mandatory before each new contig/chromosome, as output for instance by prokka (e.g. ##sequence-region CP012149.1 1 1616662).
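A minimal sketch of this concatenation, assuming two hypothetical references named genome1 and genome2 (adjust the names to your own files):
# FASTA files can simply be concatenated
cat genome1.fna genome2.fna > combined.fna
# for the GFF files, drop the terminating '###' row of the first file and the
# '##gff-version' header row of the second one before joining them
{ sed '/^###$/d' genome1.gff; grep -v '^##gff-version' genome2.gff; } > combined.gff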
Our suggestion is to use sufficiently distinct genomes as references, for instance from distinct species. If the references are too similar, the diversity signal from their HTs is lost, since each homologue of the HT will be treated as a distinct HT.
As described in the installation instructions you may have installed tatajuba with conda or from the source code. You may even have both, in which case you need to find which one you are using:
ubuntu@local$ locate bin/tatajuba # show possible locations of the executable
/home/ubuntu/local/bin/tatajuba
/home/ubuntu/miniconda3/bin/tatajuba
ubuntu@local$ # show which executable is default (i.e. which one is called if you just type "tatajuba")
ubuntu@local$ which tatajuba
/home/ubuntu/miniconda3/bin/tatajuba
If you want to choose the executable explicitly, you only need to write its full path:
ubuntu@local$ /home/ubuntu/bin/tatajuba
Error when reading arguments from command line:
tatajuba: missing option -g|--gff=<genome.gff3|genome.gff3.gz>
tatajuba: missing option <fastq files>
tatajuba 1.0.4
Compare histograms of homopolymeric tract lengths, within context.
The complete syntax is:
tatajuba [-h|--help] [-v|--version] [-p|--paired] [-b|--keep_bias] [-V|--vcf] [-k|--kmer={2,...,32}] [-m|--minsize={1,...,32}] [-i|--minreads=<int>] [-d|--maxdist=<int>] [-l|--leven=<int>] [-t|--nthreads=<int>] -g|--gff=<genome.gff3|genome.gff3.gz> [-f|--fasta=<genome.fna|genome.fna.gz>] <fastq files> [<fastq files>]... [-o|--outdir=<file>]
-h, --help print a longer help and exit
-v, --version print version and exit
-p, --paired paired end (pairs of) files
-b, --keep_bias keep biased tracts, i.e. present only in reverse or only in forward strains (default=remove)
-V, --vcf generate VCF files for each sample, around the HT regions (EXPERIMENTAL) (default=not to save)
-k, --kmer={2,...,32} kmer size flanking each side of homopolymer (default=25)
-m, --minsize={1,...,32} minimum homopolymer tract length to be compared (default=4)
-i, --minreads=<int> minimum number of reads for tract+context to be considered (default=5)
-d, --maxdist=<int> maximum distance between kmers of a flanking region to merge them into one context (default=1)
-l, --leven=<int> levenshtein distance between flanking regions to merge them into one context (after ref genome mapping)
-t, --nthreads=<int> suggested number of threads (default is to let system decide; I may not honour your suggestion btw)
-g, --gff=<genome.gff3|genome.gff3.gz> reference genome file in GFF3, preferencially with sequence
-f, --fasta=<genome.fna|genome.fna.gz> reference genome file in fasta format, if absent from GFF3
<fastq files> fastq file with reads (weirdly, fasta also possible as long as contains all reads and not only contigs)
-o, --outdir=<file> output directory, or 'random' for generating random dir name (default=current dir '.')
In the example above it is complaining that you need to provide at least a GFF3 file and the fastq files with the samples' reads. So if we provide the GFF3 file and the reads, it will now complain that the sequences are missing from this particular GFF3 file:
ubuntu@local$ /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff reads/ERR17010*
[ error ] No fasta provided and GFF3 file doesn't contain sequences
You must provide a fasta file with reference genome sequence(s) that match the GFF3 features, or you should find a GFF3 file with a '##FASTA' section at the end.
[note to developers] If you want to debug me, set a breakpoint on function biomcmc_error()
We know this is a common situation. Luckily we also have the FASTA file (GCF_000148705.1_ASM14870v1_genomic.fna).
The reads used in this tutorial are single reads (and thus the -p option is not needed), and since they are a small subsample of the original files we will decrease the minimum HT coverage to 3 reads (-i 3). We also define the output directory to be output_dir and we ask tatajuba to generate VCF files (-V).
The full command would be:
ubuntu@local$ /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*
[warning] File 'output_dir/' already exists; I'll assume it's a directory. Contents will be overwritten
[warning] Decreasing number of threads to match number of samples
tatajuba 1.0.4
Reference genome fasta file: GCF_000148705.1_ASM14870v1_genomic.fna
Reference GFF3 file prefix: GCF_000148705.1_ASM14870v1_genomic
Output directory: output_dir/
Number of samples: 5 (single-end)
Max distance per flanking k-mer: 1
Levenshtein distance for merging: 2
Flanking k-mer size (context): 25
Min tract length to consider: 4
Min depth of tract lengths: 3
Remove biased tracts: yes
Number of threads (requested or optimised): 5
Assuming single-end samples
Read GFF3 reference genome in 0.088360 secs
processing file reads/ERR1701019.fastq
processing file reads/ERR1701059.fastq
processing file reads/ERR1701079.fastq
processing file reads/ERR1701049.fastq
processing file reads/ERR1701029.fastq
[bwa_aln_from_vector] 24.46399 sec
reads/ERR1701029.fastq: 21402 found and 7571 context+tracts were not found in reference
[bwa_aln_from_vector] 42.52276 sec
reads/ERR1701019.fastq: 47628 found and 3250 context+tracts were not found in reference
[bwa_aln_from_vector] 48.69241 sec
[bwa_aln_from_vector] 48.57682 sec
reads/ERR1701079.fastq: 51177 found and 2558 context+tracts were not found in reference
reads/ERR1701059.fastq: 54123 found and 1953 context+tracts were not found in reference
[bwa_aln_from_vector] 54.10467 sec
reads/ERR1701049.fastq: 52852 found and 2257 context+tracts were not found in reference
Modifying sample names: Removing prefix 'reads/ERR17010' from sample names.
Modifying sample names: Removing suffix '9.fastq' from sample names.
From 63473 tracts, 46969 interesting ones are annotated and 4058 interesting ones are not annotated
Will save VCF files with gzipped compression
Internal (threaded) timer:: 2.054656 secs to read and generate initial histograms
Internal (threaded) timer:: 59.129513 secs to merge and map histograms
Internal (threaded) timer:: 0.079142 secs to compare across sample genomes
Internal (threaded) timer:: 0.569195 secs to compare with reference
Non-threaded timing :: 16.337782 secs
In real analyses, it is suggested that the minimum read count and the k-mer (flanking region) size are increased, as in -i 8 -k 28 for instance.
You may furthermore want to run the software "detached" from the terminal with the command nohup, such that you can disconnect etc. while making sure the program will finish:
ubuntu@local$ nohup /home/ubuntu/bin/tatajuba -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*
All output will then go to the file nohup.out instead of the terminal screen.
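You can then follow the run with standard tools, e.g.:
ubuntu@local$ tail -f nohup.out  # follow the program output as it is being written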
If you downloaded ("pulled") a docker container, you can run tatajuba with
ubuntu@local$ docker run -v `pwd`:`pwd` -w `pwd` quay.io/biocontainers/tatajuba:1.0.4--h5bf99c6_0 tatajuba \
-g GCF_000148705.1_ASM14870v1_genomic.gff -V -f GCF_000148705.1_ASM14870v1_genomic.fna \
-o output_dir -i 3 reads/ERR17010*
# replace "1.0.4--h5bf99c6_0" with the appropriate version
and equivalently if you have a singularity container you can run it with:
ubuntu@local$ singularity exec ~/Downloads/tatajuba1.0.4.sif tatajuba \
-g GCF_000148705.1_ASM14870v1_genomic.gff -V -f GCF_000148705.1_ASM14870v1_genomic.fna \
-o output_dir -i 3 reads/ERR17010*
Notice that, unlike docker, I'm providing the path to the container file I've built or downloaded, in this case ~/Downloads/tatajuba1.0.4.sif.
[warning] File 'output_dir/' already exists; I'll assume it's a directory. Contents will be overwritten
[warning] Decreasing number of threads to match number of samples
We are first warned that the output directory already exists and therefore its contents may be overwritten. Since we only have five samples and this machine has more than 5 cores, it adjusted the number of threads. This is followed by a description of the parameters used, after which it starts reading the read files in parallel (i.e. using several threads).
processing file reads/ERR1701029.fastq
[bwa_aln_from_vector] 24.46399 sec
reads/ERR1701029.fastq: 21402 found and 7571 context+tracts were not found in reference
Each sample (single or paired read files) will generate something like the 3 lines above: the first tells us it is reading the file(s), the second that it ran BWA-aln (this output is legacy from the original code), and the third how many HTs were found in the reference genomes and how many were not found (and thus will be discarded). If fewer than half of the detected HTs were mapped to the reference genomes, a warning is given:
[warning] 123423 out of 215558 (more than half) context+tracts were not found in reference for sample reads/ERR999999.fastq
In that case you may want to check whether you can add another reference for that sample, or whether you want to remove it (remembering that a tract not present in all samples is automatically considered "variable"). If you have paired reads, the descriptive line above would look something like
processing paired files ERR1701022_1.fastq.gz and ERR1701022_2.fastq.gz
This lets you verify that tatajuba is using the files in the desired order (a paired-end invocation is sketched below).
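For reference, a paired-end run would look something like the sketch below; the read file names are hypothetical, and we assume the files are taken two by two from the command line, so the glob should list each _1/_2 pair consecutively:
ubuntu@local$ /home/ubuntu/bin/tatajuba -p -g GCF_000148705.1_ASM14870v1_genomic.gff -V \
-f GCF_000148705.1_ASM14870v1_genomic.fna -o output_dir -i 3 reads/ERR17010*_[12].fastq.gz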
Modifying sample names: Removing prefix 'reads/ERR17010' from sample names.
Modifying sample names: Removing suffix '9.fastq' from sample names.
From 63473 tracts, 46969 interesting ones are annotated and 4058 interesting ones are not annotated
Tatajuba recognises that all files share a common prefix and suffix, and thus removes them. The final samples are thus labelled 1, 2, 4, 5, and 7, which are the distinct parts of the names.
It also reports the total number of HTs, as well as the variable ones (using debug functions).
And finally it outputs the timings accumulated over the threads (i.e. how long it would take using a single thread) and the real, wall-clock timing.
The first thing we notice is that the BWA library generated a few index files based on the fasta file:
ubuntu@local$ ls -l GCF_000148705.1_ASM14870v1_genomic.fna*
lrwxrwxrwx 1 ubuntu ubuntu 23 Nov 8 09:47 GCF_000148705.1_ASM14870v1_genomic.fna -> data/campy/sequences.fa
-rw-r--r-- 1 ubuntu ubuntu 12 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.amb
-rw-r--r-- 1 ubuntu ubuntu 96 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.ann
-rw-r--r-- 1 ubuntu ubuntu 1616748 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.bwt
-rw-r--r-- 1 ubuntu ubuntu 404164 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.pac
-rw-r--r-- 1 ubuntu ubuntu 808376 Nov 8 10:07 GCF_000148705.1_ASM14870v1_genomic.fna.sa
These files can be safely ignored; just keep in mind that although they are regenerated whenever they are missing, there are no clever checks:
- if you modify the FASTA file, delete all the index files (see the example after this list);
- if you install a new version of tatajuba it is safe to delete all index files so that the new version can recreate them;
- if you delete only some index files, please delete all of them;
- if you cannot write new files in the directory where the fasta file is (for example a system-wide database folder), then move the fasta file to another location where you can.
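For instance, to delete all the index files for the reference used in this tutorial (tatajuba will regenerate them on the next run):
ubuntu@local$ rm -f GCF_000148705.1_ASM14870v1_genomic.fna.{amb,ann,bwt,pac,sa}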
The interesting output will be in the directory defined by --outdir:
ubuntu@local$ ls -l output_dir/
total 25876
-rw-r--r-- 1 ubuntu ubuntu 553 Nov 17 12:22 1.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 363 Nov 17 12:22 2.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 611 Nov 17 12:22 4.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 644 Nov 17 12:22 5.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 590 Nov 17 12:22 7.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu 2889267 Nov 17 12:22 per_sample_average_length.tsv
-rw-r--r-- 1 ubuntu ubuntu 2889264 Nov 17 12:22 per_sample_modal_frequency.tsv
-rw-r--r-- 1 ubuntu ubuntu 3333041 Nov 17 12:22 per_sample_proportional_coverage.tsv
-rw-r--r-- 1 ubuntu ubuntu 4646987 Nov 17 12:22 selected_tracts_annotated.tsv
-rw-r--r-- 1 ubuntu ubuntu 324805 Nov 17 12:22 selected_tracts_unknown.tsv
-rw-r--r-- 1 ubuntu ubuntu 10453789 Nov 17 12:22 tract_list.tsv
-rw-r--r-- 1 ubuntu ubuntu 1921778 Nov 17 12:22 variable_tracts.bed
First we have the VCF files, one per sample, in standard gzip format (not the BGZF variant needed by bcftools).
ubuntu@local$ zcat output_dir/1.vcf.gz | head
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=TID,Number=A,Type=String,Description="tract ID">
##contig=<ID=NC_017280.1,length=1616648>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
NC_017280.1 74439 . T TT . . TID=tid_002700 GT 1
NC_017280.1 78446 . T TT . . TID=tid_002873 GT 1
NC_017280.1 87748 . T TT . . TID=tid_003213 GT 1
NC_017280.1 243121 . A AA . . TID=tid_009324 GT 1
NC_017280.1 254707 . A AA . . TID=tid_009755 GT 1
These VCF files are restricted to the homopolymeric tracts (not all variants), and the VCF rows only report locations where the sample differs from the reference in the HT region. In general, therefore, the VCF files contain far fewer HTs than the total number of variable ones. In this example data set, with its low-coverage fastq files, the VCF files will be quite short, since only a fraction of the true HTs are recovered and most of them have the same length as the reference.
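A quick way of seeing how short they are, using only standard shell tools, is to count the non-header rows in each VCF file:
ubuntu@local$ for i in output_dir/*.vcf.gz; do echo -n "$i: "; zcat "$i" | grep -vc '^#'; done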
In an extreme case, even if there is no HT length variability at all and all samples' HTs have the same length as the reference, we can still have (1) a high number of reported variable HTs, and (2) a large number of HTs overall:
- an HT that is present in some samples and absent from others is still a variable HT: this list can be found in the per_sample_* files and in the variable_tracts.bed file;
- the full list of HTs (file tract_list.tsv) describes HTs that are found in at least one sample; however, no VCF file will describe them.
This is confirmed by looking at how many tracts were found in total, including or not invariant ones:
ubuntu@local$ wc -l output_dir/variable_tracts.bed output_dir/tract_list.tsv
51031 output_dir/variable_tracts.bed
63474 output_dir/tract_list.tsv
114505 total
The file variable_tracts.bed contains the list of variable tracts in BED format (zero-based locations of the first and last bases in the reference genomes spanned by each HT), and the tract_list.tsv file contains all HTs, variable or not. Thus we have 63k tracts in total, with most of them (51k) variable.
In our example we know that the missing HTs are due to low coverage, but tatajuba cannot distinguish HTs missing due to low coverage or failed QC from truly missing tracts (as in monomers).
The per_sample_* files give us feature matrices with individual statistics per HT for each sample. For a complete description of their columns please refer to the usage page.
If we peek into one of them, with the average tract length per sample, we will see that their occupancy is low (i.e. most tracts are missing from at least one sample):
ubuntu@local$ head output_dir/per_sample_average_length.tsv
tract_id location feature reference 1 2 4 5 7
tid_000000 29 cds-WP_014516875.1 7 7.00 7.00 7.00
tid_000001 46 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000002 58 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000003 77 cds-WP_014516875.1 4 4.00 4.00
tid_000004 88 cds-WP_014516875.1 5 5.00 5.00 5.00 5.00
tid_000005 109 cds-WP_014516875.1 4 4.00 4.00 4.00
tid_000006 115 cds-WP_014516875.1 5 5.00 5.00 5.00 5.00
tid_000007 158 cds-WP_014516875.1 4 4.00 4.00 4.00 4.00
tid_000008 168 cds-WP_014516875.1 7 7.00 7.00 7.00 7.00
To see how these files can be further analysed, please take a look at this Jupyter notebook in R.
When we looked at the VCF file for sample 1 (originally the file ERR1701019.fastq), the first HT was
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
NC_017280.1 74439 . T TT . . TID=tid_002700 GT 1
This means that the HT tid_002700 describes a variant from T in the reference to TT in the sample; in other words, an insertion. By the way, the label tid_002700 makes sense only within this run.
If we want to see what this HT looks like in other samples:
ubuntu@local$ head -1 output_dir/per_sample_average_length.tsv && grep tid_002700 output_dir/per_sample_average_length.tsv
tract_id location feature reference 1 2 4 5 7
tid_002700 74435 unannotated 4 5.00 5.00 5.00
This means that the tract has length 4 in the reference, and length 5 in samples 1, 4, and 7. It is missing from samples 2 and 5.
Notice that the HT starts at position 74435 but the VCF reports the modification starting at site 74438 (74439 minus one, if we make it zero-based), which makes sense since the first 4 bases of the homopolymer match.
And what does it look like, in context?
ubuntu@local$ head -1 output_dir/tract_list.tsv && grep tid_002700 output_dir/tract_list.tsv
tract_id contig_name feature_type feature location_in_contig max_tract_length ref_tract_length tract ref_tract
tid_002700 NC_017280.1 nc unannotated 74435 5 4 GTATTTTTAGAGAATAAAATAATAG.T.ATAAATAAATTTTAAAAAATGTATA GTATTTTTAGAGAATAAAATAATAG.T.ATAAATAAATTTTAAAAAATGTATA
It is outside a gene (unannotated), and the most similar HT (with flanking regions) has the same DNA sequence as the reference (last two columns). The longest length amongst samples is 5, while the reference has length 4.
If we want to find HTs whose flanking regions differ from the reference, we can:
ubuntu@local$ gawk '{if ($8!=$9){print $0}}' output_dir/tract_list.tsv | head
tract_id contig_name feature_type feature location_in_contig max_tract_length ref_tract_length tract ref_tract
tid_000229 NC_017280.1 nc unannotated 6573 5 5 ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAAATTAATTATTTTTAT ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAATTTAATTATTTTTAT
tid_000230 NC_017280.1 nc unannotated 6582 7 6 ATTGATTATATTAATCTTTTTGATT.A.TTAATTATTTTTATAATTTTGCATA ATTGATTATATTAATCTTTTTGATT.A.TTTAATTATTTTTATAATTTTGCAT
tid_000231 NC_017280.1 nc unannotated 6596 5 5 TCTTTTTGATTAAAAAAATTAATTA.T.ATAATTTTGCATATTAAAATAAAAT TCTTTTTGATTAAAAAATTTAATTA.T.ATAATTTTGCATATTAAAATAAAAT
tid_000232 NC_017280.1 nc unannotated 6605 4 4 TTAAAAAAATTAATTATTTTTATAA.T.GCATATTAAAATAAAATCATTATAT TTAAAAAATTTAATTATTTTTATAA.T.GCATATTAAAATAAAATCATTATAT
tid_002152 NC_017280.1 CDS cds-CJM1_RS00215 60683 4 4 CAACCACAAAAGAAATAAGCCCTGC.A.CCAAATTGCACCGTTCCAAGCACCG CAACCACAAAAGAAATAAGCCCTGC.A.CCAAATTGCACCGTTCAAAGCACCG
tid_002469 NC_017280.1 CDS cds-CJM1_RS00245 68265 4 4 AAAATTTAACAGAAGAGTTAAATAT.A.GATAAAAAAAACCAAGATAGTTTAA AAAATTTAACAGAAGAGTTAAATAT.A.GATAAAAAAACCAAGATAGTTTAAA
tid_002698 NC_017280.1 nc unannotated 74413 5 5 CGATATAAATTATTTTTATATCGTA.T.AGAGAATAAAATAATAGTTTTTATA CGATATAAATTATTTTTATATCGTA.T.AGAGAATAAAATAATAGTTTTATAA
tid_002699 NC_017280.1 nc unannotated 74425 4 4 TTTTTATATCGTATTTTTAGAGAAT.A.TAATAGTTTTTATAAATAAATTTTA TTTTTATATCGTATTTTTAGAGAAT.A.TAATAGTTTTATAAATAAATTTTAA
tid_002701 NC_017280.1 nc unannotated 74448 4 4 TAAAATAATAGTTTTTATAAATAAA.T.AAAAAATGTATAAAATTTTAACCAA ATAAAATAATAGTTTTATAAATAAA.T.AAAAAATGTATAAAATTTTAACCAA
It is hard to visualise from these rows, but if we align the sample and reference for the first HT described (tid_000229), we have
ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAAATTAATTATTTTTAT (sample)
ATTTAAATAATTGATTATATTAATC.T.GATTAAAAAATTTAATTATTTTTAT (reference)
and for the second (tid_000230):
ATTGATTATATTAATCTTTTTGATT.A. TTAATTATTTTTATAATTTTGCATA (sample)
ATTGATTATATTAATCTTTTTGATT.A.TTTAATTATTTTTATAATTTTGCAT (reference)
Here we can see that they overlap: the right context of tid_000229 contains the HT tid_000230, and the substitution can be seen there. Equivalently, the length change in tid_000230 (from 7 in a sample to 6 in the reference) can be described by the substitution from T to A we've seen in the previous HT.
As we mention in the README file, there are a few tools available for exploring the functional effect of mutations: bcftools consequence calling, vcf-annotator, the Ensembl Variant Effect Predictor (VEP), and snpEff.
Most rely on the VCF file for the samples, together with the GFF3 and FASTA files for the reference genome.
We recommend snpEff or VEP, since bcftools csq supports only ENSEMBL GFF3 files. vcf-annotator requires the GenBank file for the reference (instead of the GFF3) for annotations, but is an interesting alternative.
Ideally the VCF files should be generated from the BAM/SAM files, i.e. using reference-based assembly programs like minimap2 or bwa. Snippy not only generates the BAM alignment and its corresponding snps.filt.vcf file, but also annotates the predicted effects with snpEff into the snps.vcf files. However, keep in mind that the "core SNPs" and further analyses from snippy-multi will by design exclude most information from the HTs (which are indels).
Tatajuba generates a BED file which can be used to filter the regions harbouring homopolymeric tracts. It also, experimentally, outputs individual VCF files with minimal information (i.e. no quality information etc.). These VCF files work well with snpEff and bcftools, but remember that the gzip format output by tatajuba is not compatible with bcftools. Conversion is easy though, and if in doubt simply uncompress everything with gunzip *.gz; the VCF files are not big.
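A minimal sketch of the conversion, assuming bgzip (from htslib) and bcftools are available:
# plain gzip out, BGZF (the blocked variant required by bcftools) back in
gunzip output_dir/1.vcf.gz
bgzip output_dir/1.vcf
bcftools index output_dir/1.vcf.gz
# the BED file can then restrict any indexed, BGZF-compressed VCF to the HT regions;
# 'snps.vcf.gz' below is a hypothetical file, e.g. from snippy
bcftools view -R output_dir/variable_tracts.bed snps.vcf.gz > ht_only.vcf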
In this example we will describe how to use snpEff. For instructions on how to use your own GFF3 file, which is almost always the case, see this discussion and also the official documentation.
In a nutshell, you'll need a configuration file, which we call snpEff.config, with the locations of the databases. In our case the relevant lines look like
data.dir = ./data/
campy.genome: campy
borde.genome: borde
This means that there are two genome databases, called "campy" and "borde", which are in fact subdirectories below ./data/. As usual, we learned this by being one with the Torstenverse.
Within each of these directories you'll need one GFF file, named genes.gff, together with its corresponding FASTA file, called sequences.fa. If you have been playing with tatajuba you should already have these files 😉.
Then you tell snpEff to generate the database from these files:
zcat ../myfiles/GCF_000148705.1_ASM14870v1_genomic.gff.gz > data/campy/genes.gff
zcat ../myfiles/GCF_000148705.1_ASM14870v1_genomic.fna.gz > data/campy/sequences.fa
snpEff build -c snpEff.config -gff3 campy
In the example above I copy them from two zipped files in another directory, but if you downloaded the example data set then the files should already be there, with links in the current directory.
To actually run snpEff you need to tell it where the configuration file is, and which database you want to use (in our case, the freshly created "campy"):
snpEff ann -c snpEff.config campy concatenated.vcf > concatenated.annotated.vcf
In this example we are annotating all variants across all samples, since we are using the concatenated.vcf file (generated as described below).
In this example we assume you want to have a summary of all variants across all samples (to get all possible mutational effects, for instance). One alternative is to merge all samples into a multi-sample VCF file. However a simpler alternative is to concatenate all VCF samples by hand, removing the duplicate events:
zcat somefile.vcf.gz | head -n 5 > concatenated.vcf
for i in *.vcf.gz; do zcat "$i" | tail -n +6; done | sort -nk 2 | uniq >> concatenated.vcf
The numbers "5" and "6" above relate to the number of header lines (you nay need to check one file by eye,
or replace the head
by a grep
command).
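The grep-based alternative relies on all VCF header lines starting with '#', and would look something like:
zcat somefile.vcf.gz | grep '^#' > concatenated.vcf
for i in *.vcf.gz; do zcat "$i" | grep -v '^#'; done | sort -nk 2 | uniq >> concatenated.vcf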
You can also normalise the VCF files before concatenating them by hand.
In a VCF file, after the special keyword headers (starting with ##), there must be a line with the column descriptions. If you want to rename the sample, to highlight the fact that this is a dummy sample concatenating all samples, you can modify this line just before the body of the file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT all_samples
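A hypothetical sed one-liner for this renaming, which replaces the last field of the '#CHROM' row with 'all_samples':
sed -i '/^#CHROM/ s/[^[:space:]]*$/all_samples/' concatenated.vcf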
Tatajuba outputs one VCF file per sample, but sometimes we are interested in having one file with all variants from all samples together. This can be done with bcftools merge, which generates one multi-sample file from several individual ones. In the example below we assume that there are several vcf.gz files, together with the reference genome GCF_000148705.1_ASM14870v1_genomic.fna in FASTA format, in the same directory:
# uncompress the files since the standard zlib is not compatible with the expected BGZF format
gunzip *.vcf.gz
# normalise the files to make sure modifications are comparable
for i in *.vcf; do bcftools norm -f GCF_000148705.1_ASM14870v1_genomic.fna "$i" > "norm.$i"; done
# compress the files using the required BGZF format
for i in norm.*.vcf; do bgzip "$i"; done
# create an index for each file
for i in norm.*; do bcftools index "$i"; done
# merge the normalised files into one
bcftools merge -o bcf_merged.vcf norm.*.gz
In this example we also use bcftools norm to "fix" each VCF file (by left-aligning indels etc.).
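The merged file can then be annotated exactly as before, e.g.:
snpEff ann -c snpEff.config campy bcf_merged.vcf > bcf_merged.annotated.vcf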