Here is a hands-on tutorial on how to run tatajuba, and how to analyse its output.
You can also download all files necessary for this example on example_tutorial.txz.
It is archived with tar
and xz
, which means that after downloading you need to
tar Jxvf example_tutorial.txz
The file 211108.figures_snippy_comparison.ipynb contains the jupyter
notebook for the
updated Campylobacter and Bordetella homopolymer tract analyses initially for the manuscript
https://doi.org/10.1101/2021.06.02.446710.
You must also download 211108.supplementary_material.txz to obtain the output
files, extra files, and the supplementary tables with the accession IDs of the fastq
files used in this work.
The main README file contains installation instructions.
the basic help can be invoked with tatajuba -h
, which will look something like
tatajuba 1.0.4
Compare histograms of homopolymeric tract lengths, within context.
The complete syntax is:
tatajuba [-h|--help] [-v|--version] [-p|--paired] [-b|--keep_bias] [-V|--vcf] [-k|--kmer={2,...,32}] [-m|--minsize={1,...,32}] [-i|--minreads=<int>] [-d|--maxdist=<int>] [-l|--leven=<int>] [-t|--nthreads=<int>]
-g|--gff=<genome.gff3|genome.gff3.gz> [-f|--fasta=<genome.fna|genome.fna.gz>] <fastq files> [<fastq files>]... [-o|--outdir=<file>]
-h, --help print a longer help and exit
-v, --version print version and exit
-p, --paired paired end (pairs of) files
-b, --keep_bias keep biased tracts, i.e. present only in reverse or only in forward strains (default=remove)
-V, --vcf generate VCF files for each sample, around the HT regions (EXPERIMENTAL) (default=not to save)
-k, --kmer={2,...,32} kmer size flanking each side of homopolymer (default=25)
-m, --minsize={1,...,32} minimum homopolymer tract length to be compared (default=4)
-i, --minreads=<int> minimum number of reads for tract+context to be considered (default=5)
-d, --maxdist=<int> maximum distance between kmers of a flanking region to merge them into one context (default=1)
-l, --leven=<int> levenshtein distance between flanking regions to merge them into one context (after ref genome mapping)
-t, --nthreads=<int> suggested number of threads (default is to let system decide; I may not honour your suggestion btw)
-g, --gff=<genome.gff3|genome.gff3.gz> reference genome file in GFF3, preferencially with sequence
-f, --fasta=<genome.fna|genome.fna.gz> reference genome file in fasta format, if absent from GFF3
<fastq files> fastq file with reads (weirdly, fasta also possible as long as contains all reads and not only contigs)
-o, --outdir=<file> output directory, or 'random' for generating random dir name (default=current dir '.')
'Context' is the pair of flanking k-mers. If you have paired end files, then their order should be strictly f1_R1, f1_R2, f2_R1 etc.
that is, R1 and R2 should be consecutive. A GFF3 reference genome must also be supplied, and bwa will create a series of index files
if tatajuba can find the DNA sequences at end of GFF3 file, after pragma '##FASTA'
You can also supply a fasta file, e.g. when you download GFF3 from NCBI it may lack the DNA genome at the end --- then downloading the
associated fna may work. The internal library updates the GFF3 structure with provided fasta sequences (so a fasta file may
overwrite DNA sequences present in the GFF3 file. If you modify the fasta file please delete the index files generated by the BWA library
so that they can be regenerated with the updated information.
Notice that tatajuba creates files with same prefix and in same location as the GFF3 file, which may overwrite existing ones.
On the other hand, as suggested above, you can recreate all the generated files by deleting them and running tatajuba again.
The generation of the VCF files is still experimental, but seems to be accepted by snpEff.
The default values are as following:
kmer= 25 minsize= 4 minread= 5 maxdist= 1
The minimum required information are the GFF3 file (assuming it has fasta sequences, as e.g. generated by prokka
) and the reads. Usually the GFF3
file will not contain the sequence in fasta format, and thus you'll have to provide the fasta file with --fasta
for
the exact same GFF3, with matching sequence names.
I usually download the references from the RefSeq or Assembly databases on NCBI, both of which allow for downloading the genomic GFF and FASTA. The GFF3/fasta files are also discussed in the tutorial.
Also, if you have paired reads you need to add option -p
and make sure the read files are given in order, with reads
from same sample together: always sampleA_1.fq sampleA_2.fq sampleB_1.fq sampleB_2.fq
but never
sampleA_1.fq sampleB_1.fq sampleA_2.fq sampleB_2.fq
.
Usually a wildcard (*
) works since the reverse/forward flags are to the right of the sample IDs in the file names; to make sure you can try
$ ls sample*
and see the order in which the files are shown.
To make the analysis more precise, you can both increase the k-mer size (which is the legth of the flanking regions
a.k.a. contexts) with the option -k
or --kmer
, and increase the minimum coverage of a HT (the minimum mnumber of reads --minreads
or -i
).
The number of threads can be decided automatically by the program (it uses all cores available), and it is good practice
to define a directory where all output files will reside. Thus your command would be something like
tatajuba -p -g reference.gff -f reference.fasta -o outdir -k 28 -i 8 -V \
sampleA_forw.fastq.gz sampleA_rev.fastq.gz \
sampleB_forw.fastq.gz sampleB_rev.fastq.gz \
sampleC_forw.fastq.gz sampleC_rev.fastq.gz
This command also asks tatajuba to generate individual VCF files (-V
). All output will go into the directory outdir/
.
You should never use the option -b
(--keep_bias
) unless you know what you are doing, since it will keep even the
tracts that were sequenced only in one strand and never in the other (a strong indicator that the tract length may be
spurious).
The files output by tatajuba are
file name | description |
---|---|
per_sample_average_length.tsv |
feature matrix with average HT length |
per_sample_proportional_coverage.tsv |
feature matrix of "proportional coverage depth" of HT |
per_sample_modal_frequency.tsv |
feature matrix with histogram bar length of modal tract length |
selected_tracts_annotated.tsv |
debug file with difference stats per tract, for tracts in annotated regions |
selected_tracts_unknown.tsv |
debug file with difference stats per tract, for tracts outside annotated regions |
tract_list.tsv |
list of all HTs found, even those constant across samples |
variable_tracts.bed |
BED file with tract locations |
The HTs are indexed internally with the identified tid_
followed by a zero-padded number (e.g. tid_00035
). This
identifier is consistent across files within the same run, but it changes between data sets. Therefore if you run
tatajuba several times on the same data but using for instance different parameters, the tid
s may relate to distinct
HTs.
The sample descriptors are a simplified version of the file names, where tatajuba removes the common suffix and prefix.
That is, if all files are in the same directory structure, then this path will be recognised as the prefix, and if all
files end with fastq.gz
then this will be the suffix. Only the variable part of the fastq file names is kept.
Furthermore, for paired samples (where we have e.g. file_1.fq.gz
and file_2.fq.gz
sequencially) tatajuba only keeps
tract of the first file name. This means that (1) if output only refers to one of them in the pair, the same applies to the
other; and (2) the _1
will likely also be part of the suffix and thus removed.
Some files can be quite large and we suggest compressing them afterwards.
All files have columns separated by tabs, and please pay attention to missing values (represented by consecutive tabs). The "location" of the HT within the contig/chromosome is the zero-based location of the beginning of the homopolymer, and is estimated if the HT cannot be found.
It is possible for more than one HT (as defined by its tid_
) to share the same location, if tatajuba finds more than
one set of HTs (i.e. tract+contexts) mapping to the same reference genome but which are deemed too dissimilar (by the
edit distance between them). One example is the HT CCGGAAAAATTCC (track represented in bold, only fraction of
context is shown) found in both the reference and some samples. However some samples may furthermore have a HT CCGGAAATAAATTCC,
which maps (aligns optimally) to same HT as before, but which cannot be merged since the flanking region is quite different (and we have an insertion
within the tract). Tatajuba conservatively treats them as distinct HTs sharing the same location in the reference.
The feature matrix files (per_sample_
) have all the same structure: the first row is the header with the column names,
followed by the list of variable tracts, one per row, with values per sample (each column is a sample). The values try
to represent the variability within each tract per sample, keeping in mind that for each HT and sample we store a histogram of lengths.
These files only contain information about variable tracts, that is, those that differ across samples. If you are
interested in all tracts including those which do not change across samples, please look at the tract_list.csv
file.
The "average HT length" is the mean tract length over all valid HTs, to account for intrapopulation variability. It is typically close to the consensus length, as estimated by alignment/assembly software. The "proportional coverage depth" is a rough estimate of the coverage over this HT, and is the sum of all read segments belonging to this HT (i.e. tract plus flanking context) divided by the frequency of the most frequent context (i.e. a k-mer) over all HTs in the sample.
The "modal frequency" is the count of the most frequent tract lenght divided by the frequency count of all lengths, for this HT. This gives an estimate of how concentrated the length distribution is around its modal value, and should be typically close to one.
It is important to stress that (1) these values are calculated independently for each sample (i.e. they do not have any normalisation over samples), and that (2) some variables are rescaled per sample, as the "proportional coverage depth", and some are rescaled per HT (per sample, naturally), as the "average length" and the "modal frequency".
The BED file variable_tracts.bed
has the range information about the HTs which vary between samples (or in relation to
the reference genomes). It describes the leftmost and rightmost regions in the reference mapped by the HT across samples
and read segments, according to BWA-aln
, discounting the context region. It represents the longest region in the
reference where at least one sample had a tract mapped.
This range relates only to the tract and not the context (flanking regions), since a flanking region can harbour another
HT and we want to minimise the overlap.
Overlaps between BED regions can still happen, though.
Notice that in some cases the HT cannot be found in the reference genome, even though we know it exists in the samples
and it maps to a particular region in the reference genome. That is because we start by finding HTs in the samples, and not in the reference.
One example is the HT CGC<b>AAAA</b>TGT
mapping optimally to the region CGC<b>ATA</b>TGT
. In this case the "average HT length" for the reference
will be zero, but its row in the BED file will still map to the correct region (eventually with an extra base).
The columns in the BED file are the usual chrom
, chromStart
, chromEnd
, and name
, where the last one is the tract
ID tid_
.
The generation of the VCF is still experimental, but so far has been successful for observing the mutational effect of
the HTs using snpEff
(more on that below).
The VCF file is minimalistic, since it does not have most information one would expect (for that, as we explain below,
please use well-known, traditional tools).
It is also not very economical in describing the changes, since it may include the right flanking region (if it contains
SNPs for instance).
I believe bcftools norm -m
can fix this.
The VCF files are compressed with the zlib library (which produces standard gzipped files), but keep in mind that bcftools
requires its own BGZF library, which is an extension of zlib.
You can interconvert them with bgzip
, and currently this is flagged as an issue,
although I do not plan on importing the BGZF library to tatajuba.
By the way tatajuba only generates gzipped files if the library can be found, otherwise it creates uncompressed, text
files.
Also, the rows in the VCF file might be out of
order (for overlapping HTs, for instance). This can be fixed with bcftools norm
or by hand with something like
(head -n 5 in.vcf && tail -n +5 in.vcf | sort -nk 2) > out.file # skipping first 5 or so lines that we don't want sorted
But again, even with these bcftools-related issues, the VCF files work with snpEff
.
The file tract_list.tsv
contains information regarding all tracts, even those that are constant across samples (i.e.
tract in all samples have same length and very little variability within it). It contains, per row:
column | description |
---|---|
tract_id | tract ID, which can be compared across files for the same run |
contig_name | contig (or chromosome, genome) name, as given by fASTA file (and same as in GFF file) |
feature_type | feature type, as given by GFF file (if several are present, as in gene , then CD has priority) |
feature | freature name (e.g. gene name), identifier given in GFF file (or unannotated ) |
location_in_contig | zero-based location of the beginning (leftmost) of the homopolymer; if homopolymer is not found in reference, then the midpoint location between samples is used |
max_tract_length | largest HT length across samples |
ref_tract_length | HT length in reference (zero if HT not found in reference) |
tract | string representation of HT, in same strand as in reference, of most similar amongst samples. |
ref_tract | string representation of HT in reference, or equivalent region if HT missing from reference. |
The string representation of HT includes the left flanking region, the tract compressed to length one (i.e. just the base), followed by the right flanking region, as in:
TCCAATTCCGTATTCTCAACAGCTCCAA.G.CCAGCGGTACGTGCCACGCGTGCCCAAG
The HT in tract
is the most similar amongst samples and reads, for that particular HT, and may not be the most common
(highest coverage) for any sample. (See above for explanation of
possibility of several HTs mapping to same region in reference).
Thus in theory you may find a case where tract
and ref_tract
are identical in file tract_list.tsv
, however the most common HTs are different for
some samples (or even all samples, if the identical version is a "minor variant").
We call the flanking regions "context" in tatajuba (thus you may see both terms interchangeably used).
It is worth remembering that only HTs that are mapped to the reference genome (or genomes, contigs, or chromosomes) are analysed by tatajuba. Therefore we know that there is a genomic segment with significant similarity to the HT (as defined by the "left context
- homopolymer + right context"). However, once we look at the corresponding (mapped) region in the reference genome, we might not find a perfect patch, or even the same homopolymer. Therefore when seaching for the exact location of the homopolymer in the reference, we include "monomers" (i.e. in the reference, the tract may have a length of one base).
These files have summary information about tracts over samples. They are currently used only for debug purposes, since they have the descriptive statistics between samples which are used to describe a tract as variable. Please do not rely on them, they might be removed in future versions. A short description of the columns (assumes familiarity with code details):
column | description |
---|---|
tract_id | tract ID |
begin_context | location in reference contig of the beginning of the context (not homopolymer, as usually); notice that the contig info is missing |
n_genomes | number of genomes where this tracat is found |
lev_distance | maximum edit (Levenstein) distance between read segments belonging to this HT |
rd_frequency | absolute difference (MAX-MIN) between modal frequencies |
rd_avge_tract_length | absolute difference between average tract lengths |
rd_coverage | absolute difference between proportional coverages (coverage is depth of most frequent context across the genome) |
rd_context_covge | absolute difference between average coverage per context (integral/n_context ) |
rd_entropy | absolute difference between entropies |
This output is under constant development, in order to be of value for end users — they do describe, for instance, the
maximum differnce in average tract length, which can be used to infer if the samples do differ significantly, and if
this differnce may imply in an amino-acid indel (if their lengths vary mostly by 3) or if they may lead to frameshifts
(not multiples of 3). But care must be taken since currently these files are generated before tatajuba selects the
variable tracts and before tatajuba searches for the tracts in the reference (which modifies the location).
Furthermore, the contig/chromosome information is not output and thus the begin_context
cannot be mapped unless the reference
is comprised of a single FASTA entry.
So if you really want to use the information in these files, it is better at the moment to merge its info
with the other files, using the tract_id
as primary key.
But in all likelihood you won't need it.
The most commonly used information from these files is probably from files per_sample_average_length.tsv
(wich can be load and analysed through R) and
variable_tracts.bed
(to integrate tatajuba in bioinformatic pipelines). The selected_tracts_
files should not be
used in most cases, unless you know what you are doing.
The variable tracts are those whose length or distribution depart from the reference, are variable enough between the samples, or are absent from at least one sample.
Thus even if all tracts have same length as the reference, they may be marked as "variable" (thus appearing in BED file and per_sample_
tables).
The VCF file will skip them, however, since there is no variant to report for that sample!
There is some important information which is output to the terminal, like how many HTs could be mapped to the reference genome. So it is suggested to capture this putput to a file:
# the command "nohup" detaches the program, and redirects the error messages to `nohup.out` by default
nohup tatajuba -g GCF_000493495.1_TS_genomic.gff --fasta GCF_000493495.1_TS_genomic.fna.gz -o outdir ERR17* > output.txt &
In the example above, the file output.txt
will contain messages like execution time, and number of HTs mapped and
unmapped.
Tatajuba also outputs some warning messages, usually about a sample with more unmapped HTs than mapped ones. This
message is harmless to its execution, but it may indicate a missing/poor reference for this sample, or an excess of
contamination, a mixture of strains or perhaps a plasmid?
Tatajuba can partition the homopolymer tracts (HTs) into those within or outside a coding region, and furthermore for those inside a coding region the change in tract length can gives us an idea about the effect on the protein (multiples of three will lead to a missing amino acid, while other multiples will likely disrupt the downstream coding region).
If you are interested in exploring the functional effect of mutations, there are a few tools available: bcftools consequence calling, Ensembl Variant Effect Predictor (VEP), and snpEff. They all rely on the VCF file for the samples, together with the GFF3 and FASTA files for the reference genome.
With exception of bcftools csq
, which supports only ENSEMBL GFF3 files, we recommend snpEff
or VEP
.
For snpEff
, see instructions on using your own GFF3 file from biostars and from
its site, where you'll have to build a database
(directory with reference files).
You can download the configuration database we created for analysing the Campylobacter and
Bordetella data sets, and use it as a starting point for your own analysis.
For VEP, instructions can be found at the ENSEMBL site — this software seems to work better with GFF3 provided through bp_genbank2gff3 (that is, converted from the full genbank file and removing the embedded FASTA entries at the end) than with GFF3 files downloaded directly from RefSeq...
The VCF files can be generated from the BAM/SAM files, i.e. using reference-based assembly programs like minimap2 or bwa.
I particularly like snippy, which generates not only the BAM alignment and its corresponding snps.filt.vcf
files, as it annotates the predicted effects with
snpEff
into the snps.vcf
files.
Although the directories from the individual samples have relevant information,
please keep in mind that the "core SNPs" and further analyses from snippy-multi
by design will exclude most information from the HTs (which are indels).
Tatajuba outputs a BED file with the genomic regions of interest (i.e. HTs variable across samples).
Tatajuba cannot be used as a variant caller or aligner since (1) it does not use the full information available on
the fastq
files, focusing only on HTs and
neglecting many SNPs; (2) it processes and splits the reads before mapping to the reference genome — in particular
one read segment can belong to several HTs (as part of their flanking region, for instance), and one HT can contain
variable segments (flanking regions and HT length).
Having said that, tatajuba does offer the possiblity of generating a minimalist VCF files for each sample, based on the HT results. It does
not contain all variants, as explained above, and it does not use a pileup
strategy (it uses the HT to "anchor" the
sample to the reference), but it can be used to extract the variant effect around the HT regions (with the tools
described above).
The VCF files in tatajuba are still experimental, but we have been successful in using it as input to snpEff
.
In any case, incorporating tatajuba in your pipeline will help pinpointing variable HTs across samples, excluding potentially artefactual ones. It also estimates how many of those are in coding and non-coding regions, from which their length differences w.r.t to the reference genome can be used as a proxy to estimate their translational effects.