Skip to content

Latest commit

 

History

History
342 lines (279 loc) · 24 KB

README.md

File metadata and controls

342 lines (279 loc) · 24 KB

Usage instructions

See also: Tutorial

Here is a hands-on tutorial on how to run tatajuba, and how to analyse its output. You can also download all files necessary for this example on example_tutorial.txz. It is archived with tar and xz, which means that after downloading you need to

tar Jxvf example_tutorial.txz

See also: Supplementary material for the manuscript

The file 211108.figures_snippy_comparison.ipynb contains the jupyter notebook for the updated Campylobacter and Bordetella homopolymer tract analyses initially for the manuscript https://doi.org/10.1101/2021.06.02.446710. You must also download 211108.supplementary_material.txz to obtain the output files, extra files, and the supplementary tables with the accession IDs of the fastq files used in this work.

See also: Installation instructions

The main README file contains installation instructions.

Usage

the basic help can be invoked with tatajuba -h, which will look something like

tatajuba 1.0.4
Compare histograms of homopolymeric tract lengths, within context.
The complete syntax is:

 tatajuba  [-h|--help] [-v|--version] [-p|--paired] [-b|--keep_bias] [-V|--vcf] [-k|--kmer={2,...,32}] [-m|--minsize={1,...,32}] [-i|--minreads=<int>] [-d|--maxdist=<int>] [-l|--leven=<int>] [-t|--nthreads=<int>]
-g|--gff=<genome.gff3|genome.gff3.gz> [-f|--fasta=<genome.fna|genome.fna.gz>] <fastq files> [<fastq files>]... [-o|--outdir=<file>]

  -h, --help                       print a longer help and exit
  -v, --version                    print version and exit
  -p, --paired                     paired end (pairs of) files
  -b, --keep_bias                  keep biased tracts, i.e. present only in reverse or only in forward strains (default=remove)
  -V, --vcf                        generate VCF files for each sample, around the HT regions (EXPERIMENTAL) (default=not to save)
  -k, --kmer={2,...,32}            kmer size flanking each side of homopolymer (default=25)
  -m, --minsize={1,...,32}         minimum homopolymer tract length to be compared (default=4)
  -i, --minreads=<int>             minimum number of reads for tract+context to be considered (default=5)
  -d, --maxdist=<int>              maximum distance between kmers of a flanking region to merge them into one context (default=1)
  -l, --leven=<int>                levenshtein distance between flanking regions to merge them into one context (after ref genome mapping)
  -t, --nthreads=<int>             suggested number of threads (default is to let system decide; I may not honour your suggestion btw)
  -g, --gff=<genome.gff3|genome.gff3.gz> reference genome file in GFF3, preferencially with sequence
  -f, --fasta=<genome.fna|genome.fna.gz> reference genome file in fasta format, if absent from GFF3
  <fastq files>                    fastq file with reads (weirdly, fasta also possible as long as contains all reads and not only contigs)
  -o, --outdir=<file>              output directory, or 'random' for generating random dir name (default=current dir '.')
'Context' is the pair of flanking k-mers. If you have paired end files, then their order should be strictly f1_R1, f1_R2, f2_R1 etc.
  that is, R1 and R2 should be consecutive. A GFF3 reference genome must also be supplied, and bwa will create a series of index files
  if tatajuba can find the DNA sequences at end of GFF3 file, after pragma '##FASTA'

You can also supply a fasta file, e.g. when you download GFF3 from NCBI it may lack the DNA genome at the end --- then downloading the
  associated fna may work. The internal library updates the GFF3 structure with provided fasta sequences (so a fasta file may
  overwrite DNA sequences present in the GFF3 file. If you modify the fasta file please delete the index files generated by the BWA library
  so that they can be regenerated with the updated information.

Notice that tatajuba creates files with same prefix and in same location as the GFF3 file, which may overwrite existing ones.
  On the other hand, as suggested above, you can recreate all the generated files by deleting them and running tatajuba again.

The generation of the VCF files is still experimental, but seems to be accepted by snpEff.

The default values are as following:
kmer= 25         minsize=  4     minread=  5     maxdist=  1

The minimum required information are the GFF3 file (assuming it has fasta sequences, as e.g. generated by prokka) and the reads. Usually the GFF3 file will not contain the sequence in fasta format, and thus you'll have to provide the fasta file with --fasta for the exact same GFF3, with matching sequence names.

I usually download the references from the RefSeq or Assembly databases on NCBI, both of which allow for downloading the genomic GFF and FASTA. The GFF3/fasta files are also discussed in the tutorial.

Also, if you have paired reads you need to add option -p and make sure the read files are given in order, with reads from same sample together: always sampleA_1.fq sampleA_2.fq sampleB_1.fq sampleB_2.fq but never sampleA_1.fq sampleB_1.fq sampleA_2.fq sampleB_2.fq. Usually a wildcard (*) works since the reverse/forward flags are to the right of the sample IDs in the file names; to make sure you can try

$ ls sample*

and see the order in which the files are shown.

To make the analysis more precise, you can both increase the k-mer size (which is the legth of the flanking regions a.k.a. contexts) with the option -k or --kmer, and increase the minimum coverage of a HT (the minimum mnumber of reads --minreads or -i). The number of threads can be decided automatically by the program (it uses all cores available), and it is good practice to define a directory where all output files will reside. Thus your command would be something like

tatajuba -p -g reference.gff -f reference.fasta -o outdir -k 28 -i 8 -V \
                             sampleA_forw.fastq.gz sampleA_rev.fastq.gz \
                             sampleB_forw.fastq.gz sampleB_rev.fastq.gz \
                             sampleC_forw.fastq.gz sampleC_rev.fastq.gz

This command also asks tatajuba to generate individual VCF files (-V). All output will go into the directory outdir/. You should never use the option -b (--keep_bias) unless you know what you are doing, since it will keep even the tracts that were sequenced only in one strand and never in the other (a strong indicator that the tract length may be spurious).

Output files

The files output by tatajuba are

file name description
per_sample_average_length.tsv feature matrix with average HT length
per_sample_proportional_coverage.tsv feature matrix of "proportional coverage depth" of HT
per_sample_modal_frequency.tsv feature matrix with histogram bar length of modal tract length
selected_tracts_annotated.tsv debug file with difference stats per tract, for tracts in annotated regions
selected_tracts_unknown.tsv debug file with difference stats per tract, for tracts outside annotated regions
tract_list.tsv list of all HTs found, even those constant across samples
variable_tracts.bed BED file with tract locations

The HTs are indexed internally with the identified tid_ followed by a zero-padded number (e.g. tid_00035). This identifier is consistent across files within the same run, but it changes between data sets. Therefore if you run tatajuba several times on the same data but using for instance different parameters, the tids may relate to distinct HTs.

The sample descriptors are a simplified version of the file names, where tatajuba removes the common suffix and prefix. That is, if all files are in the same directory structure, then this path will be recognised as the prefix, and if all files end with fastq.gz then this will be the suffix. Only the variable part of the fastq file names is kept. Furthermore, for paired samples (where we have e.g. file_1.fq.gz and file_2.fq.gz sequencially) tatajuba only keeps tract of the first file name. This means that (1) if output only refers to one of them in the pair, the same applies to the other; and (2) the _1 will likely also be part of the suffix and thus removed.

Some files can be quite large and we suggest compressing them afterwards.

All files have columns separated by tabs, and please pay attention to missing values (represented by consecutive tabs). The "location" of the HT within the contig/chromosome is the zero-based location of the beginning of the homopolymer, and is estimated if the HT cannot be found.

It is possible for more than one HT (as defined by its tid_) to share the same location, if tatajuba finds more than one set of HTs (i.e. tract+contexts) mapping to the same reference genome but which are deemed too dissimilar (by the edit distance between them). One example is the HT CCGGAAAAATTCC (track represented in bold, only fraction of context is shown) found in both the reference and some samples. However some samples may furthermore have a HT CCGGAAATAAATTCC, which maps (aligns optimally) to same HT as before, but which cannot be merged since the flanking region is quite different (and we have an insertion within the tract). Tatajuba conservatively treats them as distinct HTs sharing the same location in the reference.

per_sample_ feature matrix files

The feature matrix files (per_sample_) have all the same structure: the first row is the header with the column names, followed by the list of variable tracts, one per row, with values per sample (each column is a sample). The values try to represent the variability within each tract per sample, keeping in mind that for each HT and sample we store a histogram of lengths. These files only contain information about variable tracts, that is, those that differ across samples. If you are interested in all tracts including those which do not change across samples, please look at the tract_list.csv file.

The "average HT length" is the mean tract length over all valid HTs, to account for intrapopulation variability. It is typically close to the consensus length, as estimated by alignment/assembly software. The "proportional coverage depth" is a rough estimate of the coverage over this HT, and is the sum of all read segments belonging to this HT (i.e. tract plus flanking context) divided by the frequency of the most frequent context (i.e. a k-mer) over all HTs in the sample.

The "modal frequency" is the count of the most frequent tract lenght divided by the frequency count of all lengths, for this HT. This gives an estimate of how concentrated the length distribution is around its modal value, and should be typically close to one.

It is important to stress that (1) these values are calculated independently for each sample (i.e. they do not have any normalisation over samples), and that (2) some variables are rescaled per sample, as the "proportional coverage depth", and some are rescaled per HT (per sample, naturally), as the "average length" and the "modal frequency".

BED file

The BED file variable_tracts.bed has the range information about the HTs which vary between samples (or in relation to the reference genomes). It describes the leftmost and rightmost regions in the reference mapped by the HT across samples and read segments, according to BWA-aln, discounting the context region. It represents the longest region in the reference where at least one sample had a tract mapped. This range relates only to the tract and not the context (flanking regions), since a flanking region can harbour another HT and we want to minimise the overlap. Overlaps between BED regions can still happen, though.

Notice that in some cases the HT cannot be found in the reference genome, even though we know it exists in the samples and it maps to a particular region in the reference genome. That is because we start by finding HTs in the samples, and not in the reference. One example is the HT CGC<b>AAAA</b>TGT mapping optimally to the region CGC<b>ATA</b>TGT. In this case the "average HT length" for the reference will be zero, but its row in the BED file will still map to the correct region (eventually with an extra base). The columns in the BED file are the usual chrom, chromStart, chromEnd, and name, where the last one is the tract ID tid_.

VCF files

The generation of the VCF is still experimental, but so far has been successful for observing the mutational effect of the HTs using snpEff (more on that below). The VCF file is minimalistic, since it does not have most information one would expect (for that, as we explain below, please use well-known, traditional tools). It is also not very economical in describing the changes, since it may include the right flanking region (if it contains SNPs for instance). I believe bcftools norm -m can fix this.

The VCF files are compressed with the zlib library (which produces standard gzipped files), but keep in mind that bcftools requires its own BGZF library, which is an extension of zlib. You can interconvert them with bgzip, and currently this is flagged as an issue, although I do not plan on importing the BGZF library to tatajuba. By the way tatajuba only generates gzipped files if the library can be found, otherwise it creates uncompressed, text files.

Also, the rows in the VCF file might be out of order (for overlapping HTs, for instance). This can be fixed with bcftools norm or by hand with something like

(head -n 5 in.vcf && tail -n +5 in.vcf | sort -nk 2) > out.file  # skipping first 5 or so lines that we don't want sorted

But again, even with these bcftools-related issues, the VCF files work with snpEff.

list of HTs

The file tract_list.tsv contains information regarding all tracts, even those that are constant across samples (i.e. tract in all samples have same length and very little variability within it). It contains, per row:

column description
tract_id tract ID, which can be compared across files for the same run
contig_name contig (or chromosome, genome) name, as given by fASTA file (and same as in GFF file)
feature_type feature type, as given by GFF file (if several are present, as in gene, then CD has priority)
feature freature name (e.g. gene name), identifier given in GFF file (or unannotated)
location_in_contig zero-based location of the beginning (leftmost) of the homopolymer; if homopolymer is not found in reference, then the midpoint location between samples is used
max_tract_length largest HT length across samples
ref_tract_length HT length in reference (zero if HT not found in reference)
tract string representation of HT, in same strand as in reference, of most similar amongst samples.
ref_tract string representation of HT in reference, or equivalent region if HT missing from reference.

The string representation of HT includes the left flanking region, the tract compressed to length one (i.e. just the base), followed by the right flanking region, as in:

TCCAATTCCGTATTCTCAACAGCTCCAA.G.CCAGCGGTACGTGCCACGCGTGCCCAAG

The HT in tract is the most similar amongst samples and reads, for that particular HT, and may not be the most common (highest coverage) for any sample. (See above for explanation of possibility of several HTs mapping to same region in reference). Thus in theory you may find a case where tract and ref_tract are identical in file tract_list.tsv, however the most common HTs are different for some samples (or even all samples, if the identical version is a "minor variant"). We call the flanking regions "context" in tatajuba (thus you may see both terms interchangeably used).

It is worth remembering that only HTs that are mapped to the reference genome (or genomes, contigs, or chromosomes) are analysed by tatajuba. Therefore we know that there is a genomic segment with significant similarity to the HT (as defined by the "left context

  • homopolymer + right context"). However, once we look at the corresponding (mapped) region in the reference genome, we might not find a perfect patch, or even the same homopolymer. Therefore when seaching for the exact location of the homopolymer in the reference, we include "monomers" (i.e. in the reference, the tract may have a length of one base).

selected_tracts_ files for debugging

These files have summary information about tracts over samples. They are currently used only for debug purposes, since they have the descriptive statistics between samples which are used to describe a tract as variable. Please do not rely on them, they might be removed in future versions. A short description of the columns (assumes familiarity with code details):

column description
tract_id tract ID
begin_context location in reference contig of the beginning of the context (not homopolymer, as usually); notice that the contig info is missing
n_genomes number of genomes where this tracat is found
lev_distance maximum edit (Levenstein) distance between read segments belonging to this HT
rd_frequency absolute difference (MAX-MIN) between modal frequencies
rd_avge_tract_length absolute difference between average tract lengths
rd_coverage absolute difference between proportional coverages (coverage is depth of most frequent context across the genome)
rd_context_covge absolute difference between average coverage per context (integral/n_context)
rd_entropy absolute difference between entropies

This output is under constant development, in order to be of value for end users — they do describe, for instance, the maximum differnce in average tract length, which can be used to infer if the samples do differ significantly, and if this differnce may imply in an amino-acid indel (if their lengths vary mostly by 3) or if they may lead to frameshifts (not multiples of 3). But care must be taken since currently these files are generated before tatajuba selects the variable tracts and before tatajuba searches for the tracts in the reference (which modifies the location). Furthermore, the contig/chromosome information is not output and thus the begin_context cannot be mapped unless the reference is comprised of a single FASTA entry. So if you really want to use the information in these files, it is better at the moment to merge its info with the other files, using the tract_id as primary key. But in all likelihood you won't need it.

Final comments on output

The most commonly used information from these files is probably from files per_sample_average_length.tsv (wich can be load and analysed through R) and variable_tracts.bed (to integrate tatajuba in bioinformatic pipelines). The selected_tracts_ files should not be used in most cases, unless you know what you are doing.

The variable tracts are those whose length or distribution depart from the reference, are variable enough between the samples, or are absent from at least one sample. Thus even if all tracts have same length as the reference, they may be marked as "variable" (thus appearing in BED file and per_sample_ tables). The VCF file will skip them, however, since there is no variant to report for that sample!

There is some important information which is output to the terminal, like how many HTs could be mapped to the reference genome. So it is suggested to capture this putput to a file:

# the command "nohup" detaches the program, and redirects the error messages to `nohup.out` by default 
nohup tatajuba -g GCF_000493495.1_TS_genomic.gff --fasta GCF_000493495.1_TS_genomic.fna.gz -o outdir ERR17* > output.txt &

In the example above, the file output.txt will contain messages like execution time, and number of HTs mapped and unmapped. Tatajuba also outputs some warning messages, usually about a sample with more unmapped HTs than mapped ones. This message is harmless to its execution, but it may indicate a missing/poor reference for this sample, or an excess of contamination, a mixture of strains or perhaps a plasmid?

Mutational effect

Tatajuba can partition the homopolymer tracts (HTs) into those within or outside a coding region, and furthermore for those inside a coding region the change in tract length can gives us an idea about the effect on the protein (multiples of three will lead to a missing amino acid, while other multiples will likely disrupt the downstream coding region).

If you are interested in exploring the functional effect of mutations, there are a few tools available: bcftools consequence calling, Ensembl Variant Effect Predictor (VEP), and snpEff. They all rely on the VCF file for the samples, together with the GFF3 and FASTA files for the reference genome.

With exception of bcftools csq, which supports only ENSEMBL GFF3 files, we recommend snpEff or VEP. For snpEff, see instructions on using your own GFF3 file from biostars and from its site, where you'll have to build a database (directory with reference files). You can download the configuration database we created for analysing the Campylobacter and Bordetella data sets, and use it as a starting point for your own analysis.

For VEP, instructions can be found at the ENSEMBL site — this software seems to work better with GFF3 provided through bp_genbank2gff3 (that is, converted from the full genbank file and removing the embedded FASTA entries at the end) than with GFF3 files downloaded directly from RefSeq...

The VCF files can be generated from the BAM/SAM files, i.e. using reference-based assembly programs like minimap2 or bwa. I particularly like snippy, which generates not only the BAM alignment and its corresponding snps.filt.vcf files, as it annotates the predicted effects with snpEff into the snps.vcf files. Although the directories from the individual samples have relevant information, please keep in mind that the "core SNPs" and further analyses from snippy-multi by design will exclude most information from the HTs (which are indels).

Tatajuba outputs a BED file with the genomic regions of interest (i.e. HTs variable across samples). Tatajuba cannot be used as a variant caller or aligner since (1) it does not use the full information available on the fastq files, focusing only on HTs and neglecting many SNPs; (2) it processes and splits the reads before mapping to the reference genome — in particular one read segment can belong to several HTs (as part of their flanking region, for instance), and one HT can contain variable segments (flanking regions and HT length).

Having said that, tatajuba does offer the possiblity of generating a minimalist VCF files for each sample, based on the HT results. It does not contain all variants, as explained above, and it does not use a pileup strategy (it uses the HT to "anchor" the sample to the reference), but it can be used to extract the variant effect around the HT regions (with the tools described above). The VCF files in tatajuba are still experimental, but we have been successful in using it as input to snpEff.

In any case, incorporating tatajuba in your pipeline will help pinpointing variable HTs across samples, excluding potentially artefactual ones. It also estimates how many of those are in coding and non-coding regions, from which their length differences w.r.t to the reference genome can be used as a proxy to estimate their translational effects.