metaSNV supports variant (SNV) calling on metagenomic data and population genetic analysis, including subspecies identification and profiling. Input is metagenomic reads mapped against reference genomes (bam files).
- metaSNV v2 paper: Van Rossum, et al. 2021. metaSNV v2: detection of SNVs and subspecies in prokaryotic metagenomes. Bioinformatics.
- metaSNV v1 paper: Costea, et al. 2017. metaSNV: A tool for metagenomic strain level analysis. PLOS One.
See the full documentation for more details and an test example on this page below.
Create a new conda environment with metaSNV installed:
conda create --name metaSNV -c bioconda -c conda-forge 'metasnv>=2.0.1'
Install metaSNV in an existing conda environment:
conda install -c bioconda -c conda-forge 'metasnv>=2.0.1'
To test that all files and dependencies have been properly installed, run the following:
metaSNV.py --help
metaSNV_Filtering.py --help
metaSNV_DistDiv.py --help
metaSNV_subpopr.R --help
Refer to the developer guide.
We recommend using genomes from ProGenomes2 as your species references. The version provided here is a subset with one representative genome per species (the longest of the represenatatives).
To use the provided reference database, download the species genome reference fasta file (ref_db
) and the gene annotation file (db_ann
) with the following.
These files will take approx. 25 GB of space.
wget http://swifter.embl.de/~ralves/metaSNV_reference_data/progenomes2_speciesReps_genomes.fna
wget http://swifter.embl.de/~ralves/metaSNV_reference_data/progenomes2_speciesReps_annotations.txt
This summarises the general workflow. See the full documentation for details.
- 'all_samples' = a list of all BAM files, one /path/to/sample.bam per line (no duplicates, no empty mappings)
- 'ref_db' = the reference database in fasta format (i.e. multi-sequence fasta) and write permission for its directory
- 'db_ann' = [optional] a gene annotation file for the reference database (see documentation for custom files).
metaSNV.py output_dir/ all_samples ref_db [options]
Note: requires SNV calling (Part I) to be done
metaSNV_Filtering.py output_dir [options]
metaSNV_DistDiv.py --filt output_dir/filtered/pop [options]
Note: requires SNV calling, filtering, and distance calculations to be done (Parts I & II)
metaSNV_subpopr.R -i output_dir [options]
To determine abundances of subspecies relative to the whole community, you will also need to provide species abundance profiles. To determine subspecies-associated gene content, you will need to provide per-metagenome gene abundance profiles. See the full documentation for details.
This test example uses in silico generated data so that results compute quickly. There are 160 metagenomic samples with abundance of 3 species. The 'species' are called "refGenome1clus", "refGenome2clus", and "refGenome3clus", with 1, 2, and 3 subspecies, respectively. See the manual for details on the formats of the output files.
This will need 450M of disk space and, after the downloads are complete, should take approximately 5 minutes to run with 3 processors/threads or 15 minutes unparallelised.
wget http://swifter.embl.de/~ralves/metaSNV_test_data/testdata.tar.xz
tar xvf testdata.tar.xz && rm -f testdata.tar.xz
Run all steps of metaSNV v2 with the test data:
metaSNV.py --threads 3 output testdata/all_samples testdata/ref/allReferenceGenomes.fasta
or if running from source
python metaSNV.py --threads 3 output testdata/all_samples testdata/ref/allReferenceGenomes.fasta
To also detect whether SNVs result in codon changes, add --db_ann testdata/ref/allReferenceGenomes_geneAnno.txt
which will add to the SNV output: the gene within which the SNV was detected and the original and resultant codon. The file testdata/ref/allReferenceGenomes_geneAnno.txt
contains gene names and locations within the reference genomes. See the documentation for details.
Your SNVs are now in output/snpCaller/
.
- If you ran with >1 thread: Your SNVs are now in files named with the pattern:
output/snpCaller/called_SNPs.best_split_*
. If you ran with 3 threads, then each species will have it's own file and you should have 1080, 2094, and 3064 SNVs per file, one per line. - If you ran with 1 thread: Your SNVs are now in
output/snpCaller/called_SNPs
. You should have 6238 SNVs in this file, one per line.
metaSNV_Filtering.py --n_threads 3 output
or if running from source
python metaSNV_Filtering.py --n_threads 3 output
This command filtered your SNVs (with default paramters) and calculated allele frequencies. Your filtered SNV allele frequencies are now in the output/filtered/pop/
folder. Each species has its own file.
The number of SNVs expected per file are:
# SNVs | File |
---|---|
1023 | output/filtered/pop/refGenome1clus.filtered.freq |
2075 | output/filtered/pop/refGenome2clus.filtered.freq |
3061 | output/filtered/pop/refGenome3clus.filtered.freq |
metaSNV_DistDiv.py --n_threads 3 --filt output/filtered/pop --dist
or if running from source
python metaSNV_DistDiv.py --n_threads 3 --filt output/filtered/pop --dist
This command calculated pairwise dissimilarities between samples based on filtered SNV allele frequencies. Your filtered SNV allele frequencies are now in the output/distances
folder. Each species has its own file with 160 samples (161 lines with the header).
metaSNV_subpopr.R --procs 3 -i output -g testdata/abunds/geneAbundances.tsv -a testdata/abunds/speciesAbundances.tsv
or if running from source
Rscript metaSNV_subpopr.R --procs 3 -i output -g testdata/abunds/geneAbundances.tsv -a testdata/abunds/speciesAbundances.tsv
This command detected subspecies. The results are in the results/params.hr10.hs80.ps80.gs80/output/
folder. See the manual for a description of all the files produced.
A summary of the results is provided in results/params.hr10.hs80.ps80.gs80/output/resultsSummary.html
and results/params.hr10.hs80.ps80.gs80/output/summary_allResults.csv
.
An overview of the results per species is provided in the files named with pattern: refGenome*clus_detailedSpeciesReport.html
.
This will include PCoA plots illustrating the clustering and tables with the number of distinctive genes. For example, for in silico "species" "refGenome3clus":
See the full documentation for details on parameters, inputs, outputs, and the method.