UniAmp (Unique Amplicon) is a computational pipeline to generate PCR primers specific to a target genome.
The UniAmp pipeline can be conceptually split into 4 parts:
- Build directory of query genomes with high sequence similarity to target genome.
- Retrieve unique sequences in a target genome compared to query genomes.
- Select unique target sequence for primer design.
- Design primers to unique target sequence.
UniAmp runs on Linux and requires basic Linux utilities, python3, and perl.
UniAmp contains wrappers around public bioinformatics software. The following dependencies are included with UniAmp as binaries in the bin folder and do not need to be installed:
* To implement rnammer in UniAmp scripts, the rnammer script included with UniAmp was modified as described here. rnammer also requires the HMMER2 command hmmsearch, so the binary for this command is included in the UniAmp bin folder.
* rnammer requires the perl XML::Simple module. If not already installed, the module can be installed using the command cpan install XML::Simple.
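The requirements above can be checked before running any UniAmp script. The following is a minimal sketch that only reports what is visible in the current shell; check_cmd and check_perl_module are hypothetical helper names and are not part of UniAmp.

```shell
# Sketch: report which UniAmp prerequisites are visible in the current shell.
# check_cmd and check_perl_module are hypothetical helpers, not UniAmp scripts.

check_cmd() {
  # Succeeds if the named command is on PATH.
  command -v "$1" >/dev/null 2>&1
}

check_perl_module() {
  # Succeeds if perl can load the named module (e.g. XML::Simple).
  perl -M"$1" -e 1 >/dev/null 2>&1
}

for cmd in python3 perl; do
  if check_cmd "$cmd"; then
    echo "found:   $cmd"
  else
    echo "missing: $cmd"
  fi
done

if check_perl_module XML::Simple; then
  echo "found:   perl module XML::Simple"
else
  echo "missing: perl module XML::Simple"
fi
```

The script does not install anything; a missing XML::Simple module can then be installed with the cpan command given above.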
Download the repository from GitHub:
git clone https://github.com/kenscripts/UniAmp.git
Run the following script and specify UniAmp path:
source <path to UniAmp>/UniAmp/setup_uniamp.sh <path to UniAmp>
For specific examples of how to use the UniAmp pipeline, see the *.workflow.txt files in the docs folder. These files show the process for designing strain-specific primers for different bacteria.
The following is a general walkthrough of the UniAmp pipeline:
Before running UniAmp scripts, execute the following script and specify the path to UniAmp:
source <path to UniAmp>/start_uniamp.sh <path to UniAmp>
The target genome is compared to query genomes to find unique target sequences, so the choice of query genomes controls how unique the target sequences can be. For example, if a synthetic community of organisms is being studied, the genomes of these community members alone can be used as queries. However, if a high level of uniqueness is desired, the user should use query genomes with high sequence similarity to the target genome.
Below are some optional UniAmp scripts to obtain query genomes with high sequence similarity to a target genome:
get_gtdb_queries.sh <GTDBTK_DATA_PATH> <GTDB_DIR> <TARGET_GNOME> <OUT_DIR>
Description:
retrieves query genomes from GTDB-tk ani_rep output that match target genome sequence
Arguments:
<GTDBTK_DATA_PATH> = path to GTDB-tk reference data
<GTDB_DIR> = directory containing GTDB-tk ani_rep output
<TARGET_GNOME> = filename of target genome sequence
<OUT_DIR> = directory for output
Dependencies:
output from GTDB-tk ani_rep
GTDB-tk reference data
get_ncbi_queries.sh <TARGET_GNOME> <TAXON> <OUT_DIR>
Description:
retrieves query genomes from NCBI of the specified taxon with > 97% 16S rRNA sequence identity to target genome sequence
Arguments:
<TARGET_GNOME> = filename of target genome sequence
<TAXON> = search for query genomes from a specific taxon
<OUT_DIR> = path for output directory
Dependencies:
datasets
rnammer
blastn
Note: If the target genome sequence has previously been deposited in the NCBI database, the user should check the query genomes returned by get_ncbi_queries.sh to make sure the target genome sequence is not among them.
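One rough way to look for the target genome among the downloaded queries is to compare sequence checksums. The sketch below assumes uncompressed FASTA files and only catches copies whose sequence content and contig order match the target exactly; a re-assembled or reordered deposit would not be flagged. seq_checksum and find_target_copies are hypothetical helpers, not UniAmp scripts.

```shell
# Sketch: list query FASTA files whose sequence content is identical to the target.
# seq_checksum and find_target_copies are hypothetical helpers, not UniAmp scripts.

seq_checksum() {
  # Drop FASTA header lines, remove whitespace, uppercase, then checksum the sequence.
  grep -v '^>' "$1" | tr -d ' \t\r\n' | tr '[:lower:]' '[:upper:]' | cksum
}

find_target_copies() {
  # $1 = target genome FASTA, $2 = directory of query genomes
  target_sum=$(seq_checksum "$1")
  for query in "$2"/*; do
    [ -f "$query" ] || continue
    if [ "$(seq_checksum "$query")" = "$target_sum" ]; then
      echo "$query"
    fi
  done
}
```

Calling find_target_copies with the target genome and the query directory prints the path of every exact copy; no output means no byte-level match was found, which is not proof that the target is absent.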
Once a directory of query genomes is assembled, run the following script:
uni_seq.sh <TARGET_GNOME> <QUERY_DIR> <OUT_DIR>
Description:
finds unique sequences in target genome compared to query genomes by performing pairwise genome alignment followed by local alignment
Arguments:
<TARGET_GNOME> = path to target genome sequence
<QUERY_DIR> = path to directory containing query genomes
<OUT_DIR> = path to directory for output
Dependencies:
gnome_uniseq.sh:::nucmer
gnome_uniseq.sh:::show-coords
gnome_uniseq.sh:::bedtools
bioawk
local_uniseq.sh:::blastn
The output from uni_seq.sh can contain many unique target sequences. The number depends on how many query genomes were compared and how similar these query genomes were to the target genome.
For the later steps in the UniAmp pipeline, unique target sequences are uploaded to the web server of Primer-BLAST. As a result, it is convenient to have only one or a few unique target sequences to work with. To accomplish this, selection criteria can be imposed to select the best unique target sequence based on the user's preference.
In the original UniAmp publication, unique target sequences were filtered by size and GC content. The remaining sequences were then compared against the NCBI nucleotide collection database. The unique target sequence with no match or with the lowest similarity to any database sequence was used for primer design. This approach can be implemented by performing the following:
1) use bioawk to filter unique target sequences by size and gc content
# size: 400-800 bp
# gc: 40-60 %
$BIOAWK_PATH \
-c fastx \
'length($seq) > 400 && length($seq) <800 && gc($seq) < 0.60 && gc($seq) > 0.40 {print ">"$name"\n"$seq"\n"}' \
$UNISEQ_DIR/uni_seq.sc.fasta \
> $UNISEQ_DIR/uni_seq.filtered.fasta;
2) compare unique target sequences against NCBI database and select most unique sequence
get_remote_uniseq.sh <QUERY_FASTA> <BLASTDB> <TAXON> <OUT_DIR>
Description:
performs a remote blastn search and returns most unique query sequence
Arguments:
<QUERY_FASTA> = path for query fasta to use in blastn search
<BLASTDB> = name of NCBI database to search against (e.g. nt, the nucleotide collection)
<TAXON> = limit blastn search to specific taxon (used as entrez query for [organism])
<OUT_DIR> = path to output directory
Dependencies:
remote_blastn_lineage:::blastn
remote_blastn_lineage:::taxon
bioawk
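Before the remote search in step 2, it can help to confirm that the size and GC filter in step 1 kept the sequences you expect. The sketch below uses plain awk rather than bioawk to print the name, length, and GC fraction of each FASTA record; fasta_stats is a hypothetical helper, not a UniAmp script, and it treats the first word of each header as the record name.

```shell
# Sketch: print name, length, and GC fraction for each record of a FASTA file.
# fasta_stats is a hypothetical helper written in plain awk, not a UniAmp script.

fasta_stats() {
  awk '
    function report() {
      if (have) {
        # gsub returns the number of G/C characters it removed from the record.
        gc_count = gsub(/[GCgc]/, "", seq)
        printf "%s\t%d\t%.2f\n", name, len, (len > 0 ? gc_count / len : 0)
      }
    }
    /^>/ { report(); name = substr($1, 2); seq = ""; len = 0; have = 1; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0; len += length($0) }
    END { report() }
  ' "$1"
}
```

For example, running fasta_stats on $UNISEQ_DIR/uni_seq.filtered.fasta should list only records between 400 and 800 bp with a GC fraction between 0.40 and 0.60.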
Once a unique target sequence is selected, this sequence is uploaded to the Primer-BLAST server (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). Presently, no command-line tool exists for Primer-BLAST, so the Primer-BLAST HTML output is saved and used in the next step.
Users can select different Primer-BLAST parameters depending on their specific needs. Below are URLs containing previously used settings for designing bacterial strain-specific primers:
- Settings to design primers for end-point PCR
- Settings to design primers for qPCR
In the last step of the UniAmp pipeline, the Primer-BLAST html output is parsed and a text file is created. The specificity of these primers is then tested by performing in-silico PCR on a set of input genomes, containing the target genome as well as non-target genomes.
To perform in-silico PCR, a text file needs to be created containing the paths of the target genome and non-target genomes. This file can be created using the following shell command:
realpath <GNOME_DIR>/* > ispcr.gnome_paths.tsv
Argument:
<GNOME_DIR> = directory containing target genome and non-target genomes to test primer pair specificity
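Because the in-silico PCR step relies on this file, a quick check that it lists real files and includes the target genome can save a failed run. The sketch below assumes one path per line and compares the resolved path of the target genome against the list; check_gnome_paths is a hypothetical helper, not a UniAmp script.

```shell
# Sketch: confirm every listed genome path exists and the target genome is listed.
# check_gnome_paths is a hypothetical helper, not a UniAmp script.

check_gnome_paths() {
  # $1 = file with one genome path per line, $2 = path to target genome
  rc=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if [ ! -f "$path" ]; then
      echo "missing file: $path" >&2
      rc=1
    fi
  done < "$1"
  # The list holds absolute paths, so resolve the target the same way before matching.
  if ! grep -Fxq "$(realpath "$2" 2>/dev/null)" "$1"; then
    echo "target genome not listed: $2" >&2
    rc=1
  fi
  return "$rc"
}
```

The function returns a non-zero status and prints the offending paths to standard error when something is missing.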
Once the text file containing paths to the target genome and non-target genomes is created, run the following script:
uni_pcr.sh <PB_HTML> <GNOME_PATHS> <TARGET_GNOME> <OUT_DIR>
Description:
parses Primer-BLAST output and uses primers to perform in-silico PCR on target and non-target genomes
Arguments:
<PB_HTML> = path to Primer-BLAST html output
<GNOME_PATHS> = path to file containing paths to target and non-target genome files
<TARGET_GNOME> = path to target genome sequence
<OUT_DIR> = path to output directory
Dependencies:
pb_parser.py:::BeautifulSoup4 python package
run_isPCR.sh:::usearch
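The command-line steps of the walkthrough can be strung together as follows. This is a dry-run sketch: it prints the uni_seq.sh and uni_pcr.sh calls with the argument order documented above and executes them only when DRY_RUN=0. All file and directory names are hypothetical placeholders, and run_step is not part of UniAmp.

```shell
# Sketch: the command-line steps of the walkthrough in order, as a dry run.
# All paths are hypothetical placeholders; run_step is not part of UniAmp.
DRY_RUN=${DRY_RUN:-1}

run_step() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

TARGET=genomes/target.fasta          # hypothetical target genome
QUERY_DIR=queries                    # hypothetical directory of query genomes
UNISEQ_DIR=uniseq_out                # hypothetical uni_seq.sh output directory
PB_HTML=primer_blast.html            # Primer-BLAST result saved from the web server
GNOME_PATHS=ispcr.gnome_paths.tsv    # file created with realpath as described above
PCR_DIR=unipcr_out                   # hypothetical uni_pcr.sh output directory

# Find unique target sequences.
run_step uni_seq.sh "$TARGET" "$QUERY_DIR" "$UNISEQ_DIR"

# Manual steps: filter the unique sequences, select one, submit it to
# Primer-BLAST, and save the html result as $PB_HTML.

# Parse the Primer-BLAST output and run in-silico PCR.
run_step uni_pcr.sh "$PB_HTML" "$GNOME_PATHS" "$TARGET" "$PCR_DIR"
```

Keeping the default DRY_RUN=1 makes it easy to review the exact commands before committing to a long alignment run.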