-
Notifications
You must be signed in to change notification settings - Fork 45
Troubleshooting, common issues, and recommendations
Documentation current for HybPiper version 2.1.2
Bioinformatic sequence recovery for universal target-capture bait kits can be substantially improved by appropriate tailoring of target files to the group under study. If you have poor gene recovery for many samples, you may want to extract the best sequences at each gene and add them to your target file. Rerunning HybPiper with this new set of targets should improve recovery for many samples.
Here are some suggestions if you are using the Angiosperms353 Universal Probes for flowering plants:
- Use the "mega353" curated target file available here.
- Search for close representative sequences in the Kew PAFTOL Tree of Life Explorer
- Run HybPiper with distantly related targets and extract coding sequences from the best samples. Add these to the target file and re-run HybPiper for the samples with low recovery.
Target files containing sequences with regions of low complexity can cause issues with HybPiper. This is because:
-
Many low-complexity sample reads can map to regions of low complexity in the target file sequences. These low-complexity reads are likely to comprise largely off-target reads. Millions of low-complexity reads can map to a low-complexity region in a target sequence.
-
The SPAdes assembler then attempts to assemble the large, low-complexity read set for the given gene. This can take a long time, and often produces hundreds of low-complexity contigs.
-
These low complexity contigs are passed on to the exon-extraction stage of the pipeline, during which the 'best' target file protein sequence for a given gene (translated from a DNA sequence if necessary) is aligned with the DNA contigs using the program Exonerate. Again, this can take a very long time if there are many contigs, and produces astronomically large log files if the
--verbose_logging
flag is used withhybpiper assemble
.
To deal with these issues, HybPiper provides:
-
The parameter
--timeout_assemble X
. If provided, this will kill long-running SPAdes gene assemblies if they take longer than X percent of the average gene assembly time. A running median assembly time is calculated after at least 3 assemblies have been completed. Note that no gene sequence will be produced for the sample if the gene assembly is killed. By default, all assemblies will be allowed to complete with no time limit. -
The parameter
--timeout_exonerate_contigs X
. The default timeout for the exon-extraction stage is 120 seconds. After this time, processing of a given gene will be cancelled and no gene sequence will be produced for the sample. HybPiper will log a list of genes that were cancelled due to timeout. In our testing, no gene that did not have hundreds of low complexity SPAdes contigs took longer than 120 seconds, but this number is adjustable using the--timeout_exonerate_contigs
parameter if necessary.
STRONGLY RECOMMENDED:
We recommend checking your target file for sequences with low-complexity regions prior to running hybpiper assemble
. To assist with this, HybPiper provides the command hybpiper check_targetfile
. This will:
-
Scan along each sequence in your target file using a sliding window of 100 nucleotides (default for a DNA target file) or 50 amino-acids (default for a protein target file). The sliding window size is adjustable using the parameter
--sliding_window_size
. Within each sliding window, the complexity of the sequence will be calculated using Shannon entropy. -
If a sequence within a sliding window falls beneath a given 'complexity threshold' (see below), the sequence is reported as having low-complexity regions.
-
Write a control file called
fix_target_<date_time>.ctl
, containing a list of names for sequences with low-complexity regions. This control file can be used as input to the commandhybpiper fix_targetfile
.
The Shannon entropy is calculated using a base of 2. This means that for a DNA target file containing only ATCG characters (i.e. assuming no N or ambiguity characters), the maximum 'complexity value' achievable is ~2.0. For a protein target file containing only ARNDCQEGHILKMFPSTWYV characters (i.e. assuming no ambiguity characters), the maximum 'complexity value' achievable is ~4.0. The default thresholds used by the hybpiper check_targetfile
command are 1.5 (DNA) and 3.0 (protein), but this can be adjusted using the --complexity_minimum_threshold
parameter.
Note: Using these defaults, the sequences flagged as having low-complexity regions might differ between a DNA target file and its corresponding protein translation.
We recommend removing flagged sequences from your target file before running hybpiper assemble
. To assist with this, HybPiper provides the command hybpiper fix_targetfile
.
Short summary:
If your target file contains sequences that are closely related to your samples, then having protein-coding nucleotide sequences in your target file that do not begin in the correct frame (i.e. can't be translated into the corresponding protein sequence in the first forwards frame) won't matter much; the 5' and 3' ends of some recovered gene sequences might be slightly truncated. However, if your target file does not contain sequences closely related to your samples, having out-of-frame nucleotide sequences in your target file can matter a lot - you might see much larger truncations in the sequences recovered, or no sequence might be recovered at all for some genes.
Long explanation:
After HybPiper has assembled contigs for a given gene, it extracts coding sequence (that is, exons only) from the contigs via the following process:
- Select the 'best' target file sequence for a given gene as a reference protein (translated from the nucleotide sequence in the first forwards frame if a nucleotide target file is provided). This should correspond to the target file sequence that is most similar to a given gene/sample.
- Align the reference protein to the nucleotide contigs using the program Exonerate, with an intron-aware model (protein2genome).
- Extract and concatenate regions of the contigs identified as exons in the Exonerate search (i.e. regions that, when translated, have good matches to the reference protein; note that Exonerate translates the contigs in all 6 frames when trying to align the protein reference).
Given the above process, issues can arise when the reference target file protein sequence is translated in the wrong frame, and the taxon from which the reference protein originates is somewhat distantly related to the given sample. This is particularly problematic if the reference sequence has been translated from the third wobble position, as these positions are more likely to have mutated at a higher rate (due to reduced positive or purifying selection) and to different nucleotides when comparing the target file sequence and the gene contig. This can often mean that the translated amino-acids differ between the two sequences.
To illustrate this, here's a fragment of an Exonerate alignment between a target file reference protein and a nucleotide contig that's been assembled from sample reads. In this case, the reference protein (top) is closely related to the sample and has been translated in the correct frame:
20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA : 41
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA
600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537
...whereas here is the alignment when the reference protein has been translated in the third frame:
18 : ArgIleTyrTyrAlaThrGlyLeuGlyCysLysArg***LeuProAsnTrpTrpLysCysPheA : 39
!:!||||||||||||||||||||| !!||| !!|||||||||||||||||||||||| !!||||
GlnIleTyrTyrAlaThrGlyLeuArgCys***Arg***LeuProAsnTrpTrpLysArgPheA
604 : CAGATTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCA : 541
As can be seen, the second alignment is less similar and contains stop codons, as expected given the query reference 'protein' (and hence the best contig hit translation) has been translated in the wrong frame. However, if the taxa from which the two sequences originate are fairly closely related, the translation of the nucleotide contig in the third frame can still match the reference 'protein' well enough for Exonerate to produce an alignment, and this results in HybPiper recovering a sequence for this gene.
However, if the target file sequence is less closely related, it doesn't work as well. Here's an Exonerate alignment using the same gene contig but with a more distantly related protein reference (translated in the correct frame):
20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysHisIleGlyGlyAsnIleSerThrA : 41
|||||||||||||||||||||||||||||||||||||||!!.||||||||||||:!!|||||||
PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA
600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537
This looks fine, but if the reference protein is translated in the third frame (i.e. sequence N*SWLCVGEFKLLCGK*GVYYAA*PGSKR*LSHRWQHFN*RRWPTFHTLRFTSWKCTWS*SCLGRWNCSRYAYYFKERQHWI*SEALIYWK*RLTRNCH*NCNTY
), then the similarity falls beneath the threshold Exonerate requires to return an alignment; the result of the Exonerate search is empty, and HybPiper does not recover a sequence for this gene.
To assist with correcting this issue with target files containing nucleotide sequences, HybPiper provides the command hybpiper fix_targetfile
. See below.
As described above, target files can have several issues that adversely effect sequence recovery using HybPiper. To help address these issues, HybPiper provides the commands hybpiper check_targetfile
and hybpiper fix_targetfile
.
Checking a target file for issues:
For a target file containing nucleotide sequences, the command hybpiper check_targetfile
will:
- Check that the sequence fasta headers are formatted correctly (i.e.
Taxon-geneName
), and that no duplicate sequence names are found. - Check that the sequences do not contain any unexpected stop codons (a single terminal stop codon is fine, but this can be changed via the
--no_terminal_stop_codons
flag). - Check that the sequences are a multiple of three - that is, that they contain putative complete codons only. Note that this process assumes that the sequences correspond to protein coding genes.
- Check if the sequences contain low-complexity regions (based on user-adjustable thresholds).
- Write a control file with the name
fix_target_<date_time>.ctl
. This text file will contain a list of names for sequences with low-complexity regions (if found), as well as other important parameters from thehybpiper check_targetfile
run.
For a target file containing protein sequences, the command hybpiper check_targetfile
will:
- Check that the sequence fasta headers are formatted correctly (i.e.
Taxon-geneName
), and that no duplicate sequence names are found. - Check if the sequences contain low-complexity regions (based on user-adjustable thresholds).
- Write a control file with the name
fix_target_<date_time>.ctl
. This text file will contain a list of names for sequences with low-complexity regions (if any), as well as other important parameters from thehybpiper check_targetfile
run.
The control file written by hybpiper check_targetfile
can then be used as input to the command hybpiper fix_targetfile
.
Fixing issues in a target file:
When running hybpiper fix_targetfile
with a target file containing nucleotide sequences, you can optionally provide a fasta file of protein sequences for one or more genes, via the parameter --reference_protein_file
. These protein sequences are used as 'external' references when selecting between multiple candidate open reading frames (if present) in the target file sequences, as described below.
The command hybpiper fix_targetfile
performs the following steps:
-
If a sequence contains a single forward open reading frame with no unexpected stop codons, and a corresponding 'external' protein reference is not provided, this reading frame is accepted as the 'correct' frame. The sequence is trimmed to contain only complete codons (if necessary). The corresponding protein translation is then available as an 'internal' reference for this gene, and can be used when selecting between multiple candidate reading frames for other sequences. See below for more details.
-
If a sequence contains a single candidate forward open reading frame with no unexpected stop codons, and a corresponding 'external' protein reference is provided via the parameter
--reference_protein_file
, check whether the translation of the candidate frame is less than a maximum allowed distance from the reference. If it is, accept the candidate frame as 'correct'. If not, remove the sequence from the fixed target file. -
If a sequence contains multiple candidate forward open reading frames with no unexpected stop codons, check for an 'external' protein reference. If found, select the reading frame with the least distance from the reference. If all candidate frames exceed the maximum allowed distance from the reference, remove the sequence from the fixed target file. If no 'external' protein reference is found, save the candidate frames until all sequences have been processed with the steps above.
-
For sequences with more than once candidate frame but no 'external' reference, check for an 'internal' reference (from step 1 above). If found, select the reading frame with the least distance from the reference. If all candidate frames exceed the maximum allowed distance from the reference, remove the sequence from the fixed target file. Note that if there is more than one 'internal' reference for a given gene, the longest sequence is used.
-
Filter all remaining accepted sequences/frames by length, on a per-gene basis, using the
--filter_by_length_percentage
parameter threshold. If there is more than one representative sequence for a given gene, sequences less than the threshold length of the longest representative sequence will be removed. This can be useful because if such shorter target sequences are selected as the 'best' target file reference for Exonerate searches of assembled contigs when runninghybpiper assemble
, they will limit the maximum length of locus sequence that can be recovered. Note that this threshold defaults to0.0
(all sequences retained). -
If sequences with low-complexity regions were detected by
hybpiper check_targetfile
(and hence listed in the control*.ctl
file), remove such sequences from the fixed target file (if not already removed by the previous steps). This behaviour can be controlled by the--keep_low_complexity_sequences
flag.
When processing a target file containing protein sequences with hybpiper fix_targetfile
, the following steps are performed:
-
Filter all sequences on a per-gene basis, using the
--filter_by_length_percentage
parameter threshold. If there is more than one representative sequence for a given gene, sequences less than the threshold length of the longest representative sequence will be removed. This can be useful because if such shorter target sequences are selected as the 'best' target file reference for Exonerate searches of assembled contigs when runninghybpiper assemble
, they will limit the maximum length of locus sequence that can be recovered. Note that this threshold defaults to0.0
(all sequences retained). -
If sequences with low-complexity regions were detected by
hybpiper check_targetfile
(and hence listed in the control*.ctl
file), remove such sequences from the fixed target file (if not already removed by the previous steps). This behaviour can be controlled by the--keep_low_complexity_sequences
flag.
For a list of output files produced by the command hybpiper fix_targetfile
, see here.
HybPiper can map sample reads to a target file containing nucleotide sequences using BWA, or protein sequences using BLASTx or DIAMOND. While results may vary depending on the dataset and/or target file, our testing has found that using a target file with protein sequences and mapping reads to target sequences using BLASTx (translated nucleotide query and protein database) results in longer contigs for many genes, and the generation of contigs for some samples/genes that had none when using BWA mapping. This can occur even when BLASTx maps fewer reads overall than BWA.
Consequently, we recommend using a target file containing protein sequences and BLASTx mapping.
See below for some example comparisons between BWA and BLASTx/DIAMOND.
Sample 1 (15,758,918 reads, Angiosperms353 target file):
Mapping tool | Total reads mapped | Genes with sequences | Run time * |
---|---|---|---|
BWA | 1,103,583 | 310 | 394 |
BLASTx | 1,304,929 | 340 | 2204 |
DIAMOND** | 1,325,989 | 340 | 1269 |
* 30 threads, AMD EPYC 7601 2700 MHz
** Default sensitivity
For genes where both mapping methods recovered a sequence:
-
Average increase in length of genes for BLASTx vs BWA: ~45% (median = 20%)
-
Average increase in length of genes for DIAMOND vs BWA: ~45% (median = 20%)
Sample 2 (611,894 reads, Angiosperms353 target file):
Mapping tool | Total reads mapped | Genes with sequences | Run time (sec) * |
---|---|---|---|
BWA | 115,784 | 206 | 202 |
BLASTx | 35,338 | 252 | 312 |
DIAMOND** | 31,707 | 249 | 222 |
* 30 threads, AMD EPYC 7601 2700 MHz
** Default sensitivity
For genes where both mapping methods recovered a sequence:
-
Average increase in length of genes for BLASTx vs BWA: ~30% (median = 4%)
-
Average increase in length of genes for DIAMOND vs BWA: ~25% (median = 1%)
As can be seen from the statistics above, BLASTx takes significantly longer than BWA, and DIAMOND appears to provide a good balance between sensitivity and speed.
Note: if you provide a nucleotide target file to hybpiper assemble
via the t_dna
/targetfile_dna
parameter, but you do not provide the flag --bwa
, HybPiper will automatically translate your target file and use BLASTx for reads mapping. It is expected that your nucleotide target file can be translated correctly (i.e. in-frame with no stop codons) from the first codon positions in the forwards frame. For nucleotide target file sequences that either a) were not a multiple of three and hence did not comprise complete codons only; or b) contained stop codons in the translated protein, a warning and list of gene names will be logged to file.
Note: When using BLASTx/DIAMOND, we have found that an eValue of 1e-4 works well, and this is the default. The user can change this using the parameter --evalue
.
Ideally, if you run HybPiper with a target file containing protein-coding sequences comprising exons only, your output *.FNA
and .FAA
sequences should not contain any internal stop codons (of course, a terminal stop codon at an expected position is fine). If HybPiper detects an internal stop codon in one or more of your output sequences, it issues a warning and writes the corresponding gene names to the file <prefix>_genes_with_non_terminal_stop_codons.txt
. This will allow you to check any problematic sequences, and troubleshoot possible causes.
There are three main reasons that HybPiper can produce an output sequence containing an internal stop codon:
-
One or more incorrect sequences in your target file. For example, if a given target file sequence for a protein-coding gene accidentally contains an intron sequence, and HybPiper selects this target file sequence as the 'best' reference for a given gene, the intron will be included in the query sequence during Exonerate searches of SPAdes contigs. As HybPiper runs Exonerate using the
protein2genome
model, it expects the query sequence to comprise exon sequences only; the intron sequence in the query could potentially be aligned with the corresponding intron sequence in the SPAdes contig target, and this region will be extracted as putative exonic coding sequence and possibly incorporated into the final*.FNA
output sequence for this gene. If the intron sequence contains stop codons, so will your output gene sequence. -
The SPAdes assembler has assembled a contig with a stop codon in the middle of exonic sequence, and this exonic sequence aligns well with the target file protein reference in Exonerate alignments. This situation can occur when reads assembled for a given gene (or a subset of the reads) do indeed encode an in-frame stop codon. We have observed that SPAdes can, on occasion, include this stop codon in the assembled contig even if there is a greater number of reads encoding a non-stop codon at this position.
-
Exonerate has 'incorrectly' extended an alignment between the target file protein query and a SPAdes nucleotide target. This can result in non-coding or intron sequence from the SPAdes target contig getting incorporated in to the output sequence for a given gene. Consider the Exonerate alignment below:
941 : ValTyrPheGlyAsnValCysLeuArgPheLeuProValValPheAspIleValIleHisArgT : 962
:!! ||| ! ! ||| !!!!! !|||! ! ! !|||||||||||||||||||||||||
IleLeuPheLeu***ValIlePheLeuPheTyrVal***ValPheAspIleValIleHisArgT
136 : ATTCTGTTTCTTTAAGTGATTTTCCTTTTTTATGTTTAGGTGTTCGACATAGTTATACACAGGT : 199
963 : yrLeuGluIleProProValThrLysSerLeuGluThrLeuLeuGluHisLeuGlyCysLeuTy : 983
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
yrLeuGluIleProProValThrLysSerLeuGluThrLeuLeuGluHisLeuGlyCysLeuTy
200 : ACTTGGAGATACCACCAGTCACAAAATCACTGGAGACATTACTGGAACATTTAGGCTGTCTCTA : 262
984 : rLysPheHisAspArg : 988
||||||||||! !!:!
rLysPheHisGlyLys
263 : TAAATTTCATGGTAAG : 279
Here you can see that the beginning of the alignment between the target file protein and the SPAdes nucleotide target looks dubious, and actually includes stop codons in the target. A closer examination of the target sequence (not shown) reveals that this region of the Exonerate alignment does indeed correspond to an intron sequence, and should not be included in the *.FNA
sequence output by HybPiper.
As of HybPiper version 2.1.1, a new filter has been added to process Exonerate alignments and trim the 5' and 3' termini of SPAdes hits to remove such spurious sequences. Briefly, the new filter applies a sliding window to the Exonerate alignment, and calculates the similarity between the query and target within the window sequence. It then checks if the similarity is above a given threshold, and, if so, no trimming is applied. If the similarity is below the threshold, consecutive sliding windows are tested. Once a sliding window above the threshold is found, the target sequence is trimmed to start at this position (or, more particularly, at the first codon position within the window that is identical to the query, as denoted by |||
). The size of the sliding windows can be adjusted using the parameter --exonerate_hit_sliding_window_size
when running hybpiper assemble
(default size is 3 amino acids), and the similarity threshold is set by the --thresh
value (as also used to initially filter Exonerate alignments at a global level; the default value is 55).
For example, in the alignment above the first sliding window starting at the 5' end contains the region:
ValTyrPhe
:!! |||
IleLeuPhe
ATTCTGTTT
...with a similarity score of 1/3 or 33.3%. As this is below the threshold of 55%, the next sliding window is tested:
TyrPheGly
||| !
LeuPheLeu
CTGTTTCTT
...which also has a similarity score of 33%. This process continues until a window above the threshold is found, in this case:
ValValPhe
!||||||
***ValPhe
TAGGTGTTC
...with a similarity score of 66.6%. The first amino-acid position in the window with an identical match is identified (in this case, the second Val
), and the hit sequence is trimmed to begin at this position. So, for this example Exonerate alignment, the hit sequence will be 5' trimmed to begin at:
GTGTTCGACATAGTTATACACAGGT...etc.
The same process is then applied starting from the 3' end of the Exonerate alignment, and the hit sequence is 3' trimmed if necessary.
IMPORTANT: While this filter can be effective in 'cleaning up' Exonerate alignments prior to generating output sequences, it should be noted that the sensitivity and specificity of trimming is dependent on a combination of sliding window size, threshold, and the specific Exonerate alignment for any given hit. Therefore, different hits (and indeed the two termini of a single hit) might be more accurately trimmed with different sliding window sizes and thresholds. Currently, HybPiper applies a single sliding window size and threshold to all sequences for a given run. If you would like to try different options, you can re-run the hybpiper assemble
command using the --start_from exonerate_contigs
parameter, and vary the --exonerate_hit_sliding_window_size
and --thresh
values as desired.
The raw Exonerate alignment(s) for a given gene can be found in the exonerate_results.fasta
file, located within each gene directory under a subdirectory named after your --prefix
. For example, for gene 5166
from a sample run with --prefix ERR3650077
, the exonerate_results.fasta
file can be found at ERR3650077/5166/ERR3650077/exonerate_results.fasta
(also see here). This file can be opened in any text editor.
To determine which of the Exonerate alignments contain SPAdes contig hit sequence that was included in the output gene *.FNA
file, open the exonerate_stats.tsv
file (found in the same directory) using a spreadsheet program such as Excel, Numbers or Calc. This report contains information on all SPAdes contigs with Exonerate hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by --thresh
. To determine whether a given contig hit was 5' or 3' trimmed using the filter described above, the relevant columns are query_HSP_range_limits_original
and query_HSP_range_limits_trimmed
(coordinates are relative to the amino-acid reference) and/or hit_HSP_range_limits_original
and hit_HSP_range_limits_trimmed
(coordinates are relative to the nucleotide SPAdes contig). Note that the 3-prime_bases_trimmed
column refers to a different trimming stage that occurs later in the pipeline.