Skip to content

Troubleshooting, common issues, and recommendations

chrisjackson edited this page Jan 31, 2023 · 35 revisions

Documentation current for HybPiper version 2.1.2


1.0 Target file

1.1 Creating a target file tailored for your study

Bioinformatic sequence recovery for universal target-capture bait kits can be substantially improved by appropriate tailoring of target files to the group under study. If you have poor gene recovery for many samples, you may want to extract the best sequences at each gene and add them to your target file. Rerunning HybPiper with this new set of targets should improve recovery for many samples.

1.1.1 Angiosperms353

Here are some suggestions if you are using the Angiosperms353 Universal Probes for flowering plants:

  • Use the "mega353" curated target file available here.
  • Search for close representative sequences in the Kew PAFTOL Tree of Life Explorer
  • Run HybPiper with distantly related targets and extract coding sequences from the best samples. Add these to the target file and re-run HybPiper for the samples with low recovery.

1.2 Target files with low-complexity sequences: troubleshooting

Target files containing sequences with regions of low complexity can cause issues with HybPiper. This is because:

  1. Many low-complexity sample reads can map to regions of low complexity in the target file sequences. These low-complexity reads are likely to comprise largely off-target reads. Millions of low-complexity reads can map to a low-complexity region in a target sequence.

  2. The SPAdes assembler then attempts to assemble the large, low-complexity read set for the given gene. This can take a long time, and often produces hundreds of low-complexity contigs.

  3. These low complexity contigs are passed on to the exon-extraction stage of the pipeline, during which the 'best' target file protein sequence for a given gene (translated from a DNA sequence if necessary) is aligned with the DNA contigs using the program Exonerate. Again, this can take a very long time if there are many contigs, and produces astronomically large log files if the --verbose_logging flag is used with hybpiper assemble.

To deal with these issues, HybPiper provides:

  • The parameter --timeout_assemble X. If provided, this will kill long-running SPAdes gene assemblies if they take longer than X percent of the average gene assembly time. A running median assembly time is calculated after at least 3 assemblies have been completed. Note that no gene sequence will be produced for the sample if the gene assembly is killed. By default, all assemblies will be allowed to complete with no time limit.

  • The parameter --timeout_exonerate_contigs X. The default timeout for the exon-extraction stage is 120 seconds. After this time, processing of a given gene will be cancelled and no gene sequence will be produced for the sample. HybPiper will log a list of genes that were cancelled due to timeout. In our testing, no gene that did not have hundreds of low complexity SPAdes contigs took longer than 120 seconds, but this number is adjustable using the --timeout_exonerate_contigs parameter if necessary.

STRONGLY RECOMMENDED:

We recommend checking your target file for sequences with low-complexity regions prior to running hybpiper assemble. To assist with this, HybPiper provides the command hybpiper check_targetfile. This will:

  1. Scan along each sequence in your target file using a sliding window of 100 nucleotides (default for a DNA target file) or 50 amino-acids (default for a protein target file). The sliding window size is adjustable using the parameter --sliding_window_size. Within each sliding window, the complexity of the sequence will be calculated using Shannon entropy.

  2. If a sequence within a sliding window falls beneath a given 'complexity threshold' (see below), the sequence is reported as having low-complexity regions.

  3. Write a control file called fix_target_<date_time>.ctl, containing a list of names for sequences with low-complexity regions. This control file can be used as input to the command hybpiper fix_targetfile.

The Shannon entropy is calculated using a base of 2. This means that for a DNA target file containing only ATCG characters (i.e. assuming no N or ambiguity characters), the maximum 'complexity value' achievable is ~2.0. For a protein target file containing only ARNDCQEGHILKMFPSTWYV characters (i.e. assuming no ambiguity characters), the maximum 'complexity value' achievable is ~4.0. The default thresholds used by the hybpiper check_targetfile command are 1.5 (DNA) and 3.0 (protein), but this can be adjusted using the --complexity_minimum_threshold parameter.

Note: Using these defaults, the sequences flagged as having low-complexity regions might differ between a DNA target file and its corresponding protein translation.

We recommend removing flagged sequences from your target file before running hybpiper assemble. To assist with this, HybPiper provides the command hybpiper fix_targetfile.

1.3 Target files with protein-coding nucleotide sequences in the wrong frame

Short summary:

If your target file contains sequences that are closely related to your samples, then having protein-coding nucleotide sequences in your target file that do not begin in the correct frame (i.e. can't be translated into the corresponding protein sequence in the first forwards frame) won't matter much; the 5' and 3' ends of some recovered gene sequences might be slightly truncated. However, if your target file does not contain sequences closely related to your samples, having out-of-frame nucleotide sequences in your target file can matter a lot - you might see much larger truncations in the sequences recovered, or no sequence might be recovered at all for some genes.

Long explanation:

After HybPiper has assembled contigs for a given gene, it extracts coding sequence (that is, exons only) from the contigs via the following process:

  1. Select the 'best' target file sequence for a given gene as a reference protein (translated from the nucleotide sequence in the first forwards frame if a nucleotide target file is provided). This should correspond to the target file sequence that is most similar to a given gene/sample.
  2. Align the reference protein to the nucleotide contigs using the program Exonerate, with an intron-aware model (protein2genome).
  3. Extract and concatenate regions of the contigs identified as exons in the Exonerate search (i.e. regions that, when translated, have good matches to the reference protein; note that Exonerate translates the contigs in all 6 frames when trying to align the protein reference).

Given the above process, issues can arise when the reference target file protein sequence is translated in the wrong frame, and the taxon from which the reference protein originates is somewhat distantly related to the given sample. This is particularly problematic if the reference sequence has been translated from the third wobble position, as these positions are more likely to have mutated at a higher rate (due to reduced positive or purifying selection) and to different nucleotides when comparing the target file sequence and the gene contig. This can often mean that the translated amino-acids differ between the two sequences.

To illustrate this, here's a fragment of an Exonerate alignment between a target file reference protein and a nucleotide contig that's been assembled from sample reads. In this case, the reference protein (top) is closely related to the sample and has been translated in the correct frame:

  20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA :  41
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA
 600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537

...whereas here is the alignment when the reference protein has been translated in the third frame:

  18 : ArgIleTyrTyrAlaThrGlyLeuGlyCysLysArg***LeuProAsnTrpTrpLysCysPheA :  39
       !:!||||||||||||||||||||| !!||| !!|||||||||||||||||||||||| !!||||
       GlnIleTyrTyrAlaThrGlyLeuArgCys***Arg***LeuProAsnTrpTrpLysArgPheA
 604 : CAGATTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCA : 541

As can be seen, the second alignment is less similar and contains stop codons, as expected given the query reference 'protein' (and hence the best contig hit translation) has been translated in the wrong frame. However, if the taxa from which the two sequences originate are fairly closely related, the translation of the nucleotide contig in the third frame can still match the reference 'protein' well enough for Exonerate to produce an alignment, and this results in HybPiper recovering a sequence for this gene.

However, if the target file sequence is less closely related, it doesn't work as well. Here's an Exonerate alignment using the same gene contig but with a more distantly related protein reference (translated in the correct frame):

  20 : PheIleMetProLeuAspLeuGlyAlaLysGlySerCysHisIleGlyGlyAsnIleSerThrA :  41
       |||||||||||||||||||||||||||||||||||||||!!.||||||||||||:!!|||||||
       PheIleMetProLeuAspLeuGlyAlaLysGlySerCysGlnIleGlyGlyAsnValSerThrA
 600 : TTTATTATGCCACTGGACTTAGGTGCTAAAGGTAGCTGCCAAATTGGTGGAAACGTTTCAACAA : 537

This looks fine, but if the reference protein is translated in the third frame (i.e. sequence N*SWLCVGEFKLLCGK*GVYYAA*PGSKR*LSHRWQHFN*RRWPTFHTLRFTSWKCTWS*SCLGRWNCSRYAYYFKERQHWI*SEALIYWK*RLTRNCH*NCNTY), then the similarity falls beneath the threshold Exonerate requires to return an alignment; the result of the Exonerate search is empty, and HybPiper does not recover a sequence for this gene.

To assist with correcting this issue with target files containing nucleotide sequences, HybPiper provides the command hybpiper fix_targetfile. See below.

1.4 Fixing and filtering your target file

As described above, target files can have several issues that adversely effect sequence recovery using HybPiper. To help address these issues, HybPiper provides the commands hybpiper check_targetfile and hybpiper fix_targetfile.

Checking a target file for issues:

For a target file containing nucleotide sequences, the command hybpiper check_targetfile will:

  1. Check that the sequence fasta headers are formatted correctly (i.e. Taxon-geneName), and that no duplicate sequence names are found.
  2. Check that the sequences do not contain any unexpected stop codons (a single terminal stop codon is fine, but this can be changed via the --no_terminal_stop_codons flag).
  3. Check that the sequences are a multiple of three - that is, that they contain putative complete codons only. Note that this process assumes that the sequences correspond to protein coding genes.
  4. Check if the sequences contain low-complexity regions (based on user-adjustable thresholds).
  5. Write a control file with the name fix_target_<date_time>.ctl. This text file will contain a list of names for sequences with low-complexity regions (if found), as well as other important parameters from the hybpiper check_targetfile run.

For a target file containing protein sequences, the command hybpiper check_targetfile will:

  1. Check that the sequence fasta headers are formatted correctly (i.e. Taxon-geneName), and that no duplicate sequence names are found.
  2. Check if the sequences contain low-complexity regions (based on user-adjustable thresholds).
  3. Write a control file with the name fix_target_<date_time>.ctl. This text file will contain a list of names for sequences with low-complexity regions (if any), as well as other important parameters from the hybpiper check_targetfile run.

The control file written by hybpiper check_targetfile can then be used as input to the command hybpiper fix_targetfile.

Fixing issues in a target file:

When running hybpiper fix_targetfile with a target file containing nucleotide sequences, you can optionally provide a fasta file of protein sequences for one or more genes, via the parameter --reference_protein_file. These protein sequences are used as 'external' references when selecting between multiple candidate open reading frames (if present) in the target file sequences, as described below.

The command hybpiper fix_targetfile performs the following steps:

  1. If a sequence contains a single forward open reading frame with no unexpected stop codons, and a corresponding 'external' protein reference is not provided, this reading frame is accepted as the 'correct' frame. The sequence is trimmed to contain only complete codons (if necessary). The corresponding protein translation is then available as an 'internal' reference for this gene, and can be used when selecting between multiple candidate reading frames for other sequences. See below for more details.

  2. If a sequence contains a single candidate forward open reading frame with no unexpected stop codons, and a corresponding 'external' protein reference is provided via the parameter --reference_protein_file, check whether the translation of the candidate frame is less than a maximum allowed distance from the reference. If it is, accept the candidate frame as 'correct'. If not, remove the sequence from the fixed target file.

  3. If a sequence contains multiple candidate forward open reading frames with no unexpected stop codons, check for an 'external' protein reference. If found, select the reading frame with the least distance from the reference. If all candidate frames exceed the maximum allowed distance from the reference, remove the sequence from the fixed target file. If no 'external' protein reference is found, save the candidate frames until all sequences have been processed with the steps above.

  4. For sequences with more than once candidate frame but no 'external' reference, check for an 'internal' reference (from step 1 above). If found, select the reading frame with the least distance from the reference. If all candidate frames exceed the maximum allowed distance from the reference, remove the sequence from the fixed target file. Note that if there is more than one 'internal' reference for a given gene, the longest sequence is used.

  5. Filter all remaining accepted sequences/frames by length, on a per-gene basis, using the --filter_by_length_percentage parameter threshold. If there is more than one representative sequence for a given gene, sequences less than the threshold length of the longest representative sequence will be removed. This can be useful because if such shorter target sequences are selected as the 'best' target file reference for Exonerate searches of assembled contigs when running hybpiper assemble, they will limit the maximum length of locus sequence that can be recovered. Note that this threshold defaults to 0.0 (all sequences retained).

  6. If sequences with low-complexity regions were detected by hybpiper check_targetfile (and hence listed in the control *.ctl file), remove such sequences from the fixed target file (if not already removed by the previous steps). This behaviour can be controlled by the --keep_low_complexity_sequences flag.

When processing a target file containing protein sequences with hybpiper fix_targetfile, the following steps are performed:

  1. Filter all sequences on a per-gene basis, using the --filter_by_length_percentage parameter threshold. If there is more than one representative sequence for a given gene, sequences less than the threshold length of the longest representative sequence will be removed. This can be useful because if such shorter target sequences are selected as the 'best' target file reference for Exonerate searches of assembled contigs when running hybpiper assemble, they will limit the maximum length of locus sequence that can be recovered. Note that this threshold defaults to 0.0 (all sequences retained).

  2. If sequences with low-complexity regions were detected by hybpiper check_targetfile (and hence listed in the control *.ctl file), remove such sequences from the fixed target file (if not already removed by the previous steps). This behaviour can be controlled by the --keep_low_complexity_sequences flag.

For a list of output files produced by the command hybpiper fix_targetfile, see here.

2.0 Read mapping

2.1 BWA vs BLAST/DIAMOND

HybPiper can map sample reads to a target file containing nucleotide sequences using BWA, or protein sequences using BLASTx or DIAMOND. While results may vary depending on the dataset and/or target file, our testing has found that using a target file with protein sequences and mapping reads to target sequences using BLASTx (translated nucleotide query and protein database) results in longer contigs for many genes, and the generation of contigs for some samples/genes that had none when using BWA mapping. This can occur even when BLASTx maps fewer reads overall than BWA.

Consequently, we recommend using a target file containing protein sequences and BLASTx mapping.

See below for some example comparisons between BWA and BLASTx/DIAMOND.


Sample 1 (15,758,918 reads, Angiosperms353 target file):

Mapping tool Total reads mapped Genes with sequences Run time *
BWA 1,103,583 310 394
BLASTx 1,304,929 340 2204
DIAMOND** 1,325,989 340 1269

* 30 threads, AMD EPYC 7601 2700 MHz

** Default sensitivity

For genes where both mapping methods recovered a sequence:

  • Average increase in length of genes for BLASTx vs BWA: ~45% (median = 20%)

  • Average increase in length of genes for DIAMOND vs BWA: ~45% (median = 20%)


Sample 2 (611,894 reads, Angiosperms353 target file):

Mapping tool Total reads mapped Genes with sequences Run time (sec) *
BWA 115,784 206 202
BLASTx 35,338 252 312
DIAMOND** 31,707 249 222

* 30 threads, AMD EPYC 7601 2700 MHz

** Default sensitivity

For genes where both mapping methods recovered a sequence:

  • Average increase in length of genes for BLASTx vs BWA: ~30% (median = 4%)

  • Average increase in length of genes for DIAMOND vs BWA: ~25% (median = 1%)


As can be seen from the statistics above, BLASTx takes significantly longer than BWA, and DIAMOND appears to provide a good balance between sensitivity and speed.

Note: if you provide a nucleotide target file to hybpiper assemble via the t_dna/targetfile_dna parameter, but you do not provide the flag --bwa, HybPiper will automatically translate your target file and use BLASTx for reads mapping. It is expected that your nucleotide target file can be translated correctly (i.e. in-frame with no stop codons) from the first codon positions in the forwards frame. For nucleotide target file sequences that either a) were not a multiple of three and hence did not comprise complete codons only; or b) contained stop codons in the translated protein, a warning and list of gene names will be logged to file.

Note: When using BLASTx/DIAMOND, we have found that an eValue of 1e-4 works well, and this is the default. The user can change this using the parameter --evalue.

3.0 Output sequences

3.1 Sequences containing stop codons

Ideally, if you run HybPiper with a target file containing protein-coding sequences comprising exons only, your output *.FNA and .FAA sequences should not contain any internal stop codons (of course, a terminal stop codon at an expected position is fine). If HybPiper detects an internal stop codon in one or more of your output sequences, it issues a warning and writes the corresponding gene names to the file <prefix>_genes_with_non_terminal_stop_codons.txt. This will allow you to check any problematic sequences, and troubleshoot possible causes.

There are three main reasons that HybPiper can produce an output sequence containing an internal stop codon:

  1. One or more incorrect sequences in your target file. For example, if a given target file sequence for a protein-coding gene accidentally contains an intron sequence, and HybPiper selects this target file sequence as the 'best' reference for a given gene, the intron will be included in the query sequence during Exonerate searches of SPAdes contigs. As HybPiper runs Exonerate using the protein2genome model, it expects the query sequence to comprise exon sequences only; the intron sequence in the query could potentially be aligned with the corresponding intron sequence in the SPAdes contig target, and this region will be extracted as putative exonic coding sequence and possibly incorporated into the final *.FNA output sequence for this gene. If the intron sequence contains stop codons, so will your output gene sequence.

  2. The SPAdes assembler has assembled a contig with a stop codon in the middle of exonic sequence, and this exonic sequence aligns well with the target file protein reference in Exonerate alignments. This situation can occur when reads assembled for a given gene (or a subset of the reads) do indeed encode an in-frame stop codon. We have observed that SPAdes can, on occasion, include this stop codon in the assembled contig even if there is a greater number of reads encoding a non-stop codon at this position.

  3. Exonerate has 'incorrectly' extended an alignment between the target file protein query and a SPAdes nucleotide target. This can result in non-coding or intron sequence from the SPAdes target contig getting incorporated in to the output sequence for a given gene. Consider the Exonerate alignment below:

 941 : ValTyrPheGlyAsnValCysLeuArgPheLeuProValValPheAspIleValIleHisArgT : 962
       :!!   |||  ! ! |||  !!!!! !|||! !  !  !|||||||||||||||||||||||||
       IleLeuPheLeu***ValIlePheLeuPheTyrVal***ValPheAspIleValIleHisArgT
 136 : ATTCTGTTTCTTTAAGTGATTTTCCTTTTTTATGTTTAGGTGTTCGACATAGTTATACACAGGT : 199

 963 : yrLeuGluIleProProValThrLysSerLeuGluThrLeuLeuGluHisLeuGlyCysLeuTy : 983
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       yrLeuGluIleProProValThrLysSerLeuGluThrLeuLeuGluHisLeuGlyCysLeuTy
 200 : ACTTGGAGATACCACCAGTCACAAAATCACTGGAGACATTACTGGAACATTTAGGCTGTCTCTA : 262

 984 : rLysPheHisAspArg : 988
       ||||||||||! !!:!
       rLysPheHisGlyLys
 263 : TAAATTTCATGGTAAG : 279

Here you can see that the beginning of the alignment between the target file protein and the SPAdes nucleotide target looks dubious, and actually includes stop codons in the target. A closer examination of the target sequence (not shown) reveals that this region of the Exonerate alignment does indeed correspond to an intron sequence, and should not be included in the *.FNA sequence output by HybPiper.

As of HybPiper version 2.1.1, a new filter has been added to process Exonerate alignments and trim the 5' and 3' termini of SPAdes hits to remove such spurious sequences. Briefly, the new filter applies a sliding window to the Exonerate alignment, and calculates the similarity between the query and target within the window sequence. It then checks if the similarity is above a given threshold, and, if so, no trimming is applied. If the similarity is below the threshold, consecutive sliding windows are tested. Once a sliding window above the threshold is found, the target sequence is trimmed to start at this position (or, more particularly, at the first codon position within the window that is identical to the query, as denoted by |||). The size of the sliding windows can be adjusted using the parameter --exonerate_hit_sliding_window_size when running hybpiper assemble (default size is 3 amino acids), and the similarity threshold is set by the --thresh value (as also used to initially filter Exonerate alignments at a global level; the default value is 55).

For example, in the alignment above the first sliding window starting at the 5' end contains the region:

ValTyrPhe
:!!   |||
IleLeuPhe
ATTCTGTTT

...with a similarity score of 1/3 or 33.3%. As this is below the threshold of 55%, the next sliding window is tested:

TyrPheGly
   |||  !
LeuPheLeu
CTGTTTCTT

...which also has a similarity score of 33%. This process continues until a window above the threshold is found, in this case:

ValValPhe
  !||||||
***ValPhe
TAGGTGTTC

...with a similarity score of 66.6%. The first amino-acid position in the window with an identical match is identified (in this case, the second Val), and the hit sequence is trimmed to begin at this position. So, for this example Exonerate alignment, the hit sequence will be 5' trimmed to begin at:

GTGTTCGACATAGTTATACACAGGT...etc.

The same process is then applied starting from the 3' end of the Exonerate alignment, and the hit sequence is 3' trimmed if necessary.

IMPORTANT: While this filter can be effective in 'cleaning up' Exonerate alignments prior to generating output sequences, it should be noted that the sensitivity and specificity of trimming is dependent on a combination of sliding window size, threshold, and the specific Exonerate alignment for any given hit. Therefore, different hits (and indeed the two termini of a single hit) might be more accurately trimmed with different sliding window sizes and thresholds. Currently, HybPiper applies a single sliding window size and threshold to all sequences for a given run. If you would like to try different options, you can re-run the hybpiper assemble command using the --start_from exonerate_contigs parameter, and vary the --exonerate_hit_sliding_window_size and --thresh values as desired.

The raw Exonerate alignment(s) for a given gene can be found in the exonerate_results.fasta file, located within each gene directory under a subdirectory named after your --prefix. For example, for gene 5166 from a sample run with --prefix ERR3650077, the exonerate_results.fasta file can be found at ERR3650077/5166/ERR3650077/exonerate_results.fasta (also see here). This file can be opened in any text editor.

To determine which of the Exonerate alignments contain SPAdes contig hit sequence that was included in the output gene *.FNA file, open the exonerate_stats.tsv file (found in the same directory) using a spreadsheet program such as Excel, Numbers or Calc. This report contains information on all SPAdes contigs with Exonerate hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by --thresh. To determine whether a given contig hit was 5' or 3' trimmed using the filter described above, the relevant columns are query_HSP_range_limits_original and query_HSP_range_limits_trimmed (coordinates are relative to the amino-acid reference) and/or hit_HSP_range_limits_original and hit_HSP_range_limits_trimmed (coordinates are relative to the nucleotide SPAdes contig). Note that the 3-prime_bases_trimmed column refers to a different trimming stage that occurs later in the pipeline.

Clone this wiki locally