-
Notifications
You must be signed in to change notification settings - Fork 45
Home
Documentation current for HybPiper version 2.3.1
Welcome to the HybPiper wiki! These documents will cover more in depth installation instructions and tutorials for different ways to use HybPiper.
Detecting putative chimeric loci sequences
Troubleshooting, common issues, and recommendations
HybPiper expects "clean" sequencing reads in FASTQ format. What "clean" means is up to the user, but we recommend Trimmomatic or fastp, which can remove poor quality reads, trim poor quality base calls, and remove adapter sequences. If you have sequencing reads from Illumina NextSeq or Illumina NovaSeq, fastp
is preferred, to remove homodimers from the ends of reads. If you are interested in calculating coverage depth exactly, you may also want to remove PCR duplicates, but this is not necessary for gene recovery.
HybPiper can accept either single-end or paired-end data. You may also include a file containing unpaired reads with the --unpaired
flag in addition to paired-end data. Read files can be in compressed gzip format (suffix .gz
).
It is a good idea to rename your read-files so that they contain a unique identifier for each sample. This will be useful for running HybPiper automatically on many samples. For an example, see the test dataset in the Tutorial.
The figure above summarizes a typical exon enrichment setup that targets coding regions of genes for phylogenetics. The baits (short lines) are tiled across the exons. HybPiper uses the coding sequence portion of a gene for both mapping sequencing reads and for alignment of contigs. When generating a "target file" for HybPiper (see below), exon sequences should be concatenated for each gene. Another way to think of the target sequence is as the mature RNA sequence produced from this gene.
The reason HybPiper treats coding sequences this way is to more accurately detect the exon/intron boundary when aligning the assembled contigs against the reference sequence. HybPiper will align assembled contigs to the full gene sequence and will separate exon sequences from intron sequences.
HybPiper uses target sequences to first sort the reads and then extract exon sequences from assembled contigs, for each gene separately. In order to run HybPiper, you will need to prepare a "target file", which is a FASTA file that contains one or more homologous sequences for each gene.
If you are using HybPiper on data generated using HybSeq, the target file should contain the full-length reference sequences for each gene, NOT the probe sequences.
IMPORTANT: if you are using BLAST or DIAMOND to map reads to targets, you need a PROTEIN target file. If you are using BWA, you need a NUCLEOTIDE target file. If you supply a nucleotide target file but omit the flag --bwa
, HybPiper will translate your target file and map reads to targets using BLAST.
Construct a "target file" of protein sequences in FASTA format. It is ok to have more than one "source sequence" per target (e.g. orthologs from different species).
The ID for each sequence should include the target source and the protein ID, separated by a hyphen. The protein ID should not include single or double quotation marks (i.e. '
or "
). For example:
>Amborella-atpH
MNPLISAASVIAAGLAVGLASIGPGVGQGTAAGQAVEGIARQPEAEGKIRGTLLLSLAFM
>Aneura-atpH
MNPLIPAASVIAAGLAVGLASIGPGIGQGTAAGQAVEGIARQPEAEGKIRGTLLSSPASM
>Amborella-rbcL
MSPKTETKASAGFKAGVKDYRLTYYTPDYETLATDILAAFRVTPQPGVPPEEAGAAVAAE
>Aneura-rbcL
MSPQTETKAGVGFKAGVKDYRLTYYTPEYETKETDILAAFRMTPQPGVPPEEAGAAVAAE
The construction of the target file can have a big impact on gene recovery. For a larger discussion on constructing target files, see here.
If you previously used HybPiper 1.3 or earlier, use the table below to see how the older script names correspond to the new subcommands in Hybpiper 2.0 and newer.
Command description | HybPiper 2.0 command | Old HybPiper 1.3 command/script |
---|---|---|
Assemble contigs and extract sequences | hybpiper assemble |
reads_first.py |
Extract intron and/or supercontig sequences | hybpiper assemble --run_intronerate |
intronerate.py |
Get run statistics | hybpiper stats |
get_seq_lengths.py , hybpiper_stats.py
|
Clean up intermediate files | Not needed, now run automatically. If desired, SPAdes assembly folders and other intermediate files can be retained via hybpiper assemble --keep_intermediate_files
|
cleanup.py |
Recover gene (nucleotide or amino-acid), intron or supercontig sequences | hybpiper retrieve_sequences |
retrieve_sequences.py |
Get gene recovery heatmap | hybpiper recovery_heatmap |
gene_recovery_heatmap.R , gene_recovery_heatmap_ggplot.R
|
Recover paralog sequences | hybpiper paralog_retriever |
paralog_investigator.py , paralog_retriever.py
|
For full details on parameters available for each command, please see Full Pipeline Parameters.
For a full tutorial using a test dataset, check out the Tutorial wiki.
HybPiper is run separately for each sample (single or paired-end sequence reads, with an optional file of unpaired reads in the latter scenario). When HybPiper generates sequence files from the reads, it does so in a standardized directory hierarchy. Many of the post-processing scripts rely on this directory hierarchy, so do not modify it after running the initial pipeline. It is a good idea to run the pipeline for each sample from the same directory. You will end up with one directory per run of HybPiper, and some of the later commands take advantage of this predictable directory structure.
To execute the entire pipeline, create a directory containing the paired-end (in this example) read files. The command hybpiper assemble
will create a directory based on the fastq filenames (or use the name provided via the --prefix
flag):
Anomodon-rostratus_L0001_R1.fastq
---> Anomodon-rostratus/
The following command will execute the entire pipeline on a pair of Illumina read files, using the targets in the file targets.fasta
:
hybpiper assemble -r MySpecies_R1.fastq MySpecies_R2.fastq -t_aa targets.fasta
The BLASTx version of the pipeline (default) will map the reads to amino-acid target sequences. Although it is slower than the BWA version, it may have higher specificity. Reads may not align to divergent nucleotide target sequences, which are required for the BWA version. You can also use DIAMOND as a faster alternative to BLAST by using the flag --diamond
.
If you find the recovery efficiency is poor with BWA, you may want to try again with BLASTx or DIAMOND. See here.
NOTE: In cases where HybPiper recovers sequence for multiple non-contiguous segments of a gene, the gaps between the segments will be padded with a number of 'N' characters. The number of Ns corresponds to the number of amino acids in the 'best' protein reference for that gene that do not have corresponding SPAdes contig hits, multiplied by 3 to convert to nucleotides. See this issue for some more details. You can turn off this padding with the flag --no_pad_stitched_contig_gaps_with_n
.
Optional commands after running the pipeline for multiple assemblies. NOTE: for these commands to work, the files must be in the same directory hierarchy created by the pipeline. (i.e. SampleName/GeneName/SampleName/sequences/FAA/
and SampleName/GeneName/SampleName/sequences/FNA/
).
This command will summarize the recovery of genes from multiple samples. If you have all of these separate runs of hybpiper assemble
in the same directory, create a namelist.txt
file that contains a list of all the HybPiper directories for each sample (one per line), e.g.:
Sample1
Sample2
Sample3
Specify the location of the target file and whether it is amino acid or nucleotide on the command line. Specify the namelist.txt
file and the type of sequence you want to recover statistics for (gene
or supercontig
):
Example Command Line:
hybpiper stats -t_aa test_targets.fasta gene namelist.txt
The command will write a table in *.tsv
format (default file name is seq_lengths.tsv
, but this can be changed using the --seq_lengths_filename
parameter). The first line is a header containing the gene names. The second line contains the length of the reference for each gene. If there are multiple reference sequences for each gene, an average value is reported. Note: even if a target file containing amino-acid sequences was used, the values will correspond to length in nucleotides. The remaining lines are the lengths recovered by hybpiper assemble
for each sample, with one sample per line (one column per gene). If the gene is missing, a 0 is present.
If gene
sequences are requested, a warning will print to stderr
if any sequences are longer than 1.5x the average reference length for that gene. This warning will not be printed in supercontig
sequences are requested, as these are expected to be substantially longer than the refernce length in some cases, due to introns.
Example output:
Species | rbcL | trnL |
---|---|---|
MeanLength | 3252.0 | 1821.6 |
funaria | 3660 | 1758 |
timmia | 3057 | 1821 |
The hybpiper stats
command will then summarize target enrichment and gene recovery efficiency for a set of samples. The output is a table in *.tsv
format (default file name is hybpiper_stats.tsv
, but this can be changed using the --stats_filename
parameter) with one sample per line and the following statistics:
ColumnName | Description |
---|---|
Name | Sample name corresponding to the parameter --prefix or generated from read files. |
NumReads | Total number of input reads in the *.fastq files provided. |
ReadsMapped | Total number of input reads that mapped to sequences in the target file. |
PctOnTarget | Percentage of reads in the input *.fastq files that mapped to sequences in the target file. |
GenesMapped | Number of genes in the target file that had reads mapped to their representative sequence(s). |
GenesWithContigs | From genes with mapped reads, the number of genes for which SPAdes assembled contigs. |
GenesWithSeqs | From genes with SPAdes contigs, the number of genes that had coding sequences extracted after Exonerate searches of contigs vs the target file reference sequence. |
GenesAt25pct | Number of genes with sequences > 25% of the mean target length. |
GenesAt50pct | Number of genes with sequences > 50% of the mean target length. |
GenesAt75pct | Number of genes with sequences > 75% of the mean target length. |
GenesAt150pct | Number of genes with sequences > 150% of the mean target length. |
ParalogWarningsLong | Number of genes that had more than one coding sequence extracted from different contigs by Exonerate searches, each covering more than 75% (default) the length of the selected target file reference sequence. |
ParalogWarningsDepth | Number of genes where the coverage depth of coding sequences extracted by Exonerate was >1 for 75% (default) the length of the selected target file reference sequence. Coding sequences can be of any length. |
GenesWithoutStitchedContigs | Number of genes with sequence derived from a single SPAdes contig. |
GenesWithStitchedContigs | Number of genes with sequence derived from multiple SPAdes contigs. |
GenesWithStitchedContigsSkipped | Number of genes for which stitched contig creation was skipped (i.e. if the flag --no_stitched_contig was provided to the command hybpiper assemble ). |
TotalBasesRecovered | Total number of nucleotides summed from all gene sequences, not including any N characters. |
This command takes the sequence length table output of hybpiper stats
and creates a figure to visualize the recovery efficiency.
Example command line:
hybpiper recovery_heatmap seq_lengths.tsv
By default, the command saves the figure as a *.png
file at 300 DPI. This can be changed via the parameters --heatmap_filetype
and --heatmap_dpi
, respectively. For full command options, use hybpiper recovery_heatmap -h
.
The output will look something like this:
Each row shows a sample, and each column is a gene. The amount of shading in each box corresponds to the length of the gene recovered for that sample, relative to the length of the reference (target coding sequence), or mean reference length if there is more than one reference for a given gene.
In this case, there are a few samples for which few or no genes were recovered (white boxes in rows) and a few genes that had poor recovery across samples (white and light gray).
This command fetches either:
- The sequences recovered from the same gene for all samples; generates an unaligned multi-FASTA file for each gene.
- The sequences recovered from each gene for a single sample; generates a single multi-FASTA file.
Make sure you have all the runs in the same directory. Gene names will be retrieved from the target file used in the run of the pipeline.
Example command line (all samples):
hybpiper retrieve_sequences -t_aa targets.fasta dna --sample_names namelist.txt
You must specify whether you want the protein (aa
) or nucleotide (dna
) sequences.
If you ran intronerate
on these samples via supplying the flag --run_intronerate
to the command hybpiper assemble
, you can also specify "supercontig
" or "intron
" to recover those sequences instead.
By default, hybpiper retrieve_sequences
will output unaligned fasta files (one per gene) to the current directory. Alternatively, an output directory can be specified using the --fasta_dir
parameter, e.g. --fasta_dir 01_dna_seqs
You can also filter which samples genes will be recovered from based on columns in the hybpiper_stats.tsv
file (output by the command hybpiper stats
), via the --filter_by
parameter. See hybpiper retrieve_sequences -h
for details.
Example Command Line (single sample):
hybpiper retrieve_sequences.py -t_aa targets.fasta dna --single_sample_name sample_name
As for above, you can retrieve protein (aa
) or nucleotide (dna
) sequences, as well as intron
and supercontig
sequences if Intronerate was run.
This command filters gene/sample combinations based on length filters for each gene. Two filters are available: minimum length, and percentage of the mean length of the target file sequences for a given gene. The command assumes you have already run hybpiper stats
and hybpiper retrieve_sequences
as input.
Note: As of HybPiper 2.1.6, the hybpiper retrieve_sequences
command can filter sequences at a "whole project" level - for example, removing a sample if fewer than 20% of genes were recovered. The hybpiper filter_by_length
offers a more granular filtering.
Suggested Workflow:
- Run
hybpiper stats
to generate thehybpiper_stats.tsv
andseq_lengths.tsv
files. - Run
hybpiper retrieve_sequences
to create a folder of FASTA sequences. - Run the
hybpiper filter_by_length
command to create new FASTA files based on the per-gene filters. This command also writes a per-genedenylist
to file.
The input FASTA sequences should have the naming scheme used by HybPiper:
-
geneName.FNA
for nucleotide exon files -
geneName.FAA
for amino-acid files -
geneName_supercontig.fasta
for supercontig files -
geneName_intron.fasta
for intron-only files
The gene names will be extracted from either the hybpiper_stats.tsv
file (provided with the option --seq_lengths_file
) or a supplied list of gene/sample combinations (option --denylist
; also produced by running this script).
There are two filters, --length_filter
for the minimum length to accept a sequence (for all genes) and --percent_filter
for a fraction of the mean length determined from the seq_lengths.tsv
file for each gene. For example:
hybpiper filter_by_length dna --seq_lengths_file seq_lengths.tsv --length_filter 90 --percent_filter 0.80
If you wish to filter intron or supercontig sequences, run hybpiper filter_by_length
a second time with the --denylist
option to skip the filtering based on lengths or percentages:
hybpiper filter_by_length dna --denylist denylist.txt
In many HybSeq target capture designs, care is taken to avoid enrichment for genes with paralogous sequences in the target genomes. However, gene duplication and paleopolyploidy (especially in plants) makes it difficult to completely avoid paralogs. Given enough read depth, it may be possible to identify paralogous sequences with HybPiper.
If the SPAdes assembler generates multiple contigs that contain coding sequences representing >=75% (default value) of the length of the reference protein, HybPiper will print a warning for that gene. It will also write the names of the contigs in <prefix>/<gene_name>/<prefix>/paralog_warning_long.txt
, and will write a collated list of all genes with paralog warnings to <prefix>/<prefix>_genes_with_long_paralog_warnings.txt
.
If many of your genes have paralogs, one approach could be to add the paralog coding sequence to your target file as a separate gene, and re-run HybPiper. Reads that have a better mapping to the paralog will be sorted accordingly.
For more in-depth discussion about paralogs, see the examples in the paralog tutorial.
The major steps of the pipeline include:
-
Blast, DIAMOND, or BWA search of the reads against the target sequences.
-
Distribution of reads into separate directories (one per gene).
-
Assembly of reads for each gene into contigs using SPAdes.
-
Conduct an Exonerate search using the protein sequence of the 'best' target reference as the query, and the SPAdes nucleotide contigs as targets. If multiple contigs match the same protein (correcting for overlapping or fully-nested contigs), stitch the hits together into a "stitched contig". Alternatively, if the option
--not_protein_coding
is supplied tohybpiper assemble
, use the 'best' untranslated target reference as the query, and extract sequence hits with BLASTn rather than Exonerate. -
In a subdirectory, generate separate FASTA files containing either the nucleotide (FNA) or amino acid (FAA) sequence for each protein.
-
In a subdirectory, generate a FASTA file containing any 'long' paralog sequence present for a given gene/sample.
-
If the flag
--run_intronerate
was provided to thehybpiper assemble
command, generate separate FASTA files containing the intron or supercontig (exons and introns) sequences for each protein (if present for a given gene/sample).