- Biopython
- Numpy
- Pandas
- Proteinortho (https://gitlab.com/paulklemm_PHD/proteinortho)
- PRANK
Multiple sequence aligner (http://wasabiapp.org/software/prank/) - v.170427
Add environment variableexport PATH="$PATH:/path/to/dir/prank/bin/
- GUIDANCE (http://guidance.tau.ac.il/) - v2.02
GUIDANCE allows using alignment (MAFFT, PRANK, CLUSTALW) as a subprocess - Gblocks (http://gensoft.pasteur.fr/docs/gblocks/0.91b/)
- SWAMP (https://github.com/peterwharrison/SWAMP)
- PAML (Phylogenetic Analysis by Maximum Likelihood)
See --help for help with arguments for any python script, logs are named as scripts with .log extension or just as standard output.
Output directory (with --o or --out option) will be created automatically and hold all output files [absolute path required].
Errors can be found by keyword 'WARNING' in logs.
Broken files are collected either automatically and write in summary file.
If there are any questions, errors, suggestions feel free to contact me via email
[email protected].
Obtain one-to-one ortholog clusters for whole-genome sequences.
- It is necessary for futher analysis to name the species in numbers (1,2,3,...) and to name all associated files in numbers
(1.faa, 2.faa..., 1.gbff, 2.gbff...) - Extract from gff file protein-coding genes and CDS types obtained by methods for RefSeq if needed:
python utilities/form_refseq_gff.py --i /abspath/gff_folder
- Form a table of genes that will be analyzed from .gff files:
python utilities/form_gene_table_from_gff.py --i /abspath/gff_(corrected)_folder
- launch_proteinortho
For example, launch_proteinortho.sh or launch_proteinortho_synteny.sh. Result file will be needed for futher analysis: project_name.proteinortho.tsv or project_name.poff.tsv if use synteny; - form ortologs table
Form a ortologs table with fulfillment of requirements: single-copy orthologs only, group (minimal number of species in group), required species (one target species in relation to which the analysis is made). Result will be recorded to the file with prefix 'formed', for example, formed_project_name.poff.tsv;
python pipeline/get_orthologs_table.py --ortho /path/tothe/project_name.poff.tsv --species 8 --group 6
--required 6
If all of needed species have ncbi annotations, we can extract orthologs from it.
This requires defining variables in pipeline/orthologs_from_annotation.py
script:
- annotation_path_folder (folder with .gff files for every species)
- result_file_path (path to the result .xlsx file that will be written)
Merge or concatenate results, write paths and options into script and run
python pipeline/concatenate_orthologs.py
See example of file data/orthologs_table.tsv
.
Annotation *.gbff
or cds_from_genomic.fna
are needed, genomes files *.fna
are used for additional check. If you do not have one of these options, give path to empty folder. Result: .fna files with sequences and .log files with summary for every set of orthologs. See log in "get_ortho_nuc_seqs.log".
python pipeline/get_ortho_nucleotides.py --ortho /abspath/to/thetable/project_name_formed_orthologs_table.tsv
--gbff /abspath/tothe/gbff_folder/gbff/ --cds /abspath/tothe/cds/cds_refseq/
--genome /abspath/tothe/fnafiles/genomes/ --species 8 --group 6 --out /abspath/tothe/nuc_out_folder/
Sequences will be sorted to subfolders with unique names (group names)
corresponding to set of species in fasta file (sorted in increasing order). For example:
$ cd /abspath/tothe/nuc_out_prank/
$ ls
1.fasta 2.fasta 3.fasta
$ less 1.fasta $ less 25.fasta $ less 34.fasta
>1 >2 >3
ATG... ATG... ATG...
>2 >3 >2
ATG... ATG... ATG...
Result:
$ cd /abspath/tothe/nuc_out_prank/
$ ls */
12/: 23/:
1.fasta 25.fasta
34.fasta
Broken files will be moved to '/abspath/tothe/nuc_out_folder/broken_species_files' folder and written to '/abspath/tothe/nuc_out_folder/broken_species.tsv' (can be replaced to exclude from further analysis). Summary will be written to "get_ortho_nuc_result.xlsx".
python pipeline/get_nucleotides_target_genes.py --t /path/to/orthologs.xlsx --gbff /path/to/folder_gbff_annotations --o /path/to/output_folder
Perform additional check to exclude duplicates within one sample
python utilities/check_duplicates.py --i /abspath/tothe/nuc_out_folder(from_step2.1)/
Produce codon-based nucleotide sequence alignments for all the one-to-one ortholog clusters.
PRANK may be used separetly or as a subprocess of GUIDANCE.
Option --tree isn't adapted to work with groups, therefore it should be used if there is only one group (i.e args in get_orthologs_table.py
group==species) or run separately for each group.
python pipeline/prank_alignment.py --i /abspath/tothe/nuc_out_folder --o /abspath/tothe/nuc_out_prank/ --a t --threads 32
Summary file will be written in 'nuc_out_prank/prank_summary.xlsx'
NOTE: this step can take a lot of computation time.
If you use MAFFT or CLUSTAL (not PRANK) as a subproces of GUIDANCE you should correct --msaProgram
option
in the line
GScour/pipeline/guidance_alignment.py
Line 47 in 4748195
python pipeline/guidance_alignment.py --i /abspath/tothefolder/with_nucseqs/ --o /abspath/tothe/guidance_out/
--exec /abdpath/guidance.v2.02/www/Guidance/guidance.pl --threads 22
The resulting files (alignments .fas only, without auxiliary files) are stored in cleansed folder/abspath/tothe/guidance_out/cleansed/
.
Use parameter --auto for automatic selection of gblocks parameters based on number of sequences for each group or adjust parameters to your needs in the params_string:
GScour/pipeline/gblocks_alignment.py
Line 35 in 7bd2857
python pipeline/gblocks_alignment.py --i /abspath/tothe/nuc_out_prank/ --auto y --exec /abspath/Gblocks_0.91b/Gblocks --threads 2
Phylogenetic trees must be provided for every species group in format as required for PAML.
Name of tree should be the same as name of species folder ('12' -> '12.tree'). Put trees to separate folder.
Step can be skipped to 4.3.1 if the order is known.
python utilities/replace_for_test_order.py --i /abspath/tothe/nuc_out_prank/ --o /abspath/tothe/test_order/
utilities/fasta2paml_get_order.py
, see --help for args. Folder with .order files can be empty, files with right order will be recorded to that folder.
Skip if right order was set in the step 4.2. The script (fasta2paml.py) consists of two stages:
- Converting fasta format nucleotide codon sequences (from input directory) to philip-sequential format (to output directory)
- Converting philip-sequential format to specific philip format required by PAML:
In resulting out_dir: directory of name "group_id" with folders "file_name" with file_name.phy file for PAML. For example:
$ cd /abspath/tothe/nuc_out_prank/
$ ls */
12/: 23/: 12345/:
1.fasta 4055.fasta 2031.fasta
3010.fasta 2.fasta
Result:
$ cd out_dir
$ ls */
12/: 23/: 12345/:
1/: 4055/: 2031/:
1.phy 4055.phy 2031.phy
3010/: 2/:
3010.phy 2.phy
In --i and out --o folders can be the same. Making backups is necessary for further analysis.
python pipeline/fasta2paml.py --i /abspath/tothe/nuc_out_prank/ --o /abspath/tothe/nuc_out_prank/ --species 8 --group 6
See fasta2paml.log
in working directory.
Test PAML (codeml) for knowing right order for sequences to exclude PAML's errors (Some reference to tree file format from PAML manual: "The species can be represented using either their names or their indexes corresponding to the order of their occurrences in the sequence data file.", but there may be some nuances here).
Test can be performed with launching the one ratio PAML model with script 'paml_one_ratio_model.py'
,
see --help for arguments: option --e can be skipped if use codeml from biopython (Bio.Phylo.PAML), use option '--rework y' if want to overwrite existing paml files. This script writes .ctl file and launches one ratio model.
python pipeline/paml_one_ratio_model.py --i /abspath/tothe/nuc_out_prank/ --tree /abspath/folder_trees/ --threads 22
See 'paml_one_ratio.log'
, further testing may be continued in separate item's folders just from command line.
- After having known about the right orders, .order files should be placed to separate folder. Name of .order file should be the same as name of species folder ('12' -> '12.order');
- Launch 'fasta2paml_ordering.py' (can be launched on the backup folder), right order will be set automaticaly:
python pipeline/fasta2paml_ordering.py --i /abspath/tothe/nuc_out_prank/ --order /abspath/folder_orders/
--o /abspath/for_paml/ --species 8 --group 6
See'fasta2paml_ordering.log'
in working directory.
See'broken_files.xlsx'
in output directory.
See launch example above in the step 4.3.1.1.
Sliding window approach SWAMP to mask regions of the alignment with excessive amino acid changes.
- Construct branchcodes for every species group:
- required tree view for automatic build branchcode (items are separated by spaces): '(1, ((2, 3), ((6), 8, 4)));'
- if you have folder with marked trees for paml, you can clean it from label (#1) and insert spaces with sed stream editor:
sed -i 's/ #1//' *
sed -i 's/,/, /g' *
python pipeline/construct_branchcodes.py --i /abspath/tothe/nuc_out_prank/ --t /abspath/folder_trees_clean/
--b /abspath/folder_for_branchcodes/
- Launch SWAMP:
'swamp_script.py'
for python3 environment,'swamp_script_py2.py'
for python2 envoronment;- use modified version of SWAMP executable
GScour/SWAMP_ordered.py
to conserve right order.
python pipeline/swamp_script_py2.py --e /GScour/SWAMP_ordered.py --i /abspath/tothe/nuc_out_prank/
--b /abspath/tothe/branchcodes/ --t 2 --w 20
See stdout and'swamp_log.log'
.
Use global variable'target_dict'
in swamp_script.py if running on individual files is needed:
target_dict[species_folder] = [item_folder1, item_folder2...]
4.8 Perform maximum likelihood (ML) dN/dS analysis to infer positive selection of genes and codons, using codeml from the PAML software package.
Branch-site model
There are two hypothesis:
H0 (The null model for Branch-site model A):
Model A1: model = 2, NSsites = 2, fix_omega = 1, omega = 1
fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
kappa = 2 * initial or fixed kappa
fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate
omega = 1 * initial or fixed omega, for codons or codon-based AAs
H1 (Alternative model, Model A: model = 2, NSsites = 2, fix_omega = 0 ):
fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
kappa = 2 * initial or fixed kappa
fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
omega = 1 * initial or fixed omega, for codons or codon-based AAs
From readme of paml example lysozymeLarge.ctl
:
Alternative hypothesis (branch site model A, with w2 estimated):
model = 2 NSsites = 2 fix_omega = 0 omega = 1.5 (__or any value > 1__)
As the branch-site model is known to cause computational difficulties for the numer-
ical optimization algorithm in PAML, each analysis is conducted three times with
different starting values to ensure that the global peak is found (Statistical Properties of the Branch-Site Test of Positive Selection, Ziheng Yang and Mario dos Reis)
- PAML launch
For masking files (after SWAMP) launch, for example:
python pipeline/masked_paml_branch_site_model.py --timeout 1000 --i /abspath/tothe/for_paml/ --tree /abspath/folder_trees/ --threads 64 --rework y
For files without masking:
python pipeline/paml_branch_site_model.py --timeout 1000 --i /abspath/tothe/for_paml/ --tree /abspath/folder_trees/ --threads 64 --rework y
See --help for args. - Additional check
Runutilities/count_correct_rst_files.py
to check number of correct auxiliary files required for paml analysis.
python pipeline/paml_out_analysis.py (paml_out_analysis_masked.py) --i /abspath/tothe/for_paml/
--log /abspath/tothe/nuc_out_folder/ --required 6 --p 0.05
Results will be written to /abspath/tothe/for_paml/common_sheet.xlsx
, also will be written in every species folder to a file named 'name_of_species_folder.result' and summary to stdout.
There are sheets for every species group in 'common_sheet.xlsx', each with the following columns:
- dN/dS (w) for site classes (0 1 2a 2b) (see PAML manual)
- proportion (proportion of sites that have those omega values, for each site class)
- background w (omega values on the background, for each site class)
- foreground w (omega values on the foreground, for each site class)
- P-value, likelihood from positive vs likelihood from neutral model.
- Take into account the information from (Yang, 2007): "The branch-site test requires a priori specification of the foreground branches. When multiple branches on the tree are tested for positive selection using the same data set, a correction for multiple testing is required (Anisimova and Yang 2007). A simple and slightly conservative procedure is Bonferroni's correction, which means that the individual test for any branch is considered significant at the level α only if the p-value is <α/m, where m is the number of branches being tested using the same data."
- Use
'utilities/adjust_pvalue.py'
script to get FDR or Bonferroni significance information, see --help for args and adjust p-value level if necessarily in lines: https://github.com/ctlab/GScour/blob/ea7787697646523899088ef94ccb1e5f3cb0e935/utilities/adjust_pvalue.py#L24
https://github.com/ctlab/GScour/blob/ea7787697646523899088ef94ccb1e5f3cb0e935/utilities/adjust_pvalue.py#L26
Script 'assembling_results.py'
collects final data from sheets 'summary' for every result common sheet: concatenate and remove duplicates.