Project Links

This repository is part of the CI-SpliceAI software package published in PLOS One.

This is the project to train the models. You may also be interested in the code to use trained models to annotate variants offline, code comparing different tools on variant data, and the website providing online annotation of variants.

Setup

We strongly advise you to install conda and create a new conda environment. This keeps your projects and their dependencies separate. So go ahead and install conda first.

Create environment with GPU support

Our models were trained on two GTX 1060 TIs with CUDA 11.0. The versions pinned here work for our architecture; you might need to change the cudnn and tensorflow versions (see https://www.tensorflow.org/install/source_windows#gpu for tested combinations).

conda create --yes -n cis python=3 keras=2.0.5 tensorflow-gpu=1.4.1 cudnn=7.0.5 matplotlib "h5py=<3=mpi*" numpy requests pandas pyfaidx mpi4py tensorboard scikit-learn -c bioconda -c conda-forge

Notes:

  • You can substitute tensorflow-gpu with tensorflow for local development without a GPU.
  • Using newer versions of keras or tensorflow worsened convergence, so be careful when upgrading!
  • Do not use h5py version 3 or you won't be able to load h5 models.
  • You can remove the "=mpi*" suffix from h5py, but then you cannot use the multi-CPU "mpiexec" command later; run python directly instead.
  • There are some runtime warnings from mpi4py, which can be ignored.
  • Do not update or unpin keras, tensorflow, or cuda. Newer versions will either prevent the model from converging during training or produce bad scores (I don't know why!).
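Because the notes above depend on exact versions, a quick check after creating the environment can catch an accidental upgrade. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and the pins are copied from the conda command above. It takes the installed versions as a plain dictionary, which you could fill from keras.__version__, tensorflow.__version__ and h5py.__version__ inside the cis environment.

```python
import re

# Pins copied from the conda command above; keys are Python import names.
PINNED = {"keras": "2.0.5", "tensorflow": "1.4.1"}
H5PY_FIRST_BAD_MAJOR = 3  # h5py 3 and later cannot be used, per the notes


def version_tuple(version):
    """Turn '2.0.5' into (2, 0, 5); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_versions(installed):
    """Return a list of problems; an empty list means every pin holds."""
    problems = []
    for name, pinned in PINNED.items():
        found = installed.get(name)
        if found is None:
            problems.append("{} is missing (expected {})".format(name, pinned))
        elif version_tuple(found) != version_tuple(pinned):
            problems.append("{} {} differs from pin {}".format(name, found, pinned))
    h5py_version = installed.get("h5py")
    if h5py_version is not None:
        if version_tuple(h5py_version)[:1] >= (H5PY_FIRST_BAD_MAJOR,):
            problems.append("h5py {} is too new (need < 3)".format(h5py_version))
    return problems


# Example with hand-written versions (hypothetical values for illustration).
for problem in check_versions({"keras": "2.3.1", "tensorflow": "1.4.1", "h5py": "3.1.0"}):
    print(problem)
```

The check covers only Python packages; cudnn and CUDA are not importable modules and would need to be inspected separately, for example with conda list.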

Project Structure

All scripts are to be run in the cis environment (run conda activate cis before running the script).

splice_table.py

This script re-creates the splicing tables checked into the data/ folder. Since all splicing tables are already checked in, you don't need to run it for the default behaviour of collapsed isoforms.

data/splicing_*

There is one train table and two annotation tables. The annotation tables are used in the CI-SpliceAI python module to annotate the most significantly affected gene. The train file combines genes in proximity to one another so that they do not produce disagreeing ground truth.

create_data.py

This script creates the machine learning data. You need to run it before training the model. If you installed h5py with MPI support as suggested above, we recommend running it with multiprocessing enabled: mpiexec -n <num-processes> python create_data.py
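If you launch data creation from another script, the choice between the MPI and the plain invocation can be made explicit. This is a sketch under assumptions: the helper names and the python parameter are hypothetical, and finding mpiexec on the PATH does not prove that h5py was built with MPI support.

```python
import shutil


def mpi_available():
    """Return True if an mpiexec executable is on the PATH."""
    return shutil.which("mpiexec") is not None


def create_data_command(num_processes, use_mpi=True, python="python"):
    """Build the argument list for create_data.py.

    With use_mpi=False (or a single process) the script is run directly,
    as the setup notes suggest for h5py builds without MPI.
    """
    script = [python, "create_data.py"]
    if use_mpi and num_processes > 1:
        return ["mpiexec", "-n", str(num_processes)] + script
    return script


command = create_data_command(4, use_mpi=mpi_available())
print(" ".join(command))
# To actually run it from the repository root:
# import subprocess; subprocess.run(command, check=True)
```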

train.py

Trains the model. You need to run create_data.py first (see above)! Index should be 1-5 to train the respective model. Fold is either ALL or TEST: ALL trains on all chromosomes, and TEST excludes the test data (see TEST_FOLD in env.py for a list of test chromosomes).

Depending on your distribution, you may need to tell keras to use the tensorflow backend instead of theano (run export KERAS_BACKEND=tensorflow; python train.py <index>).
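Training all five models means repeating that command once per index. The sketch below builds the commands and an environment with KERAS_BACKEND set; the helper names are hypothetical. It passes only the index, as in the command shown above, because how the fold is passed to train.py is not shown here.

```python
import os

MODEL_INDICES = (1, 2, 3, 4, 5)


def training_env(base_env=None):
    """Copy the environment and force the tensorflow backend for keras."""
    env = dict(os.environ if base_env is None else base_env)
    env["KERAS_BACKEND"] = "tensorflow"
    return env


def training_commands(indices=MODEL_INDICES):
    """Return one 'python train.py <index>' argument list per model index."""
    commands = []
    for index in indices:
        if index not in MODEL_INDICES:
            raise ValueError("index must be 1-5, got {}".format(index))
        commands.append(["python", "train.py", str(index)])
    return commands


for command in training_commands():
    print("KERAS_BACKEND=tensorflow", " ".join(command))
    # To actually train (sequentially) from the repository root:
    # import subprocess; subprocess.run(command, env=training_env(), check=True)
```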

aggregate.py

This aggregates and optimises the five models into one frozen graph that takes the majority vote of the five CNNs. Train your models first. The resulting file, models/ALL/CI-SpliceAI.pb, is the one ready to be used with the CI-SpliceAI python inference module.

By default, the script assumes you want to build an ensemble trained on ALL chromosomes and packages the models with indices 1,2,3,4,5 together. You can change this; running with no parameters is equivalent to this command:

python aggregate.py --models=1,2,3,4,5 --folder models/ALL --output models/ALL/ensemble_frozen.pb
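To make the idea of a majority vote over five models concrete, here is a small standalone illustration. It is not the code in aggregate.py, which works on a frozen tensorflow graph; the class labels and the tie-breaking rule are assumptions chosen for the example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; on a tie, the label seen first wins."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_labels(per_model_labels):
    """Combine per-position labels from several models, position by position."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_labels)]


# Five hypothetical models, each labelling the same three positions.
predictions = [
    ["neither", "donor", "acceptor"],
    ["neither", "donor", "neither"],
    ["neither", "neither", "acceptor"],
    ["donor", "donor", "acceptor"],
    ["neither", "donor", "neither"],
]
print(ensemble_labels(predictions))  # ['neither', 'donor', 'acceptor']
```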

test.py

Tests models trained on the TRAIN fold on the remainder of the data.

Paralog annotations

Paralog annotations are used only when using a train/test split, to exclude paralogs from the test split. They are not used when training the final models on all chromosomes.
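The filtering idea can be sketched as follows. This is an illustration only: the gene identifiers are made up, the function name is hypothetical, and the pairs are given in memory because the column layout of data/paralogs_GRCh38.csv is not described here.

```python
def exclude_paralogs(test_genes, train_genes, paralog_pairs):
    """Drop test genes that have a paralog among the training genes.

    paralog_pairs is an iterable of (gene_a, gene_b) tuples; the relation
    is treated as symmetric.
    """
    train = set(train_genes)
    blocked = set()
    for gene_a, gene_b in paralog_pairs:
        if gene_a in train:
            blocked.add(gene_b)
        if gene_b in train:
            blocked.add(gene_a)
    return [gene for gene in test_genes if gene not in blocked]


# Hypothetical identifiers: G2 has a paralog (G4) in the training set.
kept = exclude_paralogs(
    test_genes=["G1", "G2", "G3"],
    train_genes=["G4", "G5"],
    paralog_pairs=[("G2", "G4"), ("G3", "G6")],
)
print(kept)  # ['G1', 'G3']
```

G3 is kept because its only paralog, G6, is not a training gene.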

The file data/paralogs_GRCh38.csv was created with Ensembl BioMart on 30/03/22. Follow these steps to re-create the file with the newest Ensembl data:

  1. Click this link. If the page looks wonky you might need to refresh.
  2. Change the left dropdown to Compressed web file (notify by email)
  3. Ensure that the second dropdown is on CSV
  4. Ensure that Unique results only box is checked
  5. Enter your email address
  6. Click the Go button
  7. New text should appear at the bottom telling you that the file is being created in the background; wait for the email.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.