This repository is part of the CI-SpliceAI software package published in PLOS One.
This is the project for offline variant annotation using the final models. You may also be interested in the code to train CI-SpliceAI, code comparing different tools on variant data, and the website providing online annotation of variants.
We strongly advise you to install conda and create a new conda environment. This keeps your projects and their dependencies separate. So go ahead and install conda first.
CI-SpliceAI can be run on your CPU or GPU. CPU is much easier to set up, but inference is much slower than using a GPU. If you only need to annotate a few variants (less than 50 in total), don't bother setting up GPU support.
If you don't need GPU support, you can skip the next paragraph.
For experienced users only. GPU is only supported for CUDA devices (i.e. NVIDIA graphic cards).
We suggest you install conda, CUDA and tensorflow-gpu first and ensure that your GPU is being used by tensorflow (by running python and importing tensorflow). You can use a command similar to this:
conda create -n cis_use python=3.8 tensorflow-gpu=2.2.0 cudnn=7.6.5.32=hc0a50b0_1 keras=2.4.3 -c conda-forge
We used this command with CUDA/11.0.
Please activate the new environment and make sure that your GPU is being recognised:
conda activate cis_use
python -c "import keras; from tensorflow.python.client import device_lib; print([x.name for x in device_lib.list_local_devices()])"
If you set-up your GPU correctly, the last line generated by this command should list all your GPUs. If it does not, do not proceed with installation until fixed. Once the GPU shows up, you can proceed with the next section and omit the [cpu] suffix.
If you did not set-up your GPU, you need to create the conda environment first:
conda create -n cis_use python=3.8
conda activate cis_use
Then, install cispliceai like this (remove [cpu] suffix if you set up a GPU in the previous step):
pip install "cispliceai[cpu]"
cis-vcf [-h] [--annotation ANNOTATION] [--input INPUT] [--output OUTPUT] [--distance DISTANCE] [--batch BATCH] [--all] [--outside] [--mask] reference
positional arguments: reference path to the reference fasta file (*.fa, *.fa.gz) optional arguments: -h, --help show this help message and exit --annotation ANNOTATION, -a ANNOTATION annotation table with gene names, defaults to "grch38" (table included). You can specify "grch37" or a path to your own table of the same format. --input INPUT, -i INPUT input VCF; defaults to stdin --output OUTPUT, -o OUTPUT output VCF; defaults to stdout --distance DISTANCE, -d DISTANCE maximum distance from the variant; defaults to 1000 --batch BATCH, -b BATCH maximum input batch size in MB. Be careful to leave enough space for the model and inference process. We recommend to increase this only for GPU processing. Defaults to 10 --all annotate all affected genes/regions, not only the most significant --outside keep nucleotides outside of annotated transcript areas (defined in annotation table); by default outside nucleotides are encoded as N --mask, -m mask events to disregard gains of splice sites and losses of non-splice sites
Annotates an input.vcf
file into an output.vcf
file. Uses GrCh37 coordinates, outputs all genes affected, and masks to canonical losses or non-canonical gains.
cis-vcf -i input.vcf -o output.vcf -a grch37 --all --mask
Please download a human reference genome file (i.e. the latest primary assembly used by GENCODE: GRCh38, GRCh37). You need to unzip it first!
If you want to save space (and decrease performance!), you can use a bgzipped version of the reference genome. To do so, install samtools and biopython (conda install samtools biopython -c bioconda
), then run bgzip -c <file>.fa > <file>.fa.gz
.
By default, the module uses an annotation table of collapsed isoforms of GENCODE on GRCh38. You can change this to GRCh37 by specifying "-a grch37".
You can also provide your own annotation table. Please use a table provided within this module, i.e. the GrCh38 one, as a template. It's a CSV file with a (gene) ID, the chromosome, strand, start and end of the transcript, and comma-separated junction start end end positions. On the forward strand, junction start/end is equivalent to donor/acceptor sites, on the reverse-strand it's the other way round.
You can specify the path to VCF files here. Or, you can omit the parameters pipe them to stdin and stdout cat variants.vcf | cis-vcf <reference> > output.vcf
The algorithm predicts on your REF annotation plus some nucleotides around it. This parameter changes the number of nucleotides predicted to the left and right. (Note: An additional 5,000 nucleotides in either direction is extracted for context but not used for annotation).
This parameter is only useful when using a GPU. We don't recommend changing it for CPU usage.
Each GPU has its own memory limit, so you need to find the optimal batch size yourself. If it's too small, your GPU won't be used efficiently, if it's too high, it will run out of memory.
We suggest finding the optimal value by measuring memory consumption using nvidia-smi
, extrapolate usage to 100%, and try again. The parameter itself is a maximum value in MB, so it will work for different sequence lengths (depending on --distance
and REF/ALT length)/
By default, only the most significant annotation per ALT allele is written to VCF (i.e. the one with the highest delta position). You can add the '--all' flag to instead output all alleles.
This is an experimental feature.
By default, all nucleotides outside of the overlapping annotated region (defined by the --annotation
table) are replaced with N, except when the variant does not intercept with a region. You can instead always retain the sequences.
This parameter masks the delta score: It will remove (i.e. set to zero) all losses of non-splice sites and all gains of splice sites. Therefore only losses of existing splice sites and gains of novel splice sites are annotated.
The VCF output has an INFO column with the following annotations:
ID | Description |
---|---|
ALLELE | The ALT allele from your input |
SYMBOL | Ensembl gene ID, or strand if no gene overlaps |
DS_AG | Delta score (acceptor gain) |
DS_AL | Delta score (acceptor loss) |
DS_DG | Delta score (donor gain) |
DS_DL | Delta score (donor loss) |
DP_AG | Delta position (acceptor gain) |
DP_AL | Delta position (acceptor loss) |
DP_DG | Delta position (donor gain) |
DP_DL | Delta position (donor loss) |
There is a separate repository here comparing CI-SpliceAI with SpliceAI, MMSplice, SQUIRLS, and two MaxEntScan variants.
This tool is very similar to the SpliceAI annotation tool. These are the most important differences:
- Models are trained on collapsed GENCODE isoforms
- The final models are trained on all chromosomes; SpliceAI models are only trained on training chromosomes
- Models are optimised for inference and load/predict faster
- All overlapping genes are analysed, not only the first one
- When no genes overlap, the tool is run on the forward and backward strand
- All variants are annotated, even multi-nucleotide variants
- Better batch inference on GPU
- Masking (
--mask
) is done prior to annotation, allowing the next-highest annotation to be returned rather than capping the output to zero (see this github issue) - Potential differences when filtering GENCODE annotations. SpliceAI's data pipeline was not released and their selection process not disclosed (see this github issue)
The tool is only using one CPU/GPU max. Multi-processing is currently not supported.
Since version 1.1, you can predict the likelihood of each nucleotide being a splice site like so:
from cispliceai.model import CISpliceAI
from cispliceai.data import DNAPreprocessor
model = CISpliceAI()
# you can run multiple sequences in a batch
# you can input sequences of varying length, but it's your responsibility to ensure that the batch fits on memory!
# The sequences here are only an example, you should upload much longer slices to get sensible results
predictions = model.predict([
DNAPreprocessor.onehot_and_pad('CTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACCAGAACCGGGGGCGGCTGGCAGACAAGAGGACAGTCGCCCTGCCTGCC'),
DNAPreprocessor.onehot_and_pad('CTCCTCCTGCCCCACCAGAACCGGGGGCGGCTG'),
])
# Get acceptor and donor scores for the two sequences
acceptor_prob_1 = predictions[0][:, 1]
donor_prob_1 = predictions[0][:, 2]
acceptor_prob_2 = predictions[1][:, 1]
donor_prob_2 = predictions[1][:, 2]
pip install -e "./[cpu]"
pip install build
python -m build
Pull requests are welcome. Please submit bug reports to the git issues system.
This work is licensed under a Creative Commons Attribution 4.0 International License.