Frequency Chaos Game Representation with Deep Learning
A web app with all the trained models is available; you only need to upload a FASTA file with your sequences.
- Sequences and metadata must be downloaded from GISAID after creating an account and accepting the Terms of Use.
- The reference sequence can be downloaded from here.
- The list of variant markers for each clade is saved in mutations_reference.json and can be found here.
Before running the Snakemake file, make sure to add the paths to these files to parameters.yaml
PATH_FASTA_GISAID: "path/to/sequences.fasta"
PATH_METADATA: "path/to/metadata.tsv"
PATH_REFERENCE_GENOME: "path/to/reference.fasta"
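A quick check that these three paths point to existing files can save a failed run. The following is a minimal sketch, not part of the repository: the helper names `load_parameters` and `missing_inputs` are hypothetical, and reading the YAML file assumes PyYAML is available in the environment.

```python
from pathlib import Path

# Keys taken from the parameters.yaml snippet above.
REQUIRED_PATH_KEYS = ("PATH_FASTA_GISAID", "PATH_METADATA", "PATH_REFERENCE_GENOME")


def load_parameters(path="parameters.yaml"):
    """Read parameters.yaml into a dict (hypothetical helper, assumes PyYAML)."""
    import yaml  # imported lazily so missing_inputs() has no third-party dependency

    with open(path) as handle:
        return yaml.safe_load(handle)


def missing_inputs(params):
    """Return the required keys that are unset or do not point to an existing file."""
    missing = []
    for key in REQUIRED_PATH_KEYS:
        value = params.get(key)
        if not value or not Path(value).is_file():
            missing.append(key)
    return missing
```

Calling `missing_inputs(load_parameters())` from the repository root returns an empty list when all three input files are in place.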
Create a virtual environment and install packages
python -m venv env
source env/bin/activate
pip install -r requirements.txt
Set parameters for the experiment in parameters.yaml
- See (and include) preprocessing functions in preprocessing.py
Run the pipeline
snakemake -p -c1
To visualize a DAG of the rules
snakemake --forceall --dag | dot -Tpdf > dag.pdf
The Snakefile runs the scripts in this order:
- undersample_sequences.py
- extract_sequences.py (extracts each undersampled sequence into an individual FASTA file)
- fasta2fcgr (generates an npy file with the $k$-th order FCGR for each sequence extracted in the previous step)
- split_data.py (creates a file datasets.json with the train, validation and test sets)
- train.py (trains the model for the selected $k$-mer)
- test.py
- classification_metrics.py (computes accuracy, precision, recall and F1-score)
- clustering_metrics.py (computes the Silhouette score, Calinski-Harabasz score and Generalized Discrimination Value on the test set)
- plots.py (plots accuracy and loss for the training and validation sets, and the confusion matrix for the test set)
- saliency_map.py and shap_values.py (feature importance methods)
- svm_experiment.py (trains an SVM using subsets of relevant k-mers chosen by the feature importance methods)
- match_relevant_kmers.py (matches the relevant k-mers chosen by the feature importance methods to the list of variant markers for each clade)
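To make the fasta2fcgr step concrete, here is a minimal pure-Python sketch of how a $k$-th order FCGR can be computed: every k-mer of the sequence is counted in one cell of a $2^k \times 2^k$ grid. This is an illustration under stated assumptions, not the repository's implementation. The corner assignment in `CORNER_BITS`, the grid orientation, and the choice to skip k-mers containing ambiguous bases are all assumptions, and the function names are hypothetical.

```python
# Assumed corner assignment as (x_bit, y_bit); other implementations use other layouts.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in the 2^k x 2^k grid."""
    row = col = 0
    for i, base in enumerate(kmer):
        x_bit, y_bit = CORNER_BITS[base]
        # The last base of the k-mer sets the most significant bit, i.e. the
        # coarsest quadrant, as in the chaos game iteration.
        col |= x_bit << i
        row |= y_bit << i
    return row, col


def fcgr(sequence, k):
    """Count every k-mer of `sequence` into a 2^k x 2^k matrix (list of lists)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNER_BITS for base in kmer):
            continue  # assumption: skip k-mers with ambiguous bases such as N
        row, col = kmer_to_cell(kmer)
        matrix[row][col] += 1
    return matrix
```

For $k = 6$ this gives a 64 x 64 matrix per sequence. Converting the result with `numpy.array` and writing it with `numpy.save` would produce an npy file of the kind the pipeline stores for each sequence.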
A folder data/ will be created to save all intermediate results:
data/
├── fcgr-6-mer
├── hCoV-19
├── matches
├── plots
├── saliency_map
├── shap_values
├── svm
├── test
└── train