Research Group Bioinformatics and Computational Biology @ University of Vienna
Software project for Bioinformatics by Fabio Pfaehler and Klaus Hartmann-Baruffi
Supervisors: Prof. Dr. Thomas Rattei, Dr. Alexander Pfundner
Resources used:
- VOGDB (Virus Orthologous Groups), a database of the University of Vienna, Dept. of Microbiology and Ecosystems:
  - vog.members.tsv.gz
  - vog.faa.tar.gz
- A small subset (VOG00024), used as a proof of concept (PoC) for applying machine-learning techniques to the dataset for VOG classification
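Pulling a single VOG out of the sequence archive can be sketched as follows. This is a minimal illustration rather than the project's own code: it assumes that vog.faa.tar.gz holds one FASTA file per VOG named `<VOG id>.faa` (consistent with the VOG00024.faa file used here), and the helper names `read_fasta` and `extract_vog` are hypothetical.

```python
import io
import os
import tarfile


def read_fasta(handle):
    """Parse FASTA text from a text-mode file handle into {header: sequence}."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def extract_vog(archive, vog_id):
    """Return the sequences of one VOG from a gzipped tar archive of FASTA files.

    `archive` may be a path or a binary file object.
    Assumption: the member is named '<vog_id>.faa', possibly inside a subdirectory.
    """
    if isinstance(archive, (str, os.PathLike)):
        kwargs = {"name": archive}
    else:
        kwargs = {"fileobj": archive}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        for member in tar:
            if os.path.basename(member.name) == f"{vog_id}.faa":
                raw = tar.extractfile(member)
                return read_fasta(io.TextIOWrapper(raw, encoding="utf-8"))
    raise KeyError(f"{vog_id}.faa not found in archive")
```

With the archive downloaded into the working directory, `extract_vog("vog.faa.tar.gz", "VOG00024")` would return a dictionary mapping each FASTA header to its protein sequence, provided the naming assumption holds.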
Project aims:
- Get a small subset of VOGs and protein sequences
- Compute an embedding for each sequence
- Train a scikit-learn classifier
- Use a workflow management tool (Nextflow, Snakemake) to create an ML pipeline
- Use a version control system, e.g. Git :-)
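The classifier aim could look roughly like the sketch below. It is not the project's implementation: the choice of `LogisticRegression`, the function name `train_vog_classifier`, the two labels `VOG_A` and `VOG_B`, and the random vectors standing in for real embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_vog_classifier(embeddings, labels, seed=0):
    """Fit a classifier on per-protein embeddings and report held-out accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)
    return model, accuracy_score(y_test, model.predict(x_test))


# Synthetic stand-in data: 40 "proteins" from two hypothetical VOGs, 16 dimensions each.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 1.0, (20, 16)), rng.normal(3.0, 1.0, (20, 16))])
labels = np.array(["VOG_A"] * 20 + ["VOG_B"] * 20)

model, accuracy = train_vog_classifier(embeddings, labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

The stratified split keeps both classes present in the test set, which matters when only a few sequences per VOG are available.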
The project includes:
- General-purpose Python embedders based on open models that learn representations of biological sequences (SeqVec, ProtTrans, BioEmbeddings, ...)
- A pipeline which:
  - embeds sequences into vector representations that can be used to train an ML model
  - reduces dimensionality for representation and visualisation using t-SNE (optionally UMAP)
  - visualises the results as interactive 3D plots
- Installation of a JRE (Java Runtime Environment), at least version 11, for Nextflow:
    sudo apt install default-jre
- Creation of a Python environment using venv or Conda:
    conda create --name myenv python=3.8
    conda activate myenv
- Installation of Nextflow (https://github.com/nextflow-io/nextflow):
    pip install nextflow
  (installed nextflow-23.10.1)
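Since Nextflow needs Java 11 or newer, it can be useful to check the installed version before running the pipeline. The following is a small sketch using only the standard library; the function names `parse_java_major` and `installed_java_major` are hypothetical, and the parsing assumes the usual `version "..."` text that `java -version` prints.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` on PATH, or None if unavailable."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` writes to stderr, so inspect both streams.
    return parse_java_major(result.stderr + result.stdout)
```

A result of `None` or a number below 11 from `installed_java_major()` suggests that the JRE step above still needs to be completed.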
This program was developed for Linux machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA).
For Windows users, we strongly recommend the use of Windows Subsystem for Linux.
We found that the frameworks used (ProtTrans, SeqVec and bio_embeddings) depend heavily on the Python version and on the versions of required libraries such as torch, allennlp and h5py.
We strongly recommend running the program in a CUDA-capable environment.
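Whether the current environment is CUDA-capable can be checked from Python. This is a minimal sketch, assuming PyTorch is the library that provides GPU access; the function name `pick_device` is hypothetical, and the function falls back to the CPU if torch is not installed.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"embeddings would be computed on: {pick_device()}")
```

If this prints `cpu`, the embedding step should still work but is likely to be considerably slower.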
Benchmarks comparing the three embedding frameworks are reported in the paper "Survey of Protein Sequence Embedding Models".
We used the bio_embeddings library with the following embedders:
- ProtTransBertBFDEmbedder()
- SeqVecEmbedder()
As a proof of concept (PoC), and for runtime reasons, we applied them only to the VOG00024.faa sequence file from the VOGDB FASTA files.
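Embedding the sequences with bio_embeddings can be sketched as below. The `embed` and `reduce_per_protein` methods belong to the bio_embeddings embedder interface; the helper names `load_embedder` and `embed_proteins` and the keys `"seqvec"` and `"prottrans_bert_bfd"` are hypothetical and only serve the illustration.

```python
def load_embedder(name="seqvec"):
    """Instantiate one of the two embedders named above.

    The import is done lazily because bio_embeddings pulls in heavy
    dependencies and may download model weights on first use.
    """
    from bio_embeddings.embed import ProtTransBertBFDEmbedder, SeqVecEmbedder

    embedders = {
        "seqvec": SeqVecEmbedder,
        "prottrans_bert_bfd": ProtTransBertBFDEmbedder,
    }
    return embedders[name]()


def embed_proteins(sequences, embedder):
    """Map {protein_id: sequence} to {protein_id: fixed-length per-protein vector}."""
    vectors = {}
    for protein_id, sequence in sequences.items():
        # One vector per residue, so the shape depends on the sequence length.
        per_residue = embedder.embed(sequence)
        # Collapse the residue axis so every protein gets a vector of equal size.
        vectors[protein_id] = embedder.reduce_per_protein(per_residue)
    return vectors
```

A call such as `embed_proteins(sequences, load_embedder("seqvec"))` would then yield one vector per protein, which is the input format a scikit-learn classifier expects.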
We used two different algorithms for dimensionality reduction in order to visualise the embeddings:
- t-SNE
- UMAP (optional)
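The t-SNE step can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the function name `reduce_to_3d`, the perplexity value, and the random vectors standing in for real embeddings are made up for the example. Three output dimensions are chosen to match the 3D plots described above.

```python
import numpy as np
from sklearn.manifold import TSNE


def reduce_to_3d(embeddings, perplexity=5.0, seed=0):
    """Project an (n_samples, n_features) array to three dimensions with t-SNE."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if perplexity >= len(embeddings):
        raise ValueError("perplexity must be smaller than the number of samples")
    tsne = TSNE(n_components=3, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(embeddings)


# Synthetic stand-in for real embeddings: 30 vectors with 64 features each.
rng = np.random.default_rng(0)
coords = reduce_to_3d(rng.normal(size=(30, 64)))
print(coords.shape)  # (30, 3)
```

Fixing `random_state` makes the layout reproducible between runs, which helps when comparing plots for different embedders.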
For a detailed view of the individual steps, take a look at the project's Jupyter notebook.
- Alexander Pfundner
- Fabio Pfaehler
- Klaus Hartmann-Baruffi