Softwareproject Bioinformatics - Sequence Embedding for Shallow Learners

Research Group Bioinformatics and Computational Biology @ University of Vienna

Software project for Bioinformatics by Fabio Pfaehler and Klaus Hartmann-Baruffi

Supervisors: Prof. Dr. Thomas Rattei, Dr. Alexander Pfundner

Resources used:

BioEmbeddings
ProtTrans
SeqVec
VOGDB (Virus Orthologous Groups), a database of the University of Vienna, Dept. of Microbiology and Ecosystems:
- vog.members.tsv.gz
- vog.faa.tar.gz
We used a small subset (VOG00024) as a proof of concept (PoC) for applying machine learning techniques to VOG classification.
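
As an illustration of how a single VOG FASTA file could be read before embedding, here is a minimal sketch using only the standard library. The function name `parse_fasta` and the two example records are hypothetical and are not taken from the project's scripts or from VOGDB.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


# Made-up records for illustration only.
example = """>protA some description
MKTAYIAK
QRQISFVK
>protB
MSDNE
""".splitlines()

sequences = parse_fasta(example)
print(sequences)
```

For a real file, the same function could be called with an open file handle, for example `parse_fasta(open("VOG00024.faa"))`, assuming the file has been extracted from vog.faa.tar.gz.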

Project aims:

Get a small subset of VOGs and their protein sequences
Compute an embedding for each sequence
Train a scikit-learn classifier
Use a workflow management tool (Nextflow, Snakemake) to build an ML pipeline
Use a version control system, e.g. Git :-)

The project includes:

General-purpose Python embedders based on open models trained on biological sequence representations (SeqVec, ProtTrans, BioEmbeddings, ...)
A pipeline which:
- embeds sequences into vector representations that can be used to train an ML model
- reduces dimensionality for representation and visualisation using t-SNE (optionally UMAP)
- visualises the results as interactive 3D plots
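
To make the "train an ML model on vector representations" step concrete, the following is a minimal sketch of fitting a scikit-learn classifier on per-protein vectors. Everything here is an assumption for illustration: the data is synthetic (random vectors standing in for embeddings of two groups), and logistic regression is just one possible shallow learner, not necessarily the one used in shallowlearners.py.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-protein embeddings:
# 2 groups, 40 proteins each, 1024 dimensions per protein.
n_per_class, dim = 40, 1024
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_per_class, dim)),
    rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)  # hypothetical group labels

# Hold out a quarter of the proteins, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

With real data, `X` would hold one embedding vector per protein and `y` the VOG identifier of each protein; the accuracy printed here says nothing about performance on VOGDB.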

Installation

Install a JRE (Java Runtime Environment), version 11 or later, which Nextflow requires:
sudo apt install default-jre
Create a Python environment using venv or Conda:
conda create --name myenv python=3.8
conda activate myenv
Install Nextflow (https://github.com/nextflow-io/nextflow):
pip install nextflow
(this installed nextflow-23.10.1 in our setup)

Installation notes

This program was developed for Linux machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA).
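
A quick way to see which case applies on a given machine is to ask PyTorch whether it can see a CUDA device. This is a small sketch, not part of the project's code; the helper name `pick_device` is hypothetical, and it falls back to the CPU if torch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Without torch there is no GPU support to speak of.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embeddings would be computed on: {device}")
```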

For Windows users, we strongly recommend the use of Windows Subsystem for Linux.

Dependencies

The frameworks used (ProtTrans, SeqVec and bio_embeddings) depend heavily on the Python version and on the versions of required libraries such as torch, allennlp and h5py.

It is strongly recommended to run the program in a CUDA-capable environment.

Benchmarks comparing the three embedding frameworks are reported in the paper "Survey of Protein Sequence Embedding Models".

What model is a good one?

We used the bio_embeddings library with the following embedders:

ProtTransBertBFDEmbedder()
SeqVecEmbedder()

As a proof of concept (PoC), and for runtime reasons, we applied them only to the VOG00024.faa file from the VOGDB FASTA files.
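
Protein language models of this kind produce one vector per residue, while a shallow learner needs one fixed-length vector per protein. A common way to bridge the two is to average over the residue axis. The sketch below shows that idea with NumPy on a tiny made-up array; mean pooling and the helper name `mean_pool` are assumptions for illustration and may differ from what the project's scripts do.

```python
import numpy as np


def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a (sequence_length, embedding_dim) array over residues."""
    if per_residue.ndim != 2:
        raise ValueError("expected shape (sequence_length, embedding_dim)")
    return per_residue.mean(axis=0)


# Toy "embedding" of a 3-residue sequence with 4 dimensions per residue.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
])

protein_vector = mean_pool(toy)
print(protein_vector)  # one vector of length 4, independent of sequence length
```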

Dimensionality reduction

We used the following algorithms for dimensionality reduction in order to visualise the embeddings:

t-SNE
UMAP (optional)
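
As a rough sketch of the t-SNE step, the following reduces a set of vectors to three dimensions with scikit-learn, which is the shape needed for a 3D plot. The input is random data standing in for real embeddings, and the parameter values (perplexity, PCA initialisation, fixed seed) are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Random stand-in for 60 per-protein embeddings with 128 dimensions each.
embeddings = rng.normal(size=(60, 128))

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=3, perplexity=15, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)  # one (x, y, z) point per protein
```

The resulting `coords` array can be passed to any 3D scatter plotting tool, with points coloured by VOG label.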

For a detailed view of the steps of the process, see the project's Jupyter notebook.

Contributors

Alexander Pfunder
Fabio Pfaehler
Klaus Hartmann-Baruffi
