This repository is for portfolio purposes only.
For the currently maintained version, go to the Małopolskie Centrum Biotechnologii repository.
Do you have thousands of protein sequences with unknown structures, but still want to know their molecular function, biological process, cellular component and enzyme commission predicted by DeepFRI Graph Convolutional Network?
This is the right project for this task! Pipeline in a nutshell:
- Search for similar target protein sequences using MMseqs2
- Align target protein contact map to fit your query protein with unknown structure
- Run predictions on query sequence combined with aligned target contact map or sequence alone if no alignment was found
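The last step of the list above can be sketched as a simple split of the queries. This is only an illustration: the function name, the ID values, and the idea that sequence-only queries go to the CNN model (suggested by the GCN and CNN result files the pipeline produces) are assumptions, not code taken from the pipeline.

```python
from typing import Dict, List, Tuple


def split_queries_by_alignment(
    query_ids: List[str],
    alignments: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Split queries into (with_contact_map, sequence_only) groups.

    `alignments` is a hypothetical mapping: query ID -> aligned target ID.
    """
    with_contact_map = [q for q in query_ids if q in alignments]
    sequence_only = [q for q in query_ids if q not in alignments]
    return with_contact_map, sequence_only


queries = ["q1", "q2", "q3"]
found = {"q1": "1ABC_A", "q3": "2XYZ_B"}  # made-up IDs for illustration
gcn_inputs, cnn_inputs = split_queries_by_alignment(queries, found)
print(gcn_inputs)  # ['q1', 'q3']
print(cnn_inputs)  # ['q2']
```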
- Set up the Python environment
pip install .
- Install mmseqs2
sudo apt install mmseqs2
- Install boost libraries
sudo apt-get install libboost-numpy1.71 libboost-python1.71
- (optional) Edit CONFIG/FOLDER_STRUCTURE.py to customize your folder structure
nano CONFIG/FOLDER_STRUCTURE.py
- Run post_setup.py script to create the folder structure according to FOLDER_STRUCTURE.py and to download and unzip DeepFRI model weights
python post_setup.py
- Create YOUR_DATA_ROOT directory on your local machine
mkdir /YOUR_DATA_ROOT
- Docker run! -u $(id -u):$(id -g) is used to make sure all files created by the pipeline are accessible to users
docker run -it -u $(id -u):$(id -g) -v /YOUR_DATA_ROOT:/data soliareofastora/metagenomic-deepfri
- Inside docker, run post_setup.py script to create the folder structure and unzip DeepFRI model weights
python post_setup.py
- Upload structure files, for example from PDB, to STRUCTURE_FILES_PATH (paths are defined in CONFIG/FOLDER_STRUCTURE.py)
- Create target database
python update_target_mmseqs_database.py --input all
- Upload protein sequences .faa files into QUERY_PATH
- Run main_pipeline.py
python main_pipeline.py --input all
- Collect results from FINISHED_PATH
The pipeline is built around the folder structure described in CONFIG/FOLDER_STRUCTURE.py.
Multiple teams can have their separate projects as subdirectories inside STRUCTURE_FILES_PATH, QUERY_PATH, WORK_PATH and FINISHED_PATH.
You can execute main_pipeline.py or update_target_mmseqs_database.py with --project_name to easily control the files it touches.
Without --project_name, the pipeline will use default as the project name.
A task is a single run of main_pipeline.py. The task name is the timestamp taken at the beginning of the script run. Its path is WORK_PATH / project_name / timestamp.
After completion, results will be stored in FINISHED_PATH / project_name / timestamp.
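As a rough sketch of how the task paths described above fit together, assuming pathlib-style roots: the root directories and the timestamp format below are placeholders, since the real values come from CONFIG/FOLDER_STRUCTURE.py and the pipeline itself.

```python
from datetime import datetime
from pathlib import Path
from typing import Tuple

# Hypothetical roots; the real ones are defined in CONFIG/FOLDER_STRUCTURE.py.
WORK_PATH = Path("/data/workspace")
FINISHED_PATH = Path("/data/finished")


def task_paths(project_name: str, started: datetime) -> Tuple[Path, Path]:
    """Return (work_dir, finished_dir) for a task started at `started`."""
    # Assumed timestamp format; the pipeline may format it differently.
    timestamp = started.strftime("%Y-%m-%d_%H-%M-%S")
    work_dir = WORK_PATH / project_name / timestamp
    finished_dir = FINISHED_PATH / project_name / timestamp
    return work_dir, finished_dir


work_dir, finished_dir = task_paths("default", datetime(2022, 1, 31, 12, 0, 0))
print(work_dir.as_posix())
print(finished_dir.as_posix())
```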
Target database creation with update_target_mmseqs_database.py works in a similar fashion: it appends new structures to MMSEQS_DATABASES_PATH / project_name, creating a new timestamp folder.
The pipeline will use the most recently created database.
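One way to pick the most recently created database is to take the timestamp folder whose name sorts last. The sketch below assumes folder names that sort chronologically (for example YYYY-MM-DD_HH-MM-SS); the function name and the folder names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def newest_database_dir(project_db_root: Path) -> Optional[Path]:
    """Return the timestamp folder with the greatest name, or None."""
    if not project_db_root.is_dir():
        return None
    candidates = [p for p in project_db_root.iterdir() if p.is_dir()]
    # Assumes timestamp names sort lexicographically in chronological order.
    return max(candidates, key=lambda p: p.name, default=None)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "default"
    for stamp in ("2022-01-05_10-00-00", "2022-03-01_09-30-00", "2022-02-11_18-45-00"):
        (root / stamp).mkdir(parents=True)
    newest_name = newest_database_dir(root).name

print(newest_name)  # 2022-03-01_09-30-00
```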
TODO: add a feature to use a specific target database timestamp instead of the name plus the newest timestamp.
When running main_pipeline.py with a new project name, the current state of CONFIG/RUNTIME_PARAMETERS.py will be saved in WORK_PATH / project_name / project_config.json and will be used in all upcoming tasks in this project.
update_target_mmseqs_database.py works similarly: it will store MAX_TARGET_CHAIN_LENGTH inside target_db_config.json.
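The "save once, reuse afterwards" behaviour could look roughly like the following. The function name and the parameter values are assumptions made for illustration; only the file name project_config.json and the MAX_TARGET_CHAIN_LENGTH key come from this README.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_project_config(project_dir: Path, current_params: dict) -> dict:
    """Reuse a stored config if present, otherwise snapshot current_params."""
    config_file = project_dir / "project_config.json"
    if config_file.exists():
        # Existing project: later parameter changes are ignored.
        return json.loads(config_file.read_text())
    project_dir.mkdir(parents=True, exist_ok=True)
    config_file.write_text(json.dumps(current_params, indent=4))
    return current_params


with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp) / "my_project"
    first = load_or_create_project_config(project_dir, {"MAX_TARGET_CHAIN_LENGTH": 1000})
    second = load_or_create_project_config(project_dir, {"MAX_TARGET_CHAIN_LENGTH": 2500})

print(second)  # {'MAX_TARGET_CHAIN_LENGTH': 1000}
```

The second call returns the values stored by the first one, which mirrors how a project keeps its original parameters across tasks.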
- Upload structure files to STRUCTURE_FILES_PATH / your_project_name.
- Run update_target_mmseqs_database.py script.
python update_target_mmseqs_database.py --project_name your_project_name
The main feature of this project is its ability to generate a query contact map on the fly, using the results of an mmseqs2 search of the target database for similar protein sequences with known structures.
Later, in metagenomic_deepfri.py, contact map alignment is performed so that the result can be used as input to the DeepFRI GCN (implemented in CPP_lib/load_contact_maps.h).
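To make the idea of contact map alignment more concrete, here is a minimal sketch in pure Python. It is not the algorithm from CPP_lib/load_contact_maps.h; it assumes a gapped pairwise alignment given as two equal-length strings and simply copies target contacts to the aligned query positions, leaving unaligned query residues with self-contacts only. All names are hypothetical.

```python
from typing import Dict, List


def project_contact_map(
    aligned_query: str,
    aligned_target: str,
    target_contacts: List[List[int]],
) -> List[List[int]]:
    """Project a target contact map onto the query through an alignment."""
    query_to_target: Dict[int, int] = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(aligned_query, aligned_target):
        if q_char != "-" and t_char != "-":
            query_to_target[q_idx] = t_idx  # aligned residue pair
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1

    n = q_idx  # number of query residues
    # Start from self-contacts only (assumption for unaligned residues).
    query_contacts = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, ti in query_to_target.items():
        for j, tj in query_to_target.items():
            if target_contacts[ti][tj]:
                query_contacts[i][j] = 1
    return query_contacts


target_map = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
query_map = project_contact_map("AC-D", "A-GD", target_map)
print(query_map)  # [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
```

In this toy alignment the first and last residues of both sequences are paired, so the contact between them is carried over, while the unaligned middle query residue keeps only its self-contact.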
The update_target_mmseqs_database.py script will search for structure files, process them and store protein chain sequences and atom positions inside SEQ_ATOMS_DATASET_PATH / project_name.
It will also create an mmseqs2 database in MMSEQS_DATABASES_PATH / project_name.
This operation will append new structures to existing ones.
You can also use the --input DIR_1 FILE_2 ... argument list to parse structures from multiple sources.
Paths can be absolute or relative to STRUCTURE_FILES_PATH.
Use --input . to parse all structure files inside STRUCTURE_FILES_PATH.
Accepted formats are .pdb, .cif and .ent, both raw and compressed (.gz).
To add another structure file format, edit STRUCTURE_FILES_PARSERS inside update_target_mmseqs_database.py.
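A file filter matching the accepted formats might look like the sketch below. It is an illustration only: the function names are made up and the real script dispatches through STRUCTURE_FILES_PARSERS, whose exact shape is not shown in this README.

```python
from pathlib import Path
from typing import List

STRUCTURE_EXTENSIONS = (".pdb", ".cif", ".ent")


def is_structure_file(path: Path) -> bool:
    """True for .pdb/.cif/.ent files, optionally compressed with .gz."""
    name = path.name.lower()
    if name.endswith(".gz"):
        name = name[: -len(".gz")]  # look at the extension under .gz
    return name.endswith(STRUCTURE_EXTENSIONS)


def find_structure_files(root: Path) -> List[Path]:
    """Recursively collect structure files below root, sorted by path."""
    return sorted(p for p in root.rglob("*") if p.is_file() and is_structure_file(p))


print(is_structure_file(Path("1abc.pdb.gz")))  # True
print(is_structure_file(Path("notes.txt")))    # False
```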
target_db_config.json contains MAX_TARGET_CHAIN_LENGTH.
This value is copied from CONFIG/RUNTIME_PARAMETERS.py when a new target database is created.
Protein ID is used as a filename. A new protein whose ID already exists in the database will be skipped.
Use the --overwrite flag to overwrite existing sequences and atom positions.
Also use this flag if you want to apply changes to MAX_TARGET_CHAIN_LENGTH inside target_db_config.json.
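The skip-or-overwrite rule can be summarized in a few lines. This is a sketch under the assumption that protein IDs are plain strings; the function name and the IDs are hypothetical.

```python
from typing import Iterable, List


def select_structures_to_process(
    new_ids: Iterable[str],
    existing_ids: Iterable[str],
    overwrite: bool = False,
) -> List[str]:
    """Return the protein IDs that should be (re)processed."""
    if overwrite:
        return list(new_ids)  # --overwrite: process everything again
    existing = set(existing_ids)
    return [pid for pid in new_ids if pid not in existing]


incoming = ["1ABC_A", "2XYZ_B", "3DEF_C"]
stored = ["2XYZ_B"]
print(select_structures_to_process(incoming, stored))
# ['1ABC_A', '3DEF_C']
print(select_structures_to_process(incoming, stored, overwrite=True))
# ['1ABC_A', '2XYZ_B', '3DEF_C']
```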
- Upload .faa files into QUERY_PATH / your_project_name (the default project_name is default)
- Run main_pipeline.py
python main_pipeline.py --project_name your_project_name
- Upon completion, collect results from FINISHED_PATH / your_project_name / timestamp
The pipeline will attempt to use the target database named project_name. If it is missing, the default target database will be used instead.
To use another target database, pass its name (the project_name used during database creation) in --target_db_name.
You can use the --input DIR_1 FILE_2 ... argument list to process query .faa files from multiple sources.
Paths can be absolute or relative to QUERY_PATH.
Use --input . to process all query .faa files inside QUERY_PATH.
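Resolving a mixed list of absolute and relative --input values could be sketched as follows. The base directory and the function name are placeholders; the real QUERY_PATH is defined in CONFIG/FOLDER_STRUCTURE.py.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical location; the real value comes from CONFIG/FOLDER_STRUCTURE.py.
QUERY_PATH = Path("/data/query")


def resolve_input_paths(inputs: Iterable[str], base_dir: Path) -> List[Path]:
    """Keep absolute paths as they are; resolve relative ones against base_dir."""
    resolved = []
    for item in inputs:
        path = Path(item)
        resolved.append(path if path.is_absolute() else base_dir / path)
    return resolved


paths = resolve_input_paths(["batch_1", "/tmp/extra/sample.faa", "."], QUERY_PATH)
print([p.as_posix() for p in paths])
```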
--delete_query deletes the source query files from the input paths after they are copied to the project workspace.
--n_parallel_jobs will divide query protein sequences evenly across all jobs.
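An even split across jobs can be done with divmod so that chunk sizes differ by at most one. This is a generic sketch with a hypothetical function name, not the pipeline's own splitting code.

```python
from typing import List, Sequence


def split_evenly(items: Sequence, n_jobs: int) -> List[list]:
    """Split items into at most n_jobs chunks whose sizes differ by <= 1."""
    n_jobs = max(1, min(n_jobs, len(items)))  # never more jobs than items
    base, extra = divmod(len(items), n_jobs)
    chunks, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # first `extra` jobs get one more
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


print(split_evenly(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```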
The finished folder FINISHED_PATH / project_name / timestamp will contain:
- query_files/*: directory containing all input query files
- mmseqs2_search_results.m8
- alignments.json: results of the alignment search implemented in utils/search_alignments.py
- metadata*: files with some useful info
- results*: multiple files from DeepFRI, organized by model type ['GCN' / 'CNN'] and mode ['mf', 'bp', 'cc', 'ec'] for a total of 8 files. Sometimes results from one model can be missing, which means that either all query protein sequences were aligned or none of them were.
mf = molecular_function, bp = biological_process, cc = cellular_component, ec = enzyme_commission
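If you want to inspect mmseqs2_search_results.m8 yourself, a small reader like the one below may help. It assumes the file uses the default MMseqs2 tab-separated layout with 12 columns; the pipeline may write a different column set, and picking the best hit by bit score is an assumption made for this example. The sample rows are invented.

```python
import csv
import io
from typing import Dict, Iterable, List

# Default MMseqs2 tabular columns (assumed to apply to this file).
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def read_m8(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated rows into dicts keyed by column name."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(M8_COLUMNS, row)) for row in reader if row]


def best_hit_per_query(hits: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep the hit with the highest bit score for every query."""
    best: Dict[str, Dict[str, str]] = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or float(hit["bits"]) > float(current["bits"]):
            best[hit["query"]] = hit
    return best


sample = io.StringIO(
    "q1\t1ABC_A\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "q1\t2XYZ_B\t0.60\t90\t36\t1\t5\t94\t3\t92\t1e-10\t80\n"
    "q2\t2XYZ_B\t0.70\t50\t15\t0\t1\t50\t1\t50\t1e-20\t120\n"
)
best = best_hit_per_query(read_m8(sample))
print({query: hit["target"] for query, hit in best.items()})
# {'q1': '1ABC_A', 'q2': '2XYZ_B'}
```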
If you have a suggestion that would make this project better, email me or fork the repo and create a pull request.
- main_pipeline.py: add the possibility to use a specific target_database path and timestamp instead of the name only
- utils/search_alignments.py: make some runtime tests; maybe chunked sequences will perform better with pathos.multiprocessing
- update_target_mmseqs_database.py: add a max_target_chain_length argument and inform the user if there is a difference between this argument and the existing target_db_config.json
- update_target_mmseqs_database.py: when adding already processed structures to another project, check whether they already exist somewhere
Piotr Kucharski - [email protected]