This repository is the official implementation of Pinal: Toward De Novo Protein Design from Natural Language.
Quickly try our online server (16B) here
If you have any questions about the paper or the code, feel free to raise an issue!
Create and activate a new conda environment with Python 3.8.
conda create -n pinal python=3.8 --yes
conda activate pinal
pip install -r requirements.txt
We provide a script to download the pre-trained model weights, as shown below. Please download all files and put them in the `weights` directory, e.g., `weights/Pinal/...`
huggingface-cli download westlake-repl/Pinal \
--repo-type model \
--local-dir weights/
The `weights` directory contains 3 models:
Name | Size |
---|---|
SaProt-T | 760M |
T2struc-1.2B | 1.2B |
T2struc-15B | 15B |
Design protein from natural language instruction with only 3 lines of code!
from utils.design_utils import load_pinal, PinalDesign
load_pinal()
res = PinalDesign(desc="Actin.", num=10)
# res is a list of designed proteins, sorted by the probability per token.
The above code generates 10 de novo designed proteins conditioned on the input description "Actin.", using T2struc-1.2B and SaProt-T for inference. If you want to run inference with T2struc-15B instead, set the environment variable `T2struc_NAME` before calling `load_pinal()`, as shown below.
import os
os.environ["T2struc_NAME"] = "T2struc-15B"
Warning: Running inference with T2struc-15B requires at least 40GB of GPU memory.
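The returned list is sorted by probability per token. A minimal, self-contained sketch of that ranking criterion, assuming each candidate carries its per-token log-probabilities (the function name and data layout are illustrative, not PinalDesign's actual return type):

```python
def rank_by_per_token_logprob(candidates):
    """Sort (sequence, per-token log-probabilities) pairs by mean
    log-probability per token, best first. Averaging over length keeps
    short and long designs comparable."""
    return sorted(candidates,
                  key=lambda item: sum(item[1]) / len(item[1]),
                  reverse=True)

designs = [("MKTAY", [-0.2, -0.1, -0.4, -0.3, -0.2]),   # mean -0.24
           ("MASQF", [-0.9, -0.8, -1.1, -0.7, -1.0])]   # mean -0.90
best = rank_by_per_token_logprob(designs)[0][0]
print(best)  # "MKTAY" -- higher mean log-probability per token
```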
Here, we provide a script for predicting amino acid sequences from a natural language description, while also letting you specify the desired structure (as Foldseek tokens).
from utils.design_utils import SaProtPrepareGenerationInputs, SaProtGeneration, load_SaProtT_and_tokenizers
desc = "Actin."
saprot, saprot_text_tokenizer, saprot_tokenizer = load_SaProtT_and_tokenizers()
structure = "dqdppafakewedfqfwifidtfpdqggqdifgqkkwafpdpppcvppdddridgtvrrvvvvvgtdmdgqdalqagpdpvsvlvvvvcvdcprvnhqqlnheyeyegaapydlvrllsvvccscpvsvhqwyayaylqlllcvlvvdqfawefaaalqwtkiwggdnsdtdnqlididrdhnvlllvllqvvvvvvvdhqddpnssvvssvcqlpqaaadldlvvqvvclvvdqpskdwdqdpvrdididtssrhvslccqcvvvsvvdpdhhslvsnvsslvsddpvrslvhqchyeyaysrvqhhcpqsnsqvsncvvddvphdgdydydnvrncssvssvsplspdpvnpvlidgsvncvvppssvnvvrhd"
SaProtInputDict = SaProtPrepareGenerationInputs([" ".join(list(structure))], desc, saprot_text_tokenizer, saprot_tokenizer)
seq = SaProtGeneration(saprot, SaProtInputDict, saprot_tokenizer)["sequence"]
print(seq)
The above code makes predictions based on Foldseek tokens. To convert a 3D structure file (e.g., `.pdb` or `.mmcif`) into Foldseek tokens, download the binary file from here and place it in the `assets/bin` folder. The following code demonstrates how to use it.
from utils.foldseek_utils import get_struc_seq
pdb_path = "assets/8ac8.cif"
# Extract the "A" chain from the structure file and encode it as a Foldseek struc_seq
foldseek_seq = get_struc_seq("assets/bin/foldseek", pdb_path, ["A"])["A"][1].lower()
print(f"foldseek_seq: {foldseek_seq}")
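The lowercased Foldseek string can then be fed back into the sequence-generation example above, which expects the structure tokens space-separated for the tokenizer. A trivial, self-contained sketch of that preprocessing step (the convention is inferred from the `" ".join(list(structure))` call earlier):

```python
def to_tokenizer_input(foldseek_seq: str) -> str:
    """Lowercase a Foldseek (3Di) string and space-separate its
    single-character tokens so the tokenizer treats each one individually."""
    return " ".join(foldseek_seq.lower())

print(to_tokenizer_input("DQD"))  # "d q d"
```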
For textual alignment, we recommend using ProTrek to calculate the sequence-text similarity score.
For foldability, we recommend using pLDDT and PAE, as output by the AlphaFold series or ESMFold.
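As a quick foldability check, the per-residue pLDDT can be read straight from a predicted structure: ESMFold and the AlphaFold series write it into the B-factor column of their output PDB files. A minimal, self-contained sketch (a helper we introduce for illustration, not part of this repository):

```python
def mean_plddt_from_pdb(pdb_text: str) -> float:
    """Average the B-factor column (PDB columns 61-66) over CA atoms.
    ESMFold and AlphaFold store per-residue pLDDT there, so the mean
    serves as a simple foldability proxy."""
    scores = [float(line[60:66])
              for line in pdb_text.splitlines()
              if line.startswith("ATOM") and line[12:16].strip() == "CA"]
    return sum(scores) / len(scores) if scores else 0.0
```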
- ProTrek and its online server
- Evola and its online server
- SaprotHub and its online server