This repository is the official implementation of Pinal: Toward De Novo Protein Design from Natural Language.
Quickly try our online server (16B) here
If you have any questions about the paper or the code, feel free to raise an issue!
Create and activate a new conda environment with Python 3.8.
conda create -n pinal python=3.8 --yes
conda activate pinal
pip install -r requirements.txt
We provide a script to download the pre-trained model weights, as shown below. Please download all files and put them in the `weights` directory, e.g., `weights/Pinal/...`
huggingface-cli download westlake-repl/Pinal \
--repo-type model \
--local-dir weights/
The `weights` directory contains 3 models:
Name | Size |
---|---|
SaProt-T | 760M |
T2struc-1.2B | 1.2B |
T2struc-15B | 15B |
Design protein from natural language instruction with only 3 lines of code!
from utils.design_utils import load_pinal, PinalDesign
load_pinal()
res = PinalDesign(desc="Actin.", num=10)
# res is a list of designed proteins, sorted by the probability per token.
The above code generates 10 de novo designed proteins conditioned on the input description "Actin.", using T2struc-1.2B and SaProt-T for inference. If you want to run inference with T2struc-15B instead, set the environment variable `T2struc_NAME` before calling `load_pinal()`, as shown below.
import os
os.environ["T2struc_NAME"] = "T2struc-15B"
Warning: Running inference with T2struc-15B requires at least 40GB of GPU memory.
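The returned list is sorted by probability per token. A minimal, self-contained sketch of that ranking criterion, assuming each candidate carries its per-token log-probabilities (the function name and data layout are illustrative, not PinalDesign's actual return type):

```python
def rank_by_per_token_logprob(candidates):
    """Sort (sequence, per-token log-probabilities) pairs by mean
    log-probability per token, best first. Averaging over length keeps
    short and long designs comparable."""
    return sorted(candidates,
                  key=lambda item: sum(item[1]) / len(item[1]),
                  reverse=True)

designs = [("MKTAY", [-0.2, -0.1, -0.4, -0.3, -0.2]),   # mean -0.24
           ("MASQF", [-0.9, -0.8, -1.1, -0.7, -1.0])]   # mean -0.90
best = rank_by_per_token_logprob(designs)[0][0]
print(best)  # "MKTAY" -- higher mean log-probability per token
```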
Here, we provide a script for predicting amino acid sequences from a natural language description, while also letting you specify the desired structure (as Foldseek tokens).
from utils.design_utils import SaProtPrepareGenerationInputs, SaProtGeneration, load_SaProtT_and_tokenizers
desc = "Actin."
saprot, saprot_text_tokenizer, saprot_tokenizer = load_SaProtT_and_tokenizers()
structure = "dqdppafakewedfqfwifidtfpdqggqdifgqkkwafpdpppcvppdddridgtvrrvvvvvgtdmdgqdalqagpdpvsvlvvvvcvdcprvnhqqlnheyeyegaapydlvrllsvvccscpvsvhqwyayaylqlllcvlvvdqfawefaaalqwtkiwggdnsdtdnqlididrdhnvlllvllqvvvvvvvdhqddpnssvvssvcqlpqaaadldlvvqvvclvvdqpskdwdqdpvrdididtssrhvslccqcvvvsvvdpdhhslvsnvsslvsddpvrslvhqchyeyaysrvqhhcpqsnsqvsncvvddvphdgdydydnvrncssvssvsplspdpvnpvlidgsvncvvppssvnvvrhd"
SaProtInputDict = SaProtPrepareGenerationInputs([" ".join(list(structure))], desc, saprot_text_tokenizer, saprot_tokenizer)
seq = SaProtGeneration(saprot, SaProtInputDict, saprot_tokenizer)["sequence"]
print(seq)
The above code makes predictions based on Foldseek tokens. To convert a 3D structure file (e.g., `.pdb` or `.mmcif`) into Foldseek tokens, download the binary file from here and place it in the `assets/bin` folder. The following code demonstrates how to use it.
from utils.foldseek_utils import get_struc_seq
pdb_path = "assets/8ac8.cif"
# Extract the "A" chain from the structure file and encode it as a Foldseek struc_seq
foldseek_seq = get_struc_seq("assets/bin/foldseek", pdb_path, ["A"])["A"][1].lower()
print(f"foldseek_seq: {foldseek_seq}")
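The lowercased Foldseek string can then be fed back into the sequence-generation example above, which expects the structure tokens space-separated for the tokenizer. A trivial, self-contained sketch of that preprocessing step (the convention is inferred from the `" ".join(list(structure))` call earlier):

```python
def to_tokenizer_input(foldseek_seq: str) -> str:
    """Lowercase a Foldseek (3Di) string and space-separate its
    single-character tokens so the tokenizer treats each one individually."""
    return " ".join(foldseek_seq.lower())

print(to_tokenizer_input("DQD"))  # "d q d"
```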
For textual alignment, we recommend using ProTrek to calculate the sequence-text similarity score.
For foldability, we recommend using pLDDT and PAE, as output by the AlphaFold series or ESMFold.
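As a quick foldability check, the per-residue pLDDT can be read straight from a predicted structure: ESMFold and the AlphaFold series write it into the B-factor column of their output PDB files. A minimal, self-contained sketch (a helper we introduce for illustration, not part of this repository):

```python
def mean_plddt_from_pdb(pdb_text: str) -> float:
    """Average the B-factor column (PDB columns 61-66) over CA atoms.
    ESMFold and AlphaFold store per-residue pLDDT there, so the mean
    serves as a simple foldability proxy."""
    scores = [float(line[60:66])
              for line in pdb_text.splitlines()
              if line.startswith("ATOM") and line[12:16].strip() == "CA"]
    return sum(scores) / len(scores) if scores else 0.0
```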
- ProTrek and its online server
- Evola and its online server
- SaprotHub and its online server