PDC: Protein Data Compressor

Introduction

Recent development of high-accuracy protein structure predictors has resulted in more and more predicted protein structure models being deposited in public databases such as the AlphaFold DB. These large protein structure datasets consume huge amounts of disk space. For example, the full AlphaFold DB release in 2022 contains 23 TB of data, and this figure is expected to keep increasing. To address the storage issue, the PDC package converts full-atom PDB and mmCIF format protein structure models to and from the highly compressed .pdc format, which is specifically designed for AlphaFold-predicted protein structures.

Installation

make

PDC does not run natively on Windows. However, it can be run on Windows Subsystem for Linux.

Usage

Lossless compression:

pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz

Lossy compression:

pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=2

Lossless compression (CA atoms only):

pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=3

Lossy compression (CA atoms only):

pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=4

Uncompress:

pdd AF-P11532-F2-model_v3.pdc.gz AF-P11532-F2-model_v3.pdb.gz
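For batch use, the commands above can be wrapped in a short script. The following is a minimal sketch, assuming the `pdc` binary built by `make` is on `PATH`. The helper names `build_pdc_command` and `compress_dir` are illustrative and not part of PDC, and the only option used is the `-l=N` flag shown above.

```python
import subprocess
from pathlib import Path


def build_pdc_command(infile, outfile, level=None):
    """Return the argument list for one pdc call (hypothetical helper).

    level=None runs the default lossless mode; otherwise the value
    (2, 3 or 4 in the examples above) is passed through as -l=<level>.
    """
    cmd = ["pdc", str(infile), str(outfile)]
    if level is not None:
        cmd.append(f"-l={level}")
    return cmd


def compress_dir(indir, outdir, level=None):
    """Run pdc on every *.pdb.gz file in indir (hypothetical helper)."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for infile in sorted(Path(indir).glob("*.pdb.gz")):
        # AF-...-model_v3.pdb.gz -> AF-...-model_v3.pdc.gz
        outfile = outdir / (infile.name[: -len(".pdb.gz")] + ".pdc.gz")
        # check=True raises CalledProcessError if pdc exits with an error
        subprocess.run(build_pdc_command(infile, outfile, level), check=True)


if __name__ == "__main__":
    print(build_pdc_command("AF-P11532-F2-model_v3.pdb.gz",
                            "AF-P11532-F2-model_v3.pdc.gz", level=2))
```

Calling `compress_dir("models", "compressed", level=2)` would then run the lossy mode on each file in a (hypothetical) `models` directory.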

Approach and Implementation

PDC decreases the size of protein coordinate files in PDB or mmCIF format through the following four approaches:

Removal of repetitive information among different atoms, such as the chain ID and residue index.
Use of int and char instead of string to store coordinates and B-factors. Specifically, since xyz and bfactor are written as %8.3f and %6.2f, they are in the range of -999.999 to 9999.999 and -99.99 to 999.99, respectively. This means that they can be expressed as integers in the range of 0 to 10999998 and 0 to 109998, respectively, both of which can be stored as uint32.
Delta encoding: store the difference in coordinate/bfactor from the previous value rather than the actual value, which can be stored as int16 or int8.
Under lossy compression mode, store the torsion angles rather than the coordinates.
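The integer conversion and delta encoding described above can be illustrated with a short sketch. This is not PDC's actual C++ code or on-disk layout; the function names and the sample coordinates are made up for illustration, and only the value ranges are taken from the text.

```python
def to_fixed(value, decimals, offset):
    """Map a fixed-width decimal field to a non-negative integer.

    A %8.3f coordinate uses decimals=3 and offset=999.999, so that
    -999.999 maps to 0 and 9999.999 maps to 10999998.
    """
    scale = 10 ** decimals
    return round(value * scale) + round(offset * scale)


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def fits_int16(deltas):
    """True if every difference (all but the first entry) fits in int16."""
    return all(-32768 <= d <= 32767 for d in deltas[1:])


# Made-up x coordinates of four consecutive atoms (illustration only).
xs = [11.104, 12.560, 13.005, 12.871]
fixed = [to_fixed(x, 3, 999.999) for x in xs]
deltas = delta_encode(fixed)
assert delta_decode(deltas) == fixed  # the round trip is exact
print(deltas, fits_int16(deltas))
```

Because bonded atoms sit within a few angstroms of each other, the differences are small even though the absolute integers need up to 24 bits, which is why a narrower integer type is enough for most entries.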

Limitations

PDC is specifically designed for protein models in the AlphaFold database. It is not able to convert all information of a PDB or mmCIF file, especially those from the PDB database. In particular,

Information for small molecule ligands and non-standard residues (HETATM and CONECT) is ignored.
A file cannot be parsed if there are missing atoms in some residues.
Only MODEL 1 of a multi-model structure will be converted.
For atoms with alternative locations, only atoms with alternative locations ' ' or 'A' will be considered.
Hydrogens are ignored.
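Before compressing files that do not come from AlphaFold DB, it can help to check how much content falls under the limitations above. The following is a rough sketch that scans PDB-format lines using the standard fixed-column layout (record name in columns 1-6, alternative location in column 17, element symbol in columns 77-78). The function name `scan_pdb_lines` is hypothetical and not part of PDC, the element columns are assumed to be filled in, and residues with missing atoms are not checked.

```python
def scan_pdb_lines(lines):
    """Count records that fall under the limitations listed above."""
    report = {"models": 0, "hetatm": 0, "conect": 0,
              "altloc_dropped": 0, "hydrogens": 0}
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            report["models"] += 1          # only MODEL 1 is converted
        elif record == "HETATM":
            report["hetatm"] += 1          # ligands, non-standard residues
        elif record == "CONECT":
            report["conect"] += 1
        elif record == "ATOM":
            altloc = line[16:17] or " "    # column 17; blank if line is short
            if altloc not in (" ", "A"):
                report["altloc_dropped"] += 1
            if line[76:78].strip().upper() == "H":  # columns 77-78
                report["hydrogens"] += 1
    return report


if __name__ == "__main__":
    import gzip
    import sys
    # Usage (assumed): python scan_pdb.py model.pdb.gz
    if len(sys.argv) > 1:
        with gzip.open(sys.argv[1], "rt") as handle:
            print(scan_pdb_lines(handle))
```

A report with all counts at zero (or `models` at most 1) suggests that none of the limitations checked here apply to the file.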

Benchmark

PDC, MMTF, PIC and BinaryCIF were applied to the E. coli proteome from AlphaFold DB. The file sizes after gzip compression are shown below.

File format	File size (MB)	Lossless/Lossy
CIF	273	Lossless
PDB	196	Lossless
PIC	163	Lossy
BinaryCIF	143	Lossless
MMTF	76	Lossless
PDC	67	Lossless
PDC	26	Lossy
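To put the table in relative terms, the sizes can be turned into ratios against a chosen baseline. The snippet below is a small illustrative calculation; the numbers are copied from the table, while the dictionary keys and the helper `ratio_vs` are made up for this example.

```python
# Gzipped sizes in MB, copied from the benchmark table.
sizes_mb = {
    "CIF": 273,
    "PDB": 196,
    "PIC (lossy)": 163,
    "BinaryCIF": 143,
    "MMTF": 76,
    "PDC (lossless)": 67,
    "PDC (lossy)": 26,
}


def ratio_vs(baseline, sizes=sizes_mb):
    """Return {format: baseline size / format size} (hypothetical helper).

    A value above 1 means the format is smaller than the baseline.
    """
    base = sizes[baseline]
    return {name: base / size for name, size in sizes.items()}


for name, ratio in ratio_vs("PDB").items():
    print(f"{name:15s} {ratio:5.2f}")
```

Relative to gzipped PDB, lossless PDC comes out at roughly a 2.9-fold reduction and lossy PDC at roughly 7.5-fold.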

Conversion to/from MMTF was performed by Atomium:

import atomium
pdb = atomium.open("input.pdb")
pdb.model.save("compressed.mmtf")

import atomium
pdb = atomium.open("compressed.mmtf")
pdb.model.save("decompressed.pdb")

Conversion to/from PIC was performed by PIC:

PIC.py -k input.pdb # compression
PIC.py -dk input # decompression

Conversion to/from BinaryCIF was performed by modelcif:

import modelcif.reader
import modelcif.dumper
s=modelcif.reader.read(open("input.cif"),format="mmCIF")
modelcif.dumper.write(open("compressed.bcif",'wb'),s,format="BCIF")

import modelcif.reader
import modelcif.dumper
s=modelcif.reader.read(open("compressed.bcif",'rb'),format="BCIF")
modelcif.dumper.write(open("decompressed.cif",'w'),s,format="mmCIF")

Reference

Chengxin Zhang, Anna Marie Pyle (2023) "PDC: a highly compact file format to store protein 3D coordinates." Database: The Journal of Biological Databases and Curation. baad018.
