Recent development of high-accuracy protein structure predictors has resulted in more and more predicted protein structure models being deposited in public databases such as the AlphaFold DB. These large structure datasets consume huge amounts of disk space. For example, the full AlphaFold DB release in 2022 contains 23 TB of data, and this is expected to keep increasing. To address the data storage issue, the PDC package converts full-atom PDB and mmCIF format protein structure models to and from the highly compressed .pdc format, which is specifically designed for AlphaFold predicted protein structures.
make
PDC does not run natively on Windows. However, it can be run on Windows Subsystem for Linux.
Lossless compression:
pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz
Lossy compression:
pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=2
Lossless compression (CA atoms only):
pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=3
Lossy compression (CA atoms only):
pdc AF-P11532-F2-model_v3.pdb.gz AF-P11532-F2-model_v3.pdc.gz -l=4
Uncompress:
pdd AF-P11532-F2-model_v3.pdc.gz AF-P11532-F2-model_v3.pdb.gz
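To compress many files, the commands above can be wrapped in a small script. The following is a minimal sketch, assuming the `pdc` executable is on `PATH`. The names `MODES`, `pdc_command` and `compress_all` are illustrative helpers and are not part of PDC; the `-l` values are the ones shown in the examples above.

```python
import subprocess

# -l values as listed in the usage examples above.
# None means "pass no -l flag", which is the lossless example.
MODES = {
    "lossless": None,
    "lossy": 2,
    "lossless_ca": 3,
    "lossy_ca": 4,
}


def pdc_command(src, dst, mode="lossless", exe="pdc"):
    """Build the argument list for one pdc call (hypothetical helper)."""
    argv = [exe, str(src), str(dst)]
    level = MODES[mode]
    if level is not None:
        argv.append(f"-l={level}")
    return argv


def compress_all(pairs, mode="lossless"):
    """Run pdc on each (input, output) pair and stop at the first failure."""
    for src, dst in pairs:
        subprocess.run(pdc_command(src, dst, mode), check=True)
```

For example, `pdc_command("AF-P11532-F2-model_v3.pdb.gz", "AF-P11532-F2-model_v3.pdc.gz", mode="lossy")` builds the same command line as the lossy compression example.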
PDC decreases the size of protein coordinate files in PDB or mmCIF format through the following approaches:
- Removal of repetitive information among different atoms, such as the chain ID and residue index.
- Use int and char instead of string to store coordinates and B-factors. Specifically, since xyz and bfactor are written as %8.3f and %6.2f, they are in the range of -999.999 to 9999.999 and -99.99 to 999.99, respectively. This means that they can be expressed as integers in the range of 0 to 10999998 and 0 to 109998, respectively, both of which can be stored as uint32.
- Delta encoding: store the difference in coordinate/bfactor from the previous value rather than the actual value, which can be stored as int16 or int8.
- Under lossy compression mode, store the torsion angles rather than the coordinates.
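The integer conversion and delta encoding described above can be illustrated with a short Python sketch. This is not PDC's actual implementation or on-disk layout. The function names and the offsets (chosen so that the minimum printable value maps to 0) are assumptions made for illustration, and the coordinates are made-up sample values.

```python
XYZ_OFFSET = 999_999   # maps -999.999 to 0 and 9999.999 to 10999998
BFAC_OFFSET = 9_999    # maps -99.99 to 0 and 999.99 to 109998


def to_fixed(value, decimals, offset):
    """Scale a decimal field to a non-negative integer."""
    return round(value * 10 ** decimals) + offset


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Made-up x coordinates of four consecutive atoms.
xs = [12.345, 13.801, 13.120, 14.502]
fixed = [to_fixed(x, 3, XYZ_OFFSET) for x in xs]
deltas = delta_encode(fixed)

# After the first entry, every delta here fits in a signed 16-bit integer.
fits_int16 = all(-32768 <= d <= 32767 for d in deltas[1:])

# The round trip recovers the three-decimal coordinates exactly.
restored = [(v - XYZ_OFFSET) / 1000 for v in delta_decode(deltas)]
```

Because bonded atoms are close in space, consecutive coordinates differ by only a few ångströms, so the differences are far smaller than the absolute values and need fewer bytes.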
PDC is specifically designed for protein models in the AlphaFold database. It cannot convert all the information in a PDB or mmCIF file, especially a file from the PDB database. In particular,
- Information for small molecule ligands and non-standard residues (HETATM and CONECT) is ignored.
- A file cannot be parsed if there are missing atoms in some residues.
- Only MODEL 1 of a multi-model structure will be converted.
- For atoms with alternative locations, only atoms with alternative locations ' ' or 'A' will be considered.
- Hydrogens are ignored.
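Before compressing a file that does not come from AlphaFold DB, it can help to check whether it contains any of the features listed above. The following is a rough sketch for PDB format input only. The helper names `scan_pdb` and `scan_file` are hypothetical and not part of PDC. It relies on the standard PDB column layout and assumes the element column (77-78) is populated. It does not check for missing atoms.

```python
import gzip


def scan_pdb(lines):
    """Count records that the limitations above say are ignored or dropped."""
    counts = {"hetatm": 0, "conect": 0, "models": 0,
              "other_altloc": 0, "hydrogen": 0}
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            counts["models"] += 1
        elif record == "CONECT":
            counts["conect"] += 1
        elif record in ("ATOM", "HETATM"):
            if record == "HETATM":
                counts["hetatm"] += 1
            # Column 17 holds the alternative location indicator.
            if line[16:17] not in ("", " ", "A"):
                counts["other_altloc"] += 1
            # Columns 77-78 hold the element symbol.
            if line[76:78].strip().upper() == "H":
                counts["hydrogen"] += 1
    return counts


def scan_file(path):
    """Scan a plain or gzip-compressed PDB file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return scan_pdb(handle)
```

A file for which every count is zero, or for which `models` is at most 1 and the other counts are zero, has none of the features counted here.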
PDC, MMTF, PIC and BinaryCIF were applied to the E. coli proteome of AlphaFold DB. The file sizes after gzip compression are shown below.
File format | File size (MB) | Lossless/Lossy |
---|---|---|
CIF | 273 | Lossless |
PDB | 196 | Lossless |
PIC | 163 | Lossy |
BinaryCIF | 143 | Lossless |
MMTF | 76 | Lossless |
PDC | 67 | Lossless |
PDC | 26 | Lossy |
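
To put these sizes in relative terms, the short Python sketch below computes how many times smaller each format is than the gzipped PDB files. The numbers are copied from the table; the variable names are illustrative only.

```python
# Sizes in MB after gzip compression, copied from the table above.
sizes_mb = {
    "CIF": 273,
    "PDB": 196,
    "PIC": 163,
    "BinaryCIF": 143,
    "MMTF": 76,
    "PDC (lossless)": 67,
    "PDC (lossy)": 26,
}

baseline = sizes_mb["PDB"]
ratios = {name: baseline / size for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name:<15} {sizes_mb[name]:>4} MB  {ratio:5.2f}x vs. gzipped PDB")
```

Relative to gzipped PDB, lossless PDC is about 2.9 times smaller and lossy PDC about 7.5 times smaller on this dataset.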
Conversion to/from MMTF was performed by Atomium:
import atomium
pdb = atomium.open("input.pdb")
pdb.model.save("compressed.mmtf")  # compression
import atomium
pdb = atomium.open("compressed.mmtf")
pdb.model.save("decompressed.pdb")  # decompression
Conversion to/from PIC was performed by PIC:
PIC.py -k input.pdb # compression
PIC.py -dk input # decompression
Conversion to/from BinaryCIF was performed by modelcif:
import modelcif.reader
import modelcif.dumper
with open("input.cif") as fh:
    s = modelcif.reader.read(fh, format="mmCIF")
with open("compressed.bcif", "wb") as fh:
    modelcif.dumper.write(fh, s, format="BCIF")  # compression
import modelcif.reader
import modelcif.dumper
with open("compressed.bcif", "rb") as fh:
    s = modelcif.reader.read(fh, format="BCIF")
with open("decompressed.cif", "w") as fh:
    modelcif.dumper.write(fh, s, format="mmCIF")  # decompression
Chengxin Zhang, Anna Marie Pyle (2023) "PDC: a highly compact file format to store protein 3D coordinates." Database: The Journal of Biological Databases and Curation. baad018.