init
aretasg authored and aretasg committed Jan 27, 2021
0 parents commit c0c2050
Showing 22 changed files with 3,356 additions and 0 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
.DS_Store
__pycache__
foo.py
foo2.py
get_cdrs.py
*.egg-info
.pytest*
*.ipynb
*.csv
71 changes: 71 additions & 0 deletions README.md
@@ -0,0 +1,71 @@
# PaCPaC (Paratope and Clonotype Probing and Clustering)

Python package to probe antibody VH sequences for a paratope or clonotype of interest and/or to cluster them into groups of similar paratopes or clonotypes.

## Requirements
* [conda](https://docs.conda.io/en/latest/miniconda.html)

## :rocket: Installation
```bash
git clone https://github.com/aretasg/pacpac.git
conda env create -f pacpac/environment.yml
conda activate pacpac
pip install ./pacpac
```

## :snake: Example usage
```python
import pandas as pd
from pacpac import pacpac

# Replace the quoted placeholders with your own file name, column name and probe sequence.
df = pd.read_csv("<my_data_set.csv>")
df = pacpac.cluster(df, "<vh_amino_acid_sequence_column_name>")
df = pacpac.probe("<probe_vh_amino_acid_sequence>", df, "<vh_amino_acid_sequence_column_name>")
```

## :gem: Features
* Sequence annotation with ANARCI is parallelized with pandarallel.
* Paratope predictions use the deep learning model Parapred (Liberis et al., 2018).
* Clustering uses a greedy incremental approach.
* Clustering is deterministic: the input data set is sorted by CDR lengths and amino acid sequence in descending order.
* Each cluster has a representative sequence, indicated by the keyword `seed`.
* Clonotyping is done at the amino acid sequence level. Silent mutations at the nucleotide sequence level due to SHM are not taken into account.
* Paratope clustering provides several clustering options.
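
The greedy incremental approach and the sorting step listed above can be sketched as follows. This is a minimal illustration and not PaCPaC's actual implementation: the function names, the single-string input and the toy similarity rule are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def greedy_incremental_cluster(
    sequences: List[str],
    is_similar: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Group sequences around seeds; returns a seed -> members mapping."""
    # Sort by length, then by sequence, both descending, so that the result
    # does not depend on the order of the input (assumed sort key).
    ordered = sorted(sequences, key=lambda s: (len(s), s), reverse=True)

    clusters: Dict[str, List[str]] = {}
    for seq in ordered:
        for seed in clusters:
            if is_similar(seed, seq):
                # Join the first existing cluster whose seed is similar enough.
                clusters[seed].append(seq)
                break
        else:
            # No existing seed is similar enough: seq seeds a new cluster.
            clusters[seq] = [seq]
    return clusters


def same_length_one_mismatch(a: str, b: str) -> bool:
    """Toy similarity rule (hypothetical): equal length, at most one mismatch."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


cdrs = ["ARDYW", "ARDFW", "GKTTVY", "ARDYW", "GKTSVY", "TRW"]
clusters = greedy_incremental_cluster(cdrs, same_length_one_mismatch)
# Reversing the input order yields the same clusters because of the sort.
shuffled = greedy_incremental_cluster(list(reversed(cdrs)), same_length_one_mismatch)
```

Because every sequence is compared against seeds only, the first (longest, alphabetically last) member of each group becomes its seed, which is why the sort order fixes the outcome.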

### Clustering options
* If `structural_equivalence` is set to `False`, CDR lengths are compared when paratyping, and CDRs of the same length are assumed to always have deletions at the same positions (positional equivalence; Richardson et al., 2020). See e.g. CL-97141 in `Pertussis_SC.csv` for outliers to this assumption.
* When set to `True` (default), structural equivalence as assigned by IMGT is used. This assumes that CDRs of different lengths can still have similar paratopes/binding and assigns additional scores for similar residues using the scoring system described by Wong et al., 2020. This option usually results in more sequences being clustered.
* Additionally, when `ignore_paratope_length_differences=False` (default), the number of paratope residue matches is divided by the residue count of the longer paratope, which makes the comparison more sensitive to paratope residue count mismatches.
* In general, if you want more sequences clustered, set both `structural_equivalence` and `ignore_paratope_length_differences` to `True`.
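
To illustrate the normalization used when `ignore_paratope_length_differences=False`, here is a minimal sketch that divides the number of matching paratope residues by the residue count of the longer paratope. The function name, the position-to-residue dictionaries and the position labels are assumptions for the example and may differ from how PaCPaC represents paratopes internally.

```python
from typing import Dict


def paratope_identity(
    paratope_a: Dict[str, str],
    paratope_b: Dict[str, str],
) -> float:
    """Fraction of matching paratope residues, relative to the longer paratope.

    Each paratope maps a position label (hypothetical, e.g. an IMGT position)
    to the residue predicted to be part of the paratope at that position.
    """
    if not paratope_a or not paratope_b:
        return 0.0
    # A match requires the same residue at the same position in both paratopes.
    matches = sum(
        1 for pos, res in paratope_a.items() if paratope_b.get(pos) == res
    )
    # Dividing by the longer paratope penalizes residue count mismatches.
    return matches / max(len(paratope_a), len(paratope_b))


a = {"107": "R", "108": "D", "109": "Y", "110": "G"}
b = {"107": "R", "108": "D", "109": "F"}
score = paratope_identity(a, b)  # 2 matches / 4 residues in the longer paratope
```

With this normalization, a short paratope that is fully contained in a much longer one still scores low, which is the sensitivity to length mismatches described above.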

## Probing and clustering arguments
```python
help(pacpac.cluster)
help(pacpac.probe)
```

## :checkered_flag: Benchmarks with 10K VH sequences on 4 conventional CPU cores
| Task | Time (s) | Notes |
| -----------: | ----------------- | :----------: |
| Annotations using anarci | 378 | parallel execution |
| Paratope prediction using parapred | 494 | parallel execution without CPU/GPU speed up for TensorFlow |
| Clonotype clustering | 25 | on amino acid level |
| Paratope clustering | 42 | `structural_equivalence=False` |
| Paratope clustering | 200 | `structural_equivalence=True` |
| Probing | <0.1 | clonotyping & paratyping |

Annotating the data set and running Parapred are the performance bottlenecks and can be sped up with more cores and/or CPU/GPU speed-up for TensorFlow.

## :pencil2: Authors
Written by **Aretas Gaspariunas**. Have a question? You can always ask and I can always ignore.

## References
- Liberis et al., 2018
- Richardson et al., 2020
- Wong et al., 2020

## :apple: Citing
If you found PaCPaC useful for your work please acknowledge it by citing this repository.

## License
BSD license.
18 changes: 18 additions & 0 deletions environment.yml
@@ -0,0 +1,18 @@
name: pacpac
channels:
- defaults
- conda-forge
- bioconda
dependencies:
- anarci=2020.04.23=py_3
- docopt=0.6.2=py36_0
- h5py=2.10.0
- numpy=1.19.2
- pandas=0.23.4
- pip=20.2.4=py36_0
- python=3.6.12
- tensorflow=1.2.1=py36_0
- pip:
  - keras==2.0.6
  - pandarallel==1.5.1
  - pyfiglet==0.8.post1
Empty file added pacpac/__init__.py
Empty file.