init

aretasg · Jan 27, 2021 · c0c2050 · c0c2050
commit c0c2050

Showing 22 changed files with 3,356 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,9 @@
+.DS_Store
+__pycache__
+foo.py
+foo2.py
+get_cdrs.py
+*.egg-info
+.pytest*
+*.ipynb
+*.csv
diff --git a/README.md b/README.md
@@ -0,0 +1,71 @@
+# PaCPaC (Paratope and Clonotype Probing and Clustering)
+
+Python package to probe antibody VH sequences for a paratope/clonotype of interest and/or cluster into groups of similar paratopes/clonotypes
+
+## Requirements
+* [conda](https://docs.conda.io/en/latest/miniconda.html)
+
+## :rocket: Installation
+```bash
+git clone https://github.com/aretasg/pacpac.git
+conda env create -f pacpac/environment.yml
+conda activate pacpac
+pip install ./pacpac
+```
+
+## :snake: Example usage
+```python
+import pandas as pd
+from pacpac import pacpac
+
+df = pd.read_csv("<my_data_set.csv>")
+df = pacpac.cluster(df, "<vh_amino_acid_sequence_column_name>")
+df = pacpac.probe("<probe_vh_amino_acid_sequence>", df, "<vh_amino_acid_sequence_column_name>")
+```
+
+## :gem: Features
+* Sequence annotation by anarci is parallelized with pandarallel.
+* Deep learning model Parapred (Liberis et al., 2018) for paratope predictions.
+* Clusters sequences using a greedy incremental approach.
+* Clustering is deterministic: the input data set is sorted by CDR lengths and amino acid sequence in descending order.
+* Each cluster has a representative sequence, indicated by the keyword `seed`.
+* Clonotyping is done at the amino acid sequence level. Silent mutations at the nucleotide sequence level due to SHM are not taken into account.
+* Paratope clustering provides several clustering options.
+
+### Clustering options
+* If `structural_equivalence` is set to `False`, CDR lengths are compared when paratyping and CDRs of the same length are assumed to always have deletions at the same positions (positional equivalence; Richardson et al., 2020). See e.g. CL-97141 in `Pertussis_SC.csv` for outliers to this assumption.
+* When set to `True`, structural equivalence as assigned by IMGT is used. This assumes that CDRs of different lengths can still have similar paratopes/binding and assigns additional scores for similar residues using the scoring system described by Wong et al., 2020. This option usually results in more sequences being clustered (default).
+* Additionally, when `ignore_paratope_length_differences=False`, the number of paratope residue matches is divided by the longer paratope residue count, which makes the score more sensitive to paratope residue count mismatches (default).
+* In general, if you want more sequences clustered, set both the `structural_equivalence` and `ignore_paratope_length_differences` arguments to `True`.
+
+## Probing and clustering arguments
+```python
+help(pacpac.cluster)
+help(pacpac.probe)
+```
+
+## :checkered_flag: Benchmarks with 10K VH sequences on 4 conventional CPU cores
+| Task | Time (s) | Notes |
+| -----------: | ----------------- | :----------: |
+| Annotations using anarci | 378 | parallel execution |
+| Paratope prediction using parapred | 494 | parallel execution without CPU/GPU speed up for TensorFlow |
+| Clonotype clustering | 25 | on amino acid level |
+| Paratope clustering | 42 | `structural_equivalence=False` |
+| Paratope clustering | 200 | `structural_equivalence=True` |
+| Probing | <0.1 | clonotyping & paratyping |
+
+Annotating the data set and running parapred are the performance bottlenecks; both can be sped up with more cores and/or a CPU/GPU speed-up for TensorFlow.
+
+## :pencil2: Authors
+Written by **Aretas Gaspariunas**. Have a question? You can always ask and I can always ignore.
+
+## References
+- Liberis et al., 2018
+- Richardson et al., 2020
+- Wong et al., 2020
+
+## :apple: Citing
+If you found PaCPaC useful for your work please acknowledge it by citing this repository.
+
+## License
+BSD license.
diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,18 @@
+name: pacpac
+channels:
+  - defaults
+  - conda-forge
+  - bioconda
+dependencies:
+  - anarci=2020.04.23=py_3
+  - docopt=0.6.2=py36_0
+  - h5py=2.10.0
+  - numpy=1.19.2
+  - pandas=0.23.4
+  - pip=20.2.4=py36_0
+  - python=3.6.12
+  - tensorflow=1.2.1=py36_0
+  - pip:
+    - keras==2.0.6
+    - pandarallel==1.5.1
+    - pyfiglet==0.8.post1
diff --git a/pacpac/__init__.py b/pacpac/__init__.py