Commit
aretasg authored and committed Jan 27, 2021
0 parents, commit c0c2050
Showing 22 changed files with 3,356 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
.DS_Store | ||
__pycache__ | ||
foo.py | ||
foo2.py | ||
get_cdrs.py | ||
*.egg-info | ||
.pytest* | ||
*.ipynb | ||
*.csv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# PaCPaC (Paratope and Clonotype Probing and Clustering) | ||
|
||
Python package to probe antibody VH sequences for a paratope/clonotype of interest and/or cluster them into groups of similar paratopes/clonotypes. | ||
|
||
## Requirements | ||
* [conda](https://docs.conda.io/en/latest/miniconda.html) | ||
|
||
## :rocket: Installation | ||
```bash | ||
git clone https://github.com/aretasg/pacpac.git | ||
conda env create -f pacpac/environment.yml | ||
conda activate pacpac | ||
pip install ./pacpac | ||
``` | ||
|
||
## :snake: Example usage | ||
```python | ||
import pandas as pd | ||
from pacpac import pacpac | ||
|
||
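# Replace the quoted placeholders with your own file path, column name and probe sequence | ||
df = pd.read_csv("<my_data_set.csv>") | ||
df = pacpac.cluster(df, "<vh_amino_acid_sequence_column_name>") | ||
df = pacpac.probe("<probe_vh_amino_acid_sequence>", df, "<vh_amino_acid_sequence_column_name>") | ||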
``` | ||
|
||
## :gem: Features | ||
* Sequence annotation by ANARCI is parallelized with pandarallel. | ||
* Uses the deep learning model Parapred (Liberis et al., 2018) for paratope prediction. | ||
* Clusters using a greedy incremental approach. | ||
* Determinism, when clustering, is achieved by sorting the input data set by CDR lengths and amino acid sequence in descending order. | ||
* Each cluster has a representative sequence, indicated by the keyword `seed`. | ||
* Clonotyping is done at the amino acid sequence level. Silent mutations at the nucleotide sequence level due to SHM are not taken into account. | ||
* Paratope clustering provides several clustering options. | ||
|
||
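The greedy incremental approach and the sort-based determinism listed above can be sketched as follows. This is a minimal illustration, not PaCPaC's actual implementation: the function names `identity` and `greedy_cluster`, the similarity measure (fraction of identical positions for equal-length sequences) and the `threshold` parameter are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of matching positions, 0.0 if lengths differ."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(
    seqs: List[str],
    threshold: float = 0.8,
    similarity: Callable[[str, str], float] = identity,
) -> Dict[str, List[str]]:
    """Map each seed (cluster representative) to the members of its cluster."""
    # Sort by length, then by sequence, both descending, so the outcome
    # does not depend on the order of the input.
    ordered = sorted(set(seqs), key=lambda s: (len(s), s), reverse=True)
    clusters: Dict[str, List[str]] = {}
    for seq in ordered:
        # Compare only against existing seeds, never against other members.
        for seed in clusters:
            if similarity(seq, seed) >= threshold:
                clusters[seed].append(seq)
                break
        else:
            # No seed was similar enough: this sequence starts a new cluster.
            clusters[seq] = [seq]
    return clusters


toy = ["ARDYW", "ARDFW", "ARGGGYW", "TRDYW", "ARDYW"]
result = greedy_cluster(toy, threshold=0.8)
shuffled = greedy_cluster(list(reversed(toy)), threshold=0.8)
```

Because the input is sorted before clustering, `result` and `shuffled` contain the same clusters. Note that a sequence is only ever compared with seeds, so two similar sequences can end up in different clusters if neither is close enough to the other's seed.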
### Clustering options | ||
* If `structural_equivalence` is set to `False`, paratyping compares CDR lengths and assumes that CDRs of the same length always have deletions at the same positions, i.e. positional equivalence (Richardson et al., 2020). For an outlier to this assumption, see e.g. CL-97141 in `Pertussis_SC.csv`. | ||
* If set to `True`, structural equivalence as assigned by IMGT is used. This assumes that CDRs of different lengths can still have similar paratopes/binding and assigns additional scores for similar residues using the scoring system described by Wong et al., 2020. This option usually results in more sequences being clustered (default). | ||
* Additionally, when `ignore_paratope_length_differences=False`, the number of paratope residue matches is divided by the residue count of the longer paratope, which makes the score more sensitive to paratope residue count mismatches (default). | ||
* In general, if you want more sequences clustered, set both `structural_equivalence` and `ignore_paratope_length_differences` to `True`. | ||
|
||
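The effect of `ignore_paratope_length_differences` on the score can be illustrated with a small sketch. Only the `False` branch (dividing by the longer paratope's residue count) is described above; the `True` branch shown here (dividing by the shorter count) is an assumption made purely for illustration, as are the function name `paratope_similarity` and the position-to-residue dictionaries.

```python
from typing import Dict


def paratope_similarity(
    a: Dict[str, str],
    b: Dict[str, str],
    ignore_paratope_length_differences: bool = False,
) -> float:
    """Hypothetical score for two paratopes given as {position: residue}."""
    if not a or not b:
        return 0.0
    # A match is the same residue at the same position in both paratopes.
    matches = sum(1 for pos, res in a.items() if b.get(pos) == res)
    if ignore_paratope_length_differences:
        # Assumption for this sketch: normalize by the shorter paratope.
        denominator = min(len(a), len(b))
    else:
        # As described above: normalize by the longer paratope.
        denominator = max(len(a), len(b))
    return matches / denominator


# Toy paratopes; the position labels are made up for the example.
p1 = {"31": "S", "33": "Y", "52": "N", "100": "D"}
p2 = {"31": "S", "33": "Y", "52": "N", "56": "T", "58": "Y", "100": "D"}

strict = paratope_similarity(p1, p2)
lenient = paratope_similarity(p1, p2, ignore_paratope_length_differences=True)
```

Here all four residues of `p1` are found in `p2`, so `strict` is 4 matches over 6 residues while `lenient` is 4 over 4. Under a fixed similarity threshold, the lenient score would place the two paratopes in the same cluster more readily.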
## Probing and clustering arguments | ||
```python | ||
help(pacpac.cluster) | ||
help(pacpac.probe) | ||
``` | ||
|
||
## :checkered_flag: Benchmarks with 10K VH sequences with 4 conventional CPU cores | ||
| Task | Time (s) | Notes | | ||
| -----------: | ----------------- | :----------: | | ||
| Annotations using anarci | 378 | parallel execution | | ||
| Paratope prediction using parapred | 494 | parallel execution without CPU/GPU speed up for TensorFlow | | ||
| Clonotype clustering | 25 | on amino acid level | | ||
| Paratope clustering | 42 | `structural_equivalence=False` | | ||
| Paratope clustering | 200 | `structural_equivalence=True` | | ||
| Probing | <0.1 | clonotyping & paratyping | | ||
|
||
Annotating the data set and running Parapred are the performance bottlenecks; both can be sped up with more cores and/or CPU/GPU acceleration for TensorFlow. | ||
|
||
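To put the bottleneck in numbers, the timings from the table can be summed per run. This sketch assumes a single run performs annotation, paratope prediction, clonotype clustering and one paratope clustering pass; probing is left out because it is reported only as under 0.1 s.

```python
# Timings in seconds, copied from the benchmark table above.
annotation = 378
parapred = 494
clonotype_clustering = 25
paratope_clustering = {False: 42, True: 200}  # keyed by structural_equivalence

shares = {}
for structural_equivalence, clustering_time in paratope_clustering.items():
    total = annotation + parapred + clonotype_clustering + clustering_time
    # Fraction of the run spent on annotation plus paratope prediction.
    shares[structural_equivalence] = (annotation + parapred) / total

for setting, share in shares.items():
    print(f"structural_equivalence={setting}: {share:.1%} of run time")
```

Under that assumption, annotation and paratope prediction together account for roughly 80% to 93% of the run time, depending on the clustering setting.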
## :pencil2: Authors | ||
Written by **Aretas Gaspariunas**. Have a question? You can always ask and I can always ignore. | ||
|
||
## References | ||
- Liberis et al., 2018 | ||
- Richardson et al., 2020 | ||
- Wong et al., 2020 | ||
|
||
## :apple: Citing | ||
If you found PaCPaC useful for your work, please acknowledge it by citing this repository. | ||
|
||
## License | ||
BSD license. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
name: pacpac | ||
channels: | ||
  - defaults | ||
  - conda-forge | ||
  - bioconda | ||
dependencies: | ||
  - anarci=2020.04.23=py_3 | ||
  - docopt=0.6.2=py36_0 | ||
  - h5py=2.10.0 | ||
  - numpy=1.19.2 | ||
  - pandas=0.23.4 | ||
  - pip=20.2.4=py36_0 | ||
  - python=3.6.12 | ||
  - tensorflow=1.2.1=py36_0 | ||
  - pip: | ||
    - keras==2.0.6 | ||
    - pandarallel==1.5.1 | ||
    - pyfiglet==0.8.post1 |
Empty file.