Skip to content

Commit

Permalink
Merge pull request #18 from RabadanLab/dev
Browse files Browse the repository at this point in the history
Dev
  • Loading branch information
roseorenbuch authored Sep 12, 2019
2 parents 450e12c + 301085e commit 4881752
Show file tree
Hide file tree
Showing 11 changed files with 461 additions and 102 deletions.
54 changes: 38 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,10 @@
arcasHLA performs high resolution genotyping for HLA class I and class II genes from RNA sequencing, supporting both paired and single-end samples.

### Dependencies ###
arcasHLA requires the following utilities:
- [Git Large File Storage](https://github.com/git-lfs/git-lfs/wiki/Installation)
- coreutils

Make sure the following programs are in your `PATH`:
- [Samtools v1.19](http://www.htslib.org/)
- [bedtools v2.27.1](http://bedtools.readthedocs.io/)
Expand All @@ -16,8 +20,6 @@ arcasHLA requires the following Python modules:
- SciPy
- Pandas

arcasHLA also requires coreutils.

### Test ###
In order to test arcasHLA partial typing, we need to roll back the reference to an earlier version. First, fetch IMGT/HLA database version 3.24.0:
```
Expand Down Expand Up @@ -55,16 +57,10 @@ Expected output in `test/output/test.partial_genotype.json`:
"DQB1": ["DQB1*06:04:01", "DQB1*02:02:01"],
"DRB1": ["DRB1*03:02:01", "DRB1*14:02:01"]}
```
Before further usage, remember to update to the 3.34.0.
```
./arcasHLA reference --version 3.34.0
Remember to update the HLA reference using the following command.
```
At this time, cloning the latest version of IMGTHLA (3.35.0) results in a corrupted database due to an issue with GitHub's Large File Storage. To update the arcasHLA's reference to the current version, run the following commands.
./arcasHLA reference --update
```
curl https://media.githubusercontent.com/media/ANHIG/IMGTHLA/Latest/hla.dat > dat/IMGTHLA/hla.dat
./arcasHLA reference --rebuild --v
```


### Usage ###

Expand Down Expand Up @@ -124,12 +120,13 @@ arcasHLA genotype [options] /path/to/sample.alignment.p
#### Options ####
- `-g, --genes GENES` : comma separated list of HLA genes (ex. A,B,C,DQA1,DQB1,DRB1)
- `-p, --population POPULATION` : sample population, options are asian_pacific_islander, black, caucasian, hispanic, native_american and prior (default: Prior)
- `--min_count INT` : minimum gene read count required for genotyping (default: 75)
- `--tolerance FLOAT` : convergence tolerance for transcript quantification (default: 10e-7)
- `--max_iterations INT` : maximmum number of iterations for transcript quantification (default: 1000)
- `--drop_iterations INT` : number of iterations before dropping low support alleles, a lower number of iterations is recommended for single-end and low read couunt samples (default: paired - 10, single - 4)
- `--drop_iterations INT` : number of iterations before dropping low support alleles, a lower number of iterations is recommended for single-end and low read count samples (default: paired - 10, single - 4)
- `--drop_threshold FLOAT` : proportion of maximum abundance an allele needs to not be dropped (default: 0.1)
- `--zygosity_threshold FLOAT` : threshold for ratio of minor to major allele nonshared count to determine zygosity (default: 0.15)
- `--log FILE` : log file for run summary (default: sample.genotype.log)
- `--log FILE` : log file for run summary (default: `sample.genotype.log`)
- `--o, --outdir DIR` : output directory (default: `.`)
- `--temp DIR` : temp directory (default: `/tmp`)
- `--keep_files` : keep intermediate files (default: False)
Expand All @@ -148,15 +145,33 @@ Output: `sample.partial_alignment.p`, `sample.partial_genotype.json`
The options for partial typing are the same as genotype. Partial typing can be run from the intermediate alignment file.

### Merge jsons ###
To make analysis easier, this command will merge all jsons produced by genotyping. All `.genotype.json` files will be merged into a single `run.genotypes.json` file and all `.partial_genotype.json` files will be merged into `run.partial_genotypes.json`.
To make analysis easier, this command will merge all jsons produced by genotyping into a single table. All `.genotype.json` files will be merged into a single `run.genotypes.tsv` file and all `.partial_genotype.json` files will be merged into `run.partial_genotypes.tsv`. In addition, HLA locus read counts and relative abundance produced by alignment will be merged into a single tsv file.
```
arcasHLA merge [options]
```
#### Options ####
- `--run RUN` : run name
- `--i, --indir DIR` : input directory (default: `.`)
- `--o, --outdir DIR` : output directory (default: `.`)
- `-v, --verbose` : verbosity (default: False)
- `-v, --verbose` : toggle verbosity

### Convert HLA nomenclature ###
arcasHLA convert changes alleles in a tsv file from its input form to a specified grouped nomenclature (P-group or G-group) or a specified number of fields (i.e. 1, 2 or 3 fields in resolution). This file can be produced by arcasHLA merge or any tsv following the same structure:

| subject | A1 | A2 | B1 | B2 | C1 | C2 |
|-------------- |------------ |------------ |------------ |------------ |------------ |------------ |
| subject_name | A*01:01:01 | A*01:01:01 | B*07:02:01 | B*07:02:01 | C*04:01:01 | C*04:01:01 |

P-group (alleles sharing the same amino acid sequence in the antigen-binding region) and G-group (alleles sharing the same base sequence in the antigen-binding region) can only be reduced to 1-field resolution as alleles with differing 2nd fields can be in the same group. By the same reasoning, P-group cannot be converted into G-group.

```
arcasHLA convert --resolution [resolution] genotypes.tsv
```
#### Options ####
- `-r, --resolution RESOLUTION` : output resolution (1, 2, 3) or grouping (g-group, p-group)
- `-o, --outfile FILE` : output file (default: `./run.resolution.tsv`)
- `-f, --force` : force conversion for grouped alleles even if it results in loss of resolution
- `-v, --verbose` : toggle verbosity

### Change reference ###
To update the reference to the latest IMGT/HLA version, run
Expand All @@ -174,12 +189,19 @@ If you suspect there is an issue with the reference files, rebuild the referenc
```
arcasHLA reference --rebuild
```

Note: if your reference was built with arcasHLA version <= 0.1.1 and you wish to change your reference to versions >= 3.35.0, it may be necessary to remove the IMGTHLA folder due to the need for Git Large File Storage to properly download hla.dat.

```
rm -rf dat/IMGTHLA
arcasHLA reference --update
```

#### Options ####
- `--update` : update to latest IMGT/HLA version
- `--version` : checkout IMGT/HLA version using commithash
- `--rebuild` : rebuild HLA database
- `-v, --verbose` : verbosity (default: False)

## Citation ##
R. Orenbuch, I. Filip, D. Comito, J. Shaman, I. Pe’er, and R. Rabadan. arcasHLA:
high resolution HLA typing from RNA seq. bioRxiv doi: [10.1101/479824](https://doi.org/10.1101/479824)
Orenbuch R, Filip I, Comito D, et al (2019) arcasHLA: high resolution HLA typing from RNA seq. Bioinformatics doi:[10.1093/bioinformatics/btz474](http://dx.doi.org/10.1093/bioinformatics/btz474)
9 changes: 7 additions & 2 deletions arcasHLA
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@
#-----------------------------------------------------------------------------
#
# Authors: Rose Orenbuch, Ioan Filip, Itsik Pe'er
# Date: 2019-05-07
# Version: 0.1.1
# Date: 2019-06-26
# Version: 0.2.0
# License: GNU
#
#-----------------------------------------------------------------------------
Expand Down Expand Up @@ -70,6 +70,10 @@ elif [ "$1" == "partial" ]; then

python3 ${ARCASHLA_ROOT_DIR}/scripts/partial.py ${@:2}

elif [ "$1" == "convert" ]; then

python3 ${ARCASHLA_ROOT_DIR}/scripts/convert.py ${@:2}

#-----------------------------------------------------------------------------
# Usage
#-----------------------------------------------------------------------------
Expand All @@ -81,6 +85,7 @@ else
echo " genotype types HLA genes from extracted reads"
echo " partial types partial HLA genes from extracted reads"
echo " merge processes results into a tab-separated table"
echo " convert converts HLA nomenclature/resolution"
echo " reference check or update HLA reference"
echo
echo "Note: run any command with --help to view required fields, options"
Expand Down
Binary file modified dat/info/parameters.p
Binary file not shown.
29 changes: 16 additions & 13 deletions scripts/align.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@
from reference import check_ref, get_exon_combinations
from arcas_utilities import *

__version__ = '0.1.1'
__date__ = '2019-05-07'
__version__ = '0.2.0'
__date__ = '2019-06-26'

#-------------------------------------------------------------------------------
# Paths and filenames
Expand All @@ -69,17 +69,12 @@ def analyze_reads(fqs, paired, reads_file):
log.info('[alignment] Analyzing read length')
if paired:
fq1, fq2 = fqs

command = [cat, '<', fq1, awk, '>' , reads_file]
run_command(command)

command = [cat, '<', fq2, awk, '>>', reads_file]
run_command(command)
run_command([cat, '<', fq1, awk, '>' , reads_file])
run_command([cat, '<', fq2, awk, '>>', reads_file])

else:
fq = fqs[0]
command = [cat, '<', fq, awk, '>', reads_file]
run_command(command)
run_command([cat, '<', fq, awk, '>', reads_file])

read_lengths = np.genfromtxt(reads_file)

Expand All @@ -90,7 +85,6 @@ def analyze_reads(fqs, paired, reads_file):
avg = round(np.mean(read_lengths), 6)
std = round(np.std(read_lengths), 6)


return num, avg, std

def pseudoalign(fqs, sample, paired, reference, outdir, temp, threads):
Expand Down Expand Up @@ -232,6 +226,7 @@ def get_count_stats(eq_idx, gene_length):
return stats

def alignment_summary(align_stats, partial = False):
'''Prints alignment summary to log.'''
count_unique, count_multi, total, _, _ = align_stats
log.info('[alignment] Processed {:.0f} reads, {:.0f} pseudoaligned '
.format(total, count_unique + count_multi)+
Expand All @@ -241,7 +236,7 @@ def alignment_summary(align_stats, partial = False):
.format(count_unique))

def gene_summary(gene_stats):
# Print gene abundance information
'''Prints gene read count and relative abundance to log.'''

log.info('[alignment] Observed HLA genes:')

Expand All @@ -253,7 +248,7 @@ def gene_summary(gene_stats):
.format(g, a*100, c, e))

def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads, partial = False):

'''Runs pseudoalignment and processes output.'''
paired = True if len(fqs) == 2 else False

count_file = ''.join([temp, 'pseudoalignments.tsv'])
Expand All @@ -266,6 +261,8 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
outdir,
temp,
threads)

# Process partial genotyping pseudoalignment
if partial:
(commithash, (gene_set, allele_idx, exon_idx,
lengths, partial_exons, partial_alleles)) = reference_info
Expand All @@ -287,6 +284,7 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
align_stats, []]
pickle.dump(alignment_info, file)

# Process regular pseudoalignment
else:
(commithash,(gene_set, allele_idx,
lengths, gene_length)) = reference_info
Expand All @@ -309,9 +307,14 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
align_stats, gene_stats]
pickle.dump(alignment_info, file)

with open(''.join([outdir,sample,'.genes.json']), 'w') as file:
json.dump(gene_stats, file)

return alignment_info

def load_alignment(file, commithash, partial = False):
'''Loads previous pseudoalignment.'''

log.info(f'[alignment] Loading previous alignment %s', file)

with open(file, 'rb') as file:
Expand Down
5 changes: 3 additions & 2 deletions scripts/arcas_utilities.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@
import uuid
from subprocess import PIPE, STDOUT, run

__version__ = '0.1.1'
__date__ = '2019-05-07'
__version__ = '0.2.0'
__date__ = '2019-06-26'

#-------------------------------------------------------------------------------

Expand Down Expand Up @@ -93,6 +93,7 @@ def run_command(command, message = ''):
return output

def create_temp(temp):
'''Generates name for temporary folder.'''
temp = check_path(temp)
temp_folder = ''.join([temp,'arcas_' + str(uuid.uuid4())])
return check_path(temp_folder)
Expand Down
Loading

0 comments on commit 4881752

Please sign in to comment.