Merge pull request #18 from RabadanLab/dev

Dev
RabadanLab · Sep 12, 2019 · 4881752 · 4881752
2 parents 450e12c + 301085e
commit 4881752
Show file tree

Hide file tree

Showing 11 changed files with 461 additions and 102 deletions.
diff --git a/README.md b/README.md
@@ -3,6 +3,10 @@
 arcasHLA performs high resolution genotyping for HLA class I and class II genes from RNA sequencing, supporting both paired and single-end samples.
 
 ### Dependencies ###
+arcasHLA requires the following utilities:
+- [Git Large File Storage](https://github.com/git-lfs/git-lfs/wiki/Installation)
+- coreutils
+
 Make sure the following programs are in your `PATH`:
 - [Samtools v1.19](http://www.htslib.org/)
 - [bedtools v2.27.1](http://bedtools.readthedocs.io/)
@@ -16,8 +20,6 @@ arcasHLA requires the following Python modules:
 - SciPy
 - Pandas
 
-arcasHLA also requires coreutils.
-
 ### Test ###
 In order to test arcasHLA partial typing, we need to roll back the reference to an earlier version. First, fetch IMGT/HLA database version 3.24.0:
 ```
@@ -55,16 +57,10 @@ Expected output in `test/output/test.partial_genotype.json`:
  "DQB1": ["DQB1*06:04:01", "DQB1*02:02:01"],
  "DRB1": ["DRB1*03:02:01", "DRB1*14:02:01"]}
 ```
-Before further usage, remember to update to the 3.34.0.
-```
-./arcasHLA reference --version 3.34.0
+Remember to update the HLA reference using the following command.
 ```
-At this time, cloning the latest version of IMGTHLA (3.35.0) results in a corrupted database due to an issue with GitHub's Large File Storage. To update the arcasHLA's reference to the current version, run the following commands.
+./arcasHLA reference --update
 ```
-curl https://media.githubusercontent.com/media/ANHIG/IMGTHLA/Latest/hla.dat > dat/IMGTHLA/hla.dat
-./arcasHLA reference --rebuild --v
-```
-
 
 ### Usage ###
 
@@ -124,12 +120,13 @@ arcasHLA genotype [options] /path/to/sample.alignment.p
 #### Options ####
 - `-g, --genes GENES`       : comma separated list of HLA genes (ex. A,B,C,DQA1,DQB1,DRB1)
 - `-p, --population POPULATION`  : sample population, options are asian_pacific_islander, black, caucasian, hispanic, native_american and prior (default: Prior)
+- `--min_count INT`   : minimum gene read count required for genotyping (default: 75)
 - `--tolerance FLOAT` : convergence tolerance for transcript quantification (default: 10e-7)
 - `--max_iterations INT` : maximmum number of iterations for transcript quantification (default: 1000)
-- `--drop_iterations INT` : number of iterations before dropping low support alleles, a lower number of iterations is recommended for single-end and low read couunt samples (default: paired - 10, single - 4)
+- `--drop_iterations INT` : number of iterations before dropping low support alleles, a lower number of iterations is recommended for single-end and low read count samples (default: paired - 10, single - 4)
 - `--drop_threshold FLOAT` : proportion of maximum abundance an allele needs to not be dropped (default: 0.1)
 - `--zygosity_threshold FLOAT` : threshold for ratio of minor to major allele nonshared count to determine zygosity (default: 0.15)
-- `--log FILE`        : log file for run summary (default: sample.genotype.log)                                                        
+- `--log FILE`        : log file for run summary (default: `sample.genotype.log`)                                                        
 - `--o, --outdir DIR` : output directory (default: `.`)                                                                               
 - `--temp DIR`        : temp directory (default: `/tmp`)                                                                              
 - `--keep_files`      : keep intermediate files (default: False)                                                                      
@@ -148,15 +145,33 @@ Output: `sample.partial_alignment.p`, `sample.partial_genotype.json`
 The options for partial typing are the same as genotype. Partial typing can be run from the intermediate alignment file.
 
 ### Merge jsons ###
-To make analysis easier, this command will merge all jsons produced by genotyping. All `.genotype.json` files will be merged into a single `run.genotypes.json` file and all `.partial_genotype.json` files will be merged into `run.partial_genotypes.json`.
+To make analysis easier, this command will merge all jsons produced by genotyping into a single table. All `.genotype.json` files will be merged into a single `run.genotypes.tsv` file and all `.partial_genotype.json` files will be merged into `run.partial_genotypes.tsv`. In addition, HLA locus read counts and relative abundance produced by alignment will be merged into a single tsv file.
 ```
 arcasHLA merge [options]
 ```
 #### Options ####
 - `--run RUN` : run name
 - `--i, --indir DIR` : input directory (default: `.`)     
 - `--o, --outdir DIR` : output directory (default: `.`)                                                                  
-- `-v, --verbose`     : verbosity (default: False)   
+- `-v, --verbose`     : toggle verbosity
+
+### Convert HLA nomenclature ###
+arcasHLA convert changes alleles in a tsv file from its input form to a specified grouped nomenclature (P-group or G-group) or a specified number of fields (i.e. 1, 2 or 3 fields in resolution). This file can be produced by arcasHLA merge or any tsv following the same structure:
+
+| subject      	| A1         	| A2         	| B1         	| B2         	| C1         	| C2         	|
+|--------------	|------------	|------------	|------------	|------------	|------------	|------------	|
+| subject_name 	| A*01:01:01 	| A*01:01:01 	| B*07:02:01 	| B*07:02:01 	| C*04:01:01 	| C*04:01:01 	|
+
+P-group (alleles sharing the same amino acid sequence in the antigen-binding region) and G-group (alleles sharing the same base sequence in the antigen-binding region) can only be reduced to 1-field resolution as alleles with differing 2nd fields can be in the same group. By the same reasoning, P-group cannot be converted into G-group.
+
+```
+arcasHLA convert --resolution [resolution] genotypes.tsv
+```
+#### Options ####
+- `-r, --resolution RESOLUTION` : output resolution (1, 2, 3) or grouping (g-group, p-group)
+- `-o, --outfile FILE` : output file (default: `./run.resolution.tsv`)
+- `-f, --force` : force conversion for grouped alleles even if it results in loss of resolution
+- `-v, --verbose`     : toggle verbosity
 
 ### Change reference ###
 To update the reference to the latest IMGT/HLA version, run
@@ -174,12 +189,19 @@ If you suspect there is an issue  with the reference files, rebuild the referenc
 ```
 arcasHLA reference --rebuild
 ```
+
+Note: if your reference was built with arcasHLA version <= 0.1.1 and you wish to change your reference to versions >= 3.35.0, it may be necessary to remove the IMGTHLA folder due to the need for Git Large File Storage to properly download hla.dat.
+
+```
+rm -rf dat/IMGTHLA
+arcasHLA reference --update
+```
+
 #### Options ####
 - `--update` : update to latest IMGT/HLA version
 - `--version` : checkout IMGT/HLA version using commithash
 - `--rebuild` : rebuild HLA database
 - `-v, --verbose`     : verbosity (default: False)   
 
 ## Citation ##
-R. Orenbuch, I. Filip, D. Comito, J. Shaman, I. Pe’er, and R. Rabadan. arcasHLA:
-high resolution HLA typing from RNA seq. bioRxiv doi: [10.1101/479824](https://doi.org/10.1101/479824)
+Orenbuch R, Filip I, Comito D, et al (2019) arcasHLA: high resolution HLA typing from RNA seq. Bioinformatics doi:[10.1093/bioinformatics/btz474](http://dx.doi.org/10.1093/bioinformatics/btz474)
diff --git a/arcasHLA b/arcasHLA
@@ -4,8 +4,8 @@
 #-----------------------------------------------------------------------------
 #
 #   Authors:    Rose Orenbuch, Ioan Filip, Itsik Pe'er
-#   Date:       2019-05-07
-#   Version:    0.1.1
+#   Date:       2019-06-26
+#   Version:    0.2.0
 #   License:    GNU
 #
 #-----------------------------------------------------------------------------
@@ -70,6 +70,10 @@ elif [ "$1" == "partial" ]; then
 
     python3 ${ARCASHLA_ROOT_DIR}/scripts/partial.py ${@:2}
 
+elif [ "$1" == "convert" ]; then
+
+    python3 ${ARCASHLA_ROOT_DIR}/scripts/convert.py ${@:2}
+
 #-----------------------------------------------------------------------------
 #   Usage
 #-----------------------------------------------------------------------------
@@ -81,6 +85,7 @@ else
     echo "    genotype       types HLA genes from extracted reads"
     echo "    partial        types partial HLA genes from extracted reads"
     echo "    merge          processes results into a tab-separated table"
+    echo "    convert        converts HLA nomenclature/resolution"
     echo "    reference      check or update HLA reference"
     echo 
     echo "Note: run any command with --help to view required fields, options"

diff --git a/dat/info/parameters.p b/dat/info/parameters.p
diff --git a/scripts/align.py b/scripts/align.py
@@ -43,8 +43,8 @@
 from reference import check_ref, get_exon_combinations
 from arcas_utilities import *
 
-__version__     = '0.1.1'
-__date__        = '2019-05-07'
+__version__     = '0.2.0'
+__date__        = '2019-06-26'
 
 #-------------------------------------------------------------------------------
 #   Paths and filenames
@@ -69,17 +69,12 @@ def analyze_reads(fqs, paired, reads_file):
     log.info('[alignment] Analyzing read length')
     if paired:
         fq1, fq2 = fqs
-
-        command = [cat, '<', fq1, awk, '>' , reads_file]
-        run_command(command)
-
-        command = [cat, '<', fq2, awk, '>>', reads_file]
-        run_command(command)
+        run_command([cat, '<', fq1, awk, '>' , reads_file])
+        run_command([cat, '<', fq2, awk, '>>', reads_file])
 
     else:
         fq = fqs[0]
-        command = [cat, '<', fq, awk, '>', reads_file]
-        run_command(command)
+        run_command([cat, '<', fq, awk, '>', reads_file])
 
     read_lengths = np.genfromtxt(reads_file)
 
@@ -90,7 +85,6 @@ def analyze_reads(fqs, paired, reads_file):
     avg = round(np.mean(read_lengths), 6)
     std = round(np.std(read_lengths), 6)
 
-
     return num, avg, std
 
 def pseudoalign(fqs, sample, paired, reference, outdir, temp, threads):
@@ -232,6 +226,7 @@ def get_count_stats(eq_idx, gene_length):
     return stats
 
 def alignment_summary(align_stats, partial = False):
+    '''Prints alignment summary to log.'''
     count_unique, count_multi, total, _, _ = align_stats
     log.info('[alignment] Processed {:.0f} reads, {:.0f} pseudoaligned '
              .format(total, count_unique + count_multi)+
@@ -241,7 +236,7 @@ def alignment_summary(align_stats, partial = False):
              .format(count_unique))
 
 def gene_summary(gene_stats):
-    # Print gene abundance information
+    '''Prints gene read count and relative abundance to log.'''
 
     log.info('[alignment] Observed HLA genes:')
 
@@ -253,7 +248,7 @@ def gene_summary(gene_stats):
                  .format(g, a*100, c, e))
 
 def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads, partial = False):
-
+    '''Runs pseudoalignment and processes output.'''
     paired = True if len(fqs) == 2 else False
 
     count_file = ''.join([temp, 'pseudoalignments.tsv'])
@@ -266,6 +261,8 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
                                  outdir, 
                                  temp,
                                  threads)
+
+    # Process partial genotyping pseudoalignment
     if partial:
         (commithash, (gene_set, allele_idx, exon_idx, 
             lengths, partial_exons, partial_alleles)) = reference_info
@@ -287,6 +284,7 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
                               align_stats, []]
             pickle.dump(alignment_info, file)
 
+    # Process regular pseudoalignment
     else:
         (commithash,(gene_set, allele_idx, 
              lengths, gene_length)) = reference_info
@@ -309,9 +307,14 @@ def get_alignment(fqs, sample, reference, reference_info, outdir, temp, threads,
                               align_stats, gene_stats]
             pickle.dump(alignment_info, file)
 
+        with open(''.join([outdir,sample,'.genes.json']), 'w') as file:
+            json.dump(gene_stats, file)
+
     return alignment_info
 
 def load_alignment(file, commithash, partial = False):
+    '''Loads previous pseudoalignment.'''
+
     log.info(f'[alignment] Loading previous alignment %s', file)
 
     with open(file, 'rb') as file:

diff --git a/scripts/arcas_utilities.py b/scripts/arcas_utilities.py
@@ -28,8 +28,8 @@
 import uuid
 from subprocess import PIPE, STDOUT, run
 
-__version__     = '0.1.1'
-__date__        = '2019-05-07'
+__version__     = '0.2.0'
+__date__        = '2019-06-26'
 
 #-------------------------------------------------------------------------------
 
@@ -93,6 +93,7 @@ def run_command(command, message = ''):
     return output
 
 def create_temp(temp):
+    '''Generates name for temporary folder.'''
     temp = check_path(temp)
     temp_folder = ''.join([temp,'arcas_' + str(uuid.uuid4())])
     return check_path(temp_folder)