Merge pull request #10 from HKU-BAL/develop

Uppercase sequences
HKU-BAL · Dec 23, 2019 · 90dc1a9 · 90dc1a9
2 parents 8f5bf05 + 4b37852
commit 90dc1a9
Show file tree

Hide file tree

Showing 7 changed files with 68 additions and 23 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,9 @@
+# Changelog
+
+All notable changes to Clair will be documented in this file.
+
+
+## [2.0.6] - 2019-12-23
+
+### Added
+- Haploid mode: option `--haploid` in `callVarBam` and `callVarBamParallel`
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # Clair - Yet another deep neural network based variant caller
-[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/clair/README.html) \
-Contact: Ruibang Luo \
-Email: [email protected]
+[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/clair/README.html)  
+Contact: Ruibang Luo  
+Email: [email protected]  
 
 ## Introduction
 Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling, using single molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed as compared to several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks.
@@ -171,11 +171,13 @@ REFERENCE_FASTA_FILE_PATH="[YOUR_REFERENCE_FASTA_FILE]"  # e.g. chr21.fa
 KNOWN_VARIANTS_VCF="[YOUR_VCF_FILE]"                     # e.g. chr21.vcf
 ```
 
-#### Note
-* Each model has three files `model.data-00000-of-00001`, `model.index`, `model.meta`. For the `MODEL` variable, please use the prefix `model`
+#### Notes
+* Each model has three files `model.data-00000-of-00001`, `model.index`, `model.meta`. Please give the `MODEL` variable, the prefix `model`.
 
 ### Call variants at known variant sites or in a chromosome  (using `callVarBam`)
 
+**For whole genome variant calling, please use `callVarBamParallel` to generate multiple commands that invokes `callVarBam` on smaller chromosome chucks.**
+
 #### Call variants in a chromosome
 ```bash
 # variables
@@ -214,11 +216,6 @@ python $CLAIR callVarBam \
 cd "$VARIANT_CALLING_OUTPUT_PATH"
 ```
 
-#### Notes
-* Use `callVarBam` to either call variant in a single chromosome. For whole genome variant calling, please use `callVarBamParallel` to generate multiple commands that invokes `callVarBam` on smaller chromosome chucks.
-* You may consider using the `--pysam_for_all_indel_bases` option for more accurate results. On Illumina data and PacBio CCS data, the option requires 20% to 50% much running time. On ONT data, Clair can run two times slower, while the improvement in accuracy is not significant.
-* About seting an appropriate allele frequency cutoff, please refer to [About Setting the Alternative Allele Frequency Cutoff](#about-setting-the-alternative-allele-frequency-cutoff)
-
 ### Call whole-genome variants in parallel (using `callVarBamParallel`)
 ```bash
 # variables
@@ -247,16 +244,20 @@ for i in OUTPUT_PREFIX.*.vcf; do if ! [ -z "$(tail -c 1 "$i")" ]; then echo "$i"
 vcfcat ${OUTPUT_PREFIX}.*.vcf | bcftools sort -m 2G | bgziptabix snp_and_indel.vcf.gz
 ```
 
-#### Note
-* `callVarBamParallel` submodule generates `callVarBam` commands that can be run in parallel
-* `parallel -j4` will run four concurrencies in parallel using GNU parallel. We suggest using half the number of available CPU cores (not threads).
-* If [GNU parallel](https://www.gnu.org/software/parallel/) is not installed, please try ```awk '{print "\""$0"\""}' commands.sh | xargs -P4 -L1 sh -c```
-* Incomplete VCF files happens when 'out of memory' or other errors occur. The command in the example finds for a newline at the end of the VCF files, and regenerate the files without a newline at the end.
-* callVarBamParallel will generate commonds for chr{1..22},X,Y, to call variants on all chromosomes, please use option `--includingAllContigs`.
-* If you are working on non-human BAM file (e.g. bacteria), please use `--includingAllContigs` option to include all contigs
-* `CUDA_VISIBLE_DEVICES=""` makes GPUs invisible to Clair so it will use CPU for variant calling. Please notice that unless you want to run `commands.sh` in serial, you cannot use GPU because one running copy of Clair will occupy all available memory of a GPU. While the bottleneck of `callVarBam` is at the `CreateTensor` script, which runs on CPU, the effect of GPU accelerate is insignificant (roughly about 15% faster). But if you have multiple GPU cards in your system, and you want to utilize them in variant calling, you may want split the `commands.sh` in to parts, and run the parts by firstly `export CUDA_VISIBLE_DEVICES="$i"`, where `$i` is an integer from 0 identifying the ID of the GPU to be used.
-* `vcfcat` and `bgziptabix` commands are from [vcflib](https://github.com/vcflib/vcflib), and are installed by default using option 2 (conda) or option 3 (docker).
-* Please also check the notes in the above sections for other considerations.
+#### Notes
+##### Parallelization
+* `callVarBamParallel` generates a file of `callVarBam` commands that can be run in parallel.
+* **Use GNU parallel to run commands in parallel** - `parallel -j4` will run four concurrencies in parallel using GNU parallel. We suggest using half the number of available CPU cores.
+* **An alternative to GNU parallel** - If [GNU parallel](https://www.gnu.org/software/parallel/) is not installed, please try ```awk '{print "\""$0"\""}' commands.sh | xargs -P4 -L1 sh -c```
+##### Options
+* **Haploid Mode** - For haploid samples, please use the `--haploid` option.
+* **Choosing genome sequences and positions for variant calling** - callVarBamParallel by default will generate commonds for chromosome {1..22},X,Y (insensible to the "chr" prefix). To call variants in other sequences, you can either input via option "--ben_fn" your own BED file with three columns including the target sequence names, starting positions and ending positions, or use the option `--includingAllContigs` to include all sequences in the input FASTA file. If you work on a non-human sample, please always use a BED file or the `--includingAllContigs` option to define the sequences you want Clair to work on.
+* **For more accurate Indel calling** - You may consider using the `--pysam_for_all_indel_bases` option for more accurate Indel results. On Illumina data and PacBio CCS data, the option requires 20% to 50% longer running time. On ONT data, Clair can run up to ten times slower, while the improvement in accuracy is not significant.
+##### Other considerations
+* **Setting an appropriate allele frequency cutoff** - Please refer to [About Setting the Alternative Allele Frequency Cutoff](#about-setting-the-alternative-allele-frequency-cutoff)
+* **Check for incomplete (unfinished) VCF files** - Incomplete VCF files happens when 'out of memory' or other errors occur. The command in the example finds for a newline at the end of the VCF files, and regenerate the incomplete files.
+* **Disabling GPU: Clair uses CPU for varaint calling** - To avoid the tensorflow library from using GPU, `CUDA_VISIBLE_DEVICES=""` makes GPUs invisible to Clair so it will only use CPU for variant calling. Please notice that unless you want to run `commands.sh` in serial, you cannot use GPU because one running copy of Clair will occupy all available memory of a GPU. While the bottleneck of `callVarBam` is at the `CreateTensor` script, which only runs on CPU, the effect of GPU accelerate is insignificant (roughly just about 15% faster). But if you have multiple GPU cards in your system, and you want to utilize them in variant calling, you may want split the `commands.sh` in to parts, and run the parts by firstly `export CUDA_VISIBLE_DEVICES="$i"`, where `$i` is an integer from 0 identifying the ID of the GPU to be used.
+* **Concatenating results** - `vcfcat` and `bgziptabix` commands are from [vcflib](https://github.com/vcflib/vcflib), and are installed by default using option 2 (conda) or option 3 (docker).
 
 ---
 

diff --git a/clair/callVarBam.py b/clair/callVarBam.py
@@ -78,7 +78,7 @@ def Run(args):
     dcov = args.dcov
     call_fn = args.call_fn
     af_threshold = args.threshold
-    minCoverage = args.minCoverage
+    minCoverage = int(args.minCoverage)
     sampleName = args.sampleName
     ctgName = args.ctgName
     if ctgName is None:
@@ -87,6 +87,7 @@ def Run(args):
     stop_consider_left_edge = command_option_from(args.stop_consider_left_edge, 'stop_consider_left_edge')
     log_path = command_option_from(args.log_path, 'log_path', option_value=args.log_path)
     pysam_for_all_indel_bases = command_option_from(args.pysam_for_all_indel_bases, 'pysam_for_all_indel_bases')
+    haploid_mode = command_option_from(args.haploid, 'haploid')
     debug = command_option_from(args.debug, 'debug')
     qual = command_option_from(args.qual, 'qual', option_value=args.qual)
     fast_plotting = command_option_from(args.fast_plotting, 'fast_plotting')
@@ -161,6 +162,7 @@ def Run(args):
         CommandOption('threads', numCpus),
         CommandOption('ref_fn', ref_fn),
         pysam_for_all_indel_bases,
+        haploid_mode,
         qual,
         debug
     ]
@@ -293,6 +295,9 @@ def main():
     parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
                         help="Always using pysam for outputting indel bases, optional")
 
+    parser.add_argument('--haploid', action='store_true',
+                        help="call haploid instead of diploid")
+
     parser.add_argument('--activation_only', action='store_true',
                         help="Output activation only, no prediction")
     parser.add_argument('--max_plot', type=int, default=10,

diff --git a/clair/callVarBamParallel.py b/clair/callVarBamParallel.py
@@ -45,6 +45,7 @@ def Run(args):
     stop_consider_left_edge = command_option_from(args.stop_consider_left_edge, 'stop_consider_left_edge')
     log_path = command_option_from(args.log_path, 'log_path', option_value=args.log_path)
     pysam_for_all_indel_bases = command_option_from(args.pysam_for_all_indel_bases, 'pysam_for_all_indel_bases')
+    haploid_mode = command_option_from(args.haploid, 'haploid')
     debug = command_option_from(args.debug, 'debug')
     qual = command_option_from(args.qual, 'qual', option_value=args.qual)
     fast_plotting = command_option_from(args.fast_plotting, 'fast_plotting')
@@ -67,6 +68,7 @@ def Run(args):
         stop_consider_left_edge,
         debug,
         pysam_for_all_indel_bases,
+        haploid_mode,
     ]
 
     activation_only_command_options = [
@@ -174,6 +176,9 @@ def main():
     parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
                         help="Always using pysam for outputting indel bases, optional")
 
+    parser.add_argument('--haploid', action='store_true',
+                        help="call haploid instead of diploid")
+
     parser.add_argument('--activation_only', action='store_true',
                         help="Output activation only, no prediction")
     parser.add_argument('--max_plot', type=int, default=10,

diff --git a/clair/call_var.py b/clair/call_var.py
@@ -34,6 +34,7 @@
 OutputConfig = namedtuple('OutputConfig', [
     'is_show_reference',
     'is_debug',
+    'is_haploid_mode_enabled',
     'quality_score_for_pass',
 ])
 OutputUtilities = namedtuple('OutputUtilities', [
@@ -663,6 +664,7 @@ def output_from(
     genotype_probabilities,
     variant_length_probabilities_1,
     variant_length_probabilities_2,
+    output_config,
     output_utilities,
 ):
     insertion_bases_using, deletion_bases_using, insertion_bases_using_pysam_using = (
@@ -723,6 +725,18 @@ def output_from(
         is_hetero_DelDel = maximum_probability in hetero_DelDel_probabilities
         is_insertion_and_deletion = maximum_probability in hetero_InsDel_probabilities
 
+
+        if output_config.is_haploid_mode_enabled:
+            if (
+                is_hetero_SNP or is_hetero_ACGT_Ins or is_hetero_InsIns or
+                is_hetero_ACGT_Del or is_hetero_DelDel or is_insertion_and_deletion
+            ):
+                return (
+                    (True, False, False, False, False, False, False, False, False, False),
+                    (reference_base_ACGT, reference_base_ACGT)
+                )
+
+
         if is_homo_SNP:
             base1, base2 = homo_SNP_bases_from(gt21_probabilities)
             reference_base = reference_sequence[tensor_position_center]
@@ -977,6 +991,7 @@ def batch_output(mini_batch, batch_Y, output_config, output_utilities):
             genotype_probabilities,
             variant_length_probabilities_1,
             variant_length_probabilities_2,
+            output_config,
             output_utilities,
         )
 
@@ -1150,6 +1165,7 @@ def call_variants(args, m):
     output_config = OutputConfig(
         is_show_reference=args.showRef,
         is_debug=args.debug,
+        is_haploid_mode_enabled=args.haploid,
         quality_score_for_pass=args.qual,
     )
     output_utilities = output_utilties_from(
@@ -1266,6 +1282,9 @@ def main():
     parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
                         help="Always using pysam for outputting indel bases, optional")
 
+    parser.add_argument('--haploid', action='store_true',
+                        help="call haploid instead of diploid")
+
     args = parser.parse_args()
 
     if len(sys.argv[1:]) == 0:

diff --git a/dataPrepScripts/CreateTensor.py b/dataPrepScripts/CreateTensor.py
@@ -146,6 +146,9 @@ def reference_result_from(
             reference_sequences.append(row.rstrip())
     reference_sequence = "".join(reference_sequences)
 
+    # uppercase for masked sequences
+    reference_sequence = reference_sequence.upper()
+
     faidx_process.stdout.close()
     faidx_process.wait()
 
@@ -248,7 +251,7 @@ def OutputAlnTensor(args):
         POS = int(l[3]) - 1  # switch from 1-base to 0-base to match sequence index
         MQ = int(l[4])
         CIGAR = l[5]
-        SEQ = l[9]
+        SEQ = l[9].upper()   # uppercase for SEQ (regexp is \*|[A-Za-z=.]+)
         reference_position = POS
         query_position = 0
         STRAND = (16 == (FLAG & 16))

diff --git a/dataPrepScripts/ExtractVariantCandidates.py b/dataPrepScripts/ExtractVariantCandidates.py
@@ -141,6 +141,9 @@ def reference_sequence_from(samtools_execute_command, fasta_file_path, regions):
     # first line is reference name ">xxxx", need to be ignored
     reference_sequence = "".join(refernce_sequences[1:])
 
+    # uppercase for masked sequences
+    reference_sequence = reference_sequence.upper()
+
     samtools_faidx_process.stdout.close()
     samtools_faidx_process.wait()
     if samtools_faidx_process.returncode != 0:
@@ -268,7 +271,7 @@ def make_candidates(args):
             POS = int(columns[3]) - 1  # switch from 1-base to 0-base to match sequence index
             MAPQ = int(columns[4])
             CIGAR = columns[5]
-            SEQ = columns[9]
+            SEQ = columns[9].upper()   # uppercase for SEQ (regexp is \*|[A-Za-z=.]+)
 
             reference_position = POS
             query_position = 0