Skip to content

Commit

Permalink
Merge pull request #10 from HKU-BAL/develop
Browse files Browse the repository at this point in the history
Uppercase sequences
  • Loading branch information
chaklim authored Dec 23, 2019
2 parents 8f5bf05 + 4b37852 commit 90dc1a9
Show file tree
Hide file tree
Showing 7 changed files with 68 additions and 23 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Changelog

All notable changes to Clair will be documented in this file.


## [2.0.6] - 2019-12-23

### Added
- Haploid mode: option `--haploid` in `callVarBam` and `callVarBamParallel`
41 changes: 21 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# Clair - Yet another deep neural network based variant caller
[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/clair/README.html) \
Contact: Ruibang Luo \
Email: [email protected]
[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/clair/README.html)
Contact: Ruibang Luo
Email: [email protected]

## Introduction
Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling, using single molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed as compared to several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks.
Expand Down Expand Up @@ -171,11 +171,13 @@ REFERENCE_FASTA_FILE_PATH="[YOUR_REFERENCE_FASTA_FILE]" # e.g. chr21.fa
KNOWN_VARIANTS_VCF="[YOUR_VCF_FILE]" # e.g. chr21.vcf
```

#### Note
* Each model has three files `model.data-00000-of-00001`, `model.index`, `model.meta`. For the `MODEL` variable, please use the prefix `model`
#### Notes
* Each model has three files `model.data-00000-of-00001`, `model.index`, `model.meta`. Please give the `MODEL` variable, the prefix `model`.

### Call variants at known variant sites or in a chromosome (using `callVarBam`)

**For whole genome variant calling, please use `callVarBamParallel` to generate multiple commands that invokes `callVarBam` on smaller chromosome chucks.**

#### Call variants in a chromosome
```bash
# variables
Expand Down Expand Up @@ -214,11 +216,6 @@ python $CLAIR callVarBam \
cd "$VARIANT_CALLING_OUTPUT_PATH"
```

#### Notes
* Use `callVarBam` to either call variant in a single chromosome. For whole genome variant calling, please use `callVarBamParallel` to generate multiple commands that invokes `callVarBam` on smaller chromosome chucks.
* You may consider using the `--pysam_for_all_indel_bases` option for more accurate results. On Illumina data and PacBio CCS data, the option requires 20% to 50% much running time. On ONT data, Clair can run two times slower, while the improvement in accuracy is not significant.
* About seting an appropriate allele frequency cutoff, please refer to [About Setting the Alternative Allele Frequency Cutoff](#about-setting-the-alternative-allele-frequency-cutoff)

### Call whole-genome variants in parallel (using `callVarBamParallel`)
```bash
# variables
Expand Down Expand Up @@ -247,16 +244,20 @@ for i in OUTPUT_PREFIX.*.vcf; do if ! [ -z "$(tail -c 1 "$i")" ]; then echo "$i"
vcfcat ${OUTPUT_PREFIX}.*.vcf | bcftools sort -m 2G | bgziptabix snp_and_indel.vcf.gz
```

#### Note
* `callVarBamParallel` submodule generates `callVarBam` commands that can be run in parallel
* `parallel -j4` will run four concurrencies in parallel using GNU parallel. We suggest using half the number of available CPU cores (not threads).
* If [GNU parallel](https://www.gnu.org/software/parallel/) is not installed, please try ```awk '{print "\""$0"\""}' commands.sh | xargs -P4 -L1 sh -c```
* Incomplete VCF files happens when 'out of memory' or other errors occur. The command in the example finds for a newline at the end of the VCF files, and regenerate the files without a newline at the end.
* callVarBamParallel will generate commonds for chr{1..22},X,Y, to call variants on all chromosomes, please use option `--includingAllContigs`.
* If you are working on non-human BAM file (e.g. bacteria), please use `--includingAllContigs` option to include all contigs
* `CUDA_VISIBLE_DEVICES=""` makes GPUs invisible to Clair so it will use CPU for variant calling. Please notice that unless you want to run `commands.sh` in serial, you cannot use GPU because one running copy of Clair will occupy all available memory of a GPU. While the bottleneck of `callVarBam` is at the `CreateTensor` script, which runs on CPU, the effect of GPU accelerate is insignificant (roughly about 15% faster). But if you have multiple GPU cards in your system, and you want to utilize them in variant calling, you may want split the `commands.sh` in to parts, and run the parts by firstly `export CUDA_VISIBLE_DEVICES="$i"`, where `$i` is an integer from 0 identifying the ID of the GPU to be used.
* `vcfcat` and `bgziptabix` commands are from [vcflib](https://github.com/vcflib/vcflib), and are installed by default using option 2 (conda) or option 3 (docker).
* Please also check the notes in the above sections for other considerations.
#### Notes
##### Parallelization
* `callVarBamParallel` generates a file of `callVarBam` commands that can be run in parallel.
* **Use GNU parallel to run commands in parallel** - `parallel -j4` will run four concurrencies in parallel using GNU parallel. We suggest using half the number of available CPU cores.
* **An alternative to GNU parallel** - If [GNU parallel](https://www.gnu.org/software/parallel/) is not installed, please try ```awk '{print "\""$0"\""}' commands.sh | xargs -P4 -L1 sh -c```
##### Options
* **Haploid Mode** - For haploid samples, please use the `--haploid` option.
* **Choosing genome sequences and positions for variant calling** - callVarBamParallel by default will generate commonds for chromosome {1..22},X,Y (insensible to the "chr" prefix). To call variants in other sequences, you can either input via option "--ben_fn" your own BED file with three columns including the target sequence names, starting positions and ending positions, or use the option `--includingAllContigs` to include all sequences in the input FASTA file. If you work on a non-human sample, please always use a BED file or the `--includingAllContigs` option to define the sequences you want Clair to work on.
* **For more accurate Indel calling** - You may consider using the `--pysam_for_all_indel_bases` option for more accurate Indel results. On Illumina data and PacBio CCS data, the option requires 20% to 50% longer running time. On ONT data, Clair can run up to ten times slower, while the improvement in accuracy is not significant.
##### Other considerations
* **Setting an appropriate allele frequency cutoff** - Please refer to [About Setting the Alternative Allele Frequency Cutoff](#about-setting-the-alternative-allele-frequency-cutoff)
* **Check for incomplete (unfinished) VCF files** - Incomplete VCF files happens when 'out of memory' or other errors occur. The command in the example finds for a newline at the end of the VCF files, and regenerate the incomplete files.
* **Disabling GPU: Clair uses CPU for varaint calling** - To avoid the tensorflow library from using GPU, `CUDA_VISIBLE_DEVICES=""` makes GPUs invisible to Clair so it will only use CPU for variant calling. Please notice that unless you want to run `commands.sh` in serial, you cannot use GPU because one running copy of Clair will occupy all available memory of a GPU. While the bottleneck of `callVarBam` is at the `CreateTensor` script, which only runs on CPU, the effect of GPU accelerate is insignificant (roughly just about 15% faster). But if you have multiple GPU cards in your system, and you want to utilize them in variant calling, you may want split the `commands.sh` in to parts, and run the parts by firstly `export CUDA_VISIBLE_DEVICES="$i"`, where `$i` is an integer from 0 identifying the ID of the GPU to be used.
* **Concatenating results** - `vcfcat` and `bgziptabix` commands are from [vcflib](https://github.com/vcflib/vcflib), and are installed by default using option 2 (conda) or option 3 (docker).

---

Expand Down
7 changes: 6 additions & 1 deletion clair/callVarBam.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ def Run(args):
dcov = args.dcov
call_fn = args.call_fn
af_threshold = args.threshold
minCoverage = args.minCoverage
minCoverage = int(args.minCoverage)
sampleName = args.sampleName
ctgName = args.ctgName
if ctgName is None:
Expand All @@ -87,6 +87,7 @@ def Run(args):
stop_consider_left_edge = command_option_from(args.stop_consider_left_edge, 'stop_consider_left_edge')
log_path = command_option_from(args.log_path, 'log_path', option_value=args.log_path)
pysam_for_all_indel_bases = command_option_from(args.pysam_for_all_indel_bases, 'pysam_for_all_indel_bases')
haploid_mode = command_option_from(args.haploid, 'haploid')
debug = command_option_from(args.debug, 'debug')
qual = command_option_from(args.qual, 'qual', option_value=args.qual)
fast_plotting = command_option_from(args.fast_plotting, 'fast_plotting')
Expand Down Expand Up @@ -161,6 +162,7 @@ def Run(args):
CommandOption('threads', numCpus),
CommandOption('ref_fn', ref_fn),
pysam_for_all_indel_bases,
haploid_mode,
qual,
debug
]
Expand Down Expand Up @@ -293,6 +295,9 @@ def main():
parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
help="Always using pysam for outputting indel bases, optional")

parser.add_argument('--haploid', action='store_true',
help="call haploid instead of diploid")

parser.add_argument('--activation_only', action='store_true',
help="Output activation only, no prediction")
parser.add_argument('--max_plot', type=int, default=10,
Expand Down
5 changes: 5 additions & 0 deletions clair/callVarBamParallel.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ def Run(args):
stop_consider_left_edge = command_option_from(args.stop_consider_left_edge, 'stop_consider_left_edge')
log_path = command_option_from(args.log_path, 'log_path', option_value=args.log_path)
pysam_for_all_indel_bases = command_option_from(args.pysam_for_all_indel_bases, 'pysam_for_all_indel_bases')
haploid_mode = command_option_from(args.haploid, 'haploid')
debug = command_option_from(args.debug, 'debug')
qual = command_option_from(args.qual, 'qual', option_value=args.qual)
fast_plotting = command_option_from(args.fast_plotting, 'fast_plotting')
Expand All @@ -67,6 +68,7 @@ def Run(args):
stop_consider_left_edge,
debug,
pysam_for_all_indel_bases,
haploid_mode,
]

activation_only_command_options = [
Expand Down Expand Up @@ -174,6 +176,9 @@ def main():
parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
help="Always using pysam for outputting indel bases, optional")

parser.add_argument('--haploid', action='store_true',
help="call haploid instead of diploid")

parser.add_argument('--activation_only', action='store_true',
help="Output activation only, no prediction")
parser.add_argument('--max_plot', type=int, default=10,
Expand Down
19 changes: 19 additions & 0 deletions clair/call_var.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
OutputConfig = namedtuple('OutputConfig', [
'is_show_reference',
'is_debug',
'is_haploid_mode_enabled',
'quality_score_for_pass',
])
OutputUtilities = namedtuple('OutputUtilities', [
Expand Down Expand Up @@ -663,6 +664,7 @@ def output_from(
genotype_probabilities,
variant_length_probabilities_1,
variant_length_probabilities_2,
output_config,
output_utilities,
):
insertion_bases_using, deletion_bases_using, insertion_bases_using_pysam_using = (
Expand Down Expand Up @@ -723,6 +725,18 @@ def output_from(
is_hetero_DelDel = maximum_probability in hetero_DelDel_probabilities
is_insertion_and_deletion = maximum_probability in hetero_InsDel_probabilities


if output_config.is_haploid_mode_enabled:
if (
is_hetero_SNP or is_hetero_ACGT_Ins or is_hetero_InsIns or
is_hetero_ACGT_Del or is_hetero_DelDel or is_insertion_and_deletion
):
return (
(True, False, False, False, False, False, False, False, False, False),
(reference_base_ACGT, reference_base_ACGT)
)


if is_homo_SNP:
base1, base2 = homo_SNP_bases_from(gt21_probabilities)
reference_base = reference_sequence[tensor_position_center]
Expand Down Expand Up @@ -977,6 +991,7 @@ def batch_output(mini_batch, batch_Y, output_config, output_utilities):
genotype_probabilities,
variant_length_probabilities_1,
variant_length_probabilities_2,
output_config,
output_utilities,
)

Expand Down Expand Up @@ -1150,6 +1165,7 @@ def call_variants(args, m):
output_config = OutputConfig(
is_show_reference=args.showRef,
is_debug=args.debug,
is_haploid_mode_enabled=args.haploid,
quality_score_for_pass=args.qual,
)
output_utilities = output_utilties_from(
Expand Down Expand Up @@ -1266,6 +1282,9 @@ def main():
parser.add_argument('--pysam_for_all_indel_bases', action='store_true',
help="Always using pysam for outputting indel bases, optional")

parser.add_argument('--haploid', action='store_true',
help="call haploid instead of diploid")

args = parser.parse_args()

if len(sys.argv[1:]) == 0:
Expand Down
5 changes: 4 additions & 1 deletion dataPrepScripts/CreateTensor.py
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,9 @@ def reference_result_from(
reference_sequences.append(row.rstrip())
reference_sequence = "".join(reference_sequences)

# uppercase for masked sequences
reference_sequence = reference_sequence.upper()

faidx_process.stdout.close()
faidx_process.wait()

Expand Down Expand Up @@ -248,7 +251,7 @@ def OutputAlnTensor(args):
POS = int(l[3]) - 1 # switch from 1-base to 0-base to match sequence index
MQ = int(l[4])
CIGAR = l[5]
SEQ = l[9]
SEQ = l[9].upper() # uppercase for SEQ (regexp is \*|[A-Za-z=.]+)
reference_position = POS
query_position = 0
STRAND = (16 == (FLAG & 16))
Expand Down
5 changes: 4 additions & 1 deletion dataPrepScripts/ExtractVariantCandidates.py
Original file line number Diff line number Diff line change
Expand Up @@ -141,6 +141,9 @@ def reference_sequence_from(samtools_execute_command, fasta_file_path, regions):
# first line is reference name ">xxxx", need to be ignored
reference_sequence = "".join(refernce_sequences[1:])

# uppercase for masked sequences
reference_sequence = reference_sequence.upper()

samtools_faidx_process.stdout.close()
samtools_faidx_process.wait()
if samtools_faidx_process.returncode != 0:
Expand Down Expand Up @@ -268,7 +271,7 @@ def make_candidates(args):
POS = int(columns[3]) - 1 # switch from 1-base to 0-base to match sequence index
MAPQ = int(columns[4])
CIGAR = columns[5]
SEQ = columns[9]
SEQ = columns[9].upper() # uppercase for SEQ (regexp is \*|[A-Za-z=.]+)

reference_position = POS
query_position = 0
Expand Down

0 comments on commit 90dc1a9

Please sign in to comment.