Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package the tool. Support gatk4 from conda. Add --genome option. #12

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
build
dist
*.egg*
36 changes: 22 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,37 +30,44 @@ To create these files please follow: http://gatkforums.broadinstitute.org/gatk/d

# Manual


**Dependencies:**

* python 2.7 or higher : [www.python.org](https://www.python.org/)
* numpy 1.7.0 or higher : [www.numpy.org](http://www.numpy.org/)
* scipy 0.14.0 or higher : [www.scipy.org](http://www.scipy.org/)
* GATK 2.3 or higher : [www.broadinstitute.org/gatk/download](http://www.broadinstitute.org/gatk/download/)
* GATK 3 or higher : [www.broadinstitute.org/gatk/download](http://www.broadinstitute.org/gatk/download/)
* java : [http://java.com](http://java.com/en/download/)

**Installation:**

Use pip to install the package:
```
pip install .
```
If you are getting permission issues, you may want to install into a [virtual environment](https://docs.python.org/3/tutorial/venv.html).

**Setting environmental variables:**
To use Conpair you need to set 2 environmental variables and a PYTHONPATH variable (e.g. by adding following lines to your .bashrc file):
To set up GATK, you should either set the environment variable:
```
export CONPAIR_DIR=/your/path/to/CONPAIR
export GATK_JAR=/your/path/to/GenomeAnalysisTK.jar
```

export PYTHONPATH=${PYTHONPATH}:/your/path/to/CONPAIR/modules/
Or install gatk4 through [bioconda](https://bioconda.github.io/):
```
conda install -c bioconda gatk4
```

**Default reference genome:**

To avoid specifying the reference file every time you run Conpair, please make sure that you have the following files in the specified directory:
`/your/path/to/CONPAIR/data/genomes/human_g1k_v37.fa`
`/your/path/to/CONPAIR/data/genomes/human_g1k_v37.fa.fai`
`/your/path/to/CONPAIR/data/genomes/human_g1k_v37.dict`
To avoid specifying the reference file every time you run Conpair, you can put/link the following files in the specified directory:
`/your/path/to/CONPAIR/genomes/human_g1k_v37.fa`
`/your/path/to/CONPAIR/genomes/human_g1k_v37.fa.fai`
`/your/path/to/CONPAIR/genomes/human_g1k_v37.dict`
<br/>

**Most common usage and additional options:**
To generate pileups (GATK required):
```

${CONPAIR_DIR}/scripts/run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
${CONPAIR_DIR}/scripts/run_gatk_pileup_for_sample.py -B NORMAL_bam -O NORMAL_pileup
run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
run_gatk_pileup_for_sample.py -B NORMAL_bam -O NORMAL_pileup


Optional:
Expand Down Expand Up @@ -105,6 +112,7 @@ Optional:
--grid GRID grid interval [default: 0.01]

```

# Output files
**Pileup**
An example of a pileup file (10 first lines) can be viewed here: ([`pileup.txt`](https://github.com/nygenome/Conpair/blob/master/data/example/pileup/NA12878_normal40x.gatk.pileup.10lines.txt)).
Expand Down
21 changes: 13 additions & 8 deletions modules/ContaminationMarker.py → conpair/ContaminationMarker.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@

import os
import itertools
from Genotypes import *
from conpair.Genotypes import *
from collections import OrderedDict


Expand Down Expand Up @@ -63,13 +63,16 @@ def __init__(self, chrom, pos, ref, qual_A, qual_C, qual_G, qual_T):

def parse_mpileup_line(line, min_map_quality=0, min_base_quality=0):

line = line.split()
chrom = line[0]
pos = line[1]
ref = line[2]
bases = line[3]
baseQs = baseQ2int(line[4])

try:
line = line.split()
chrom = line[0]
pos = line[1]
ref = line[2]
bases = line[3]
baseQs = baseQ2int(line[4])
except:
return None

if min_map_quality > 0:
verbose_lines = line[6].split(',')
mapqs = [v.split('@')[-1] for v in verbose_lines]
Expand Down Expand Up @@ -104,6 +107,8 @@ def genotype_likelihoods_for_markers(Markers, mpileup_file, min_map_quality=0, m
if line.startswith("[REDUCE RESULT]"):
continue
pileup = parse_mpileup_line(line, min_map_quality=min_map_quality, min_base_quality=min_base_quality)
if not pileup:
continue
try:
marker = Markers[pileup.chrom + ":" + pileup.pos]
except:
Expand Down
Binary file added conpair/ContaminationMarker.pyc
Binary file not shown.
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
import os
import numpy as np
from collections import defaultdict
from MathOperations import *
from conpair.MathOperations import *
from math import log10
import scipy
from scipy import stats
Expand Down
Binary file added conpair/ContaminationModel.pyc
Binary file not shown.
File renamed without changes.
Binary file added conpair/Genotypes.pyc
Binary file not shown.
File renamed without changes.
Binary file added conpair/MathOperations.pyc
Binary file not shown.
46 changes: 46 additions & 0 deletions conpair/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
import os
import sys
import glob


def which(program):
""" returns the path to an executable or None if it can't be found"""
def is_exe(_fpath):
return os.path.isfile(_fpath) and os.access(_fpath, os.X_OK)

fpath, fname = os.path.split(program)
if fpath:
if is_exe(program):
return program
else:
for path in os.environ['PATH'].split(os.pathsep):
exe_file = os.path.join(path, program)
if is_exe(exe_file):
return exe_file
return None


def find_markers_file(opts, ext='.bed'):
libs_dir = os.path.dirname(__file__)
markers_dir = os.path.join(libs_dir, 'markers')

if opts.markers:
markers_file = opts.markers
else:
genome = opts.genome
if genome in ['GRCh37', 'hg19']:
markers_file = os.path.join(markers_dir, 'GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8' + ext)
elif genome in ['GRCh38', 'hg38']:
markers_file = os.path.join(markers_dir, 'GRCh38.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.liftover' + ext)
elif genome in ['GRCm38', 'mm10', 'mouse']:
markers_file = os.path.join(markers_dir, 'SureSelect_Mouse_All_Exon_V1_GRCm38_markers_MAF_0.4_LD_0.8' + ext)
else:
print('Genome ' + opts.genome + ' is not recognized. Available: GRCh37, GRCh38, GRCm38')
sys.exit(2)

if not os.path.exists(markers_file):
print('ERROR: Marker file {0} cannot be found.'.format(markers_file))
sys.exit(2)

return markers_file

Binary file added conpair/__init__.pyc
Binary file not shown.
26 changes: 5 additions & 21 deletions scripts/estimate_tumor_normal_contamination.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,11 @@
import sys
import os
import optparse
import imp
from collections import defaultdict
import numpy as np
from math import pow
from conpair import ContaminationModel, ContaminationMarker, MathOperations, Genotypes
from conpair import find_markers_file


HOMOZYGOUS_P_VALUE_THRESHOLD = 0.999
Expand All @@ -25,24 +26,14 @@
parser = optparse.OptionParser(version='%prog version 1.0 March/01/2016', description=desc)
parser.add_option('-T', '--tumor_pileup', help='TUMOR PILEUP FILE [mandatory field]', type='string', action='store')
parser.add_option('-N', '--normal_pileup', help='NORMAL PILEUP FILE [mandatory field]', type='string', action='store')
parser.add_option('-D', '--conpair_dir', help='CONPAIR DIR [default: $CONPAIR_DIR]', action='store')
parser.add_option('-M', '--markers', help='MARKER FILE [default: markers for GRCh37 from $CONPAIR_DIR/data/markers/ ]', type='string', action='store')
parser.add_option('-g', '--genome', help='Instead of the marker file path, you can specify the genome build name (GRCh37, GRCh38, GRCm38) [default: GRCh37]', default='GRCh37', action='store')
parser.add_option('-O', '--outfile', help='TXT OUTPUT FILE [default: stdout]', default="-", type='string', action='store')
parser.add_option('-G', '--grid', help='GRID INTERVAL [default: 0.01]', type='float', default=0.01, action='store')
parser.add_option('-Q', '--min_mapping_quality', help='MIN MAPPING QUALITY [default: 10]', default=10, type='int', action='store')

(opts, args) = parser.parse_args()

if opts.conpair_dir:
CONPAIR_DIR = opts.conpair_dir
else:
CONPAIR_DIR = os.environ['CONPAIR_DIR']

ContaminationModel = imp.load_source('/ContaminationModel', CONPAIR_DIR + '/modules/ContaminationModel.py')
ContaminationMarker = imp.load_source('/ContaminationMarker', CONPAIR_DIR + '/modules/ContaminationMarker.py')
MathOperations = imp.load_source('/MathOperations', CONPAIR_DIR + '/modules/MathOperations.py')
Genotypes = imp.load_source('/Genotypes', CONPAIR_DIR + '/modules/Genotypes.py')

if not opts.tumor_pileup or not opts.normal_pileup:
parser.print_help()
sys.exit(1)
Expand All @@ -54,15 +45,8 @@
if not os.path.exists(opts.normal_pileup):
print('ERROR: Input normal file {0} cannot be find.'.format(opts.normal_pileup))
sys.exit(1)

if opts.markers:
MARKER_FILE = opts.markers
else:
MARKER_FILE = os.path.join(CONPAIR_DIR, 'data', 'markers', 'GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.txt')

if not os.path.exists(MARKER_FILE):
print('ERROR: Marker file {0} cannot be find.'.format(MARKER_FILE))
sys.exit(2)
markers_file = find_markers_file(opts, '.txt')

grid_precision = opts.grid
MMQ = opts.min_mapping_quality
Expand All @@ -73,7 +57,7 @@ def drange(start, stop, step):
yield r
r += step

Markers = ContaminationMarker.get_markers(MARKER_FILE)
Markers = ContaminationMarker.get_markers(markers_file)

Normal_homozygous_genotype = defaultdict(lambda: defaultdict())

Expand Down
61 changes: 35 additions & 26 deletions scripts/run_gatk_pileup_for_sample.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
import optparse
from shutil import move
from tempfile import NamedTemporaryFile
from conpair import which, find_markers_file


desc = """Program to run GATK Pileup on a single sample"""
Expand All @@ -24,7 +25,9 @@
parser.add_option('-D', '--conpair_dir', help='CONPAIR DIR [$CONPAIR_DIR by default]', action='store')
parser.add_option('-R', '--reference', help='REFERENCE GENOME [GRCh37 by default]', action='store')
parser.add_option('-M', '--markers', help='MARKER FILE [GRCh37-default]', action='store')
parser.add_option('-G', '--gatk', help='GATK JAR [$GATK by default]', action='store')
parser.add_option('-g', '--genome', help='Instead of the marker file path, you can specify the genome build name (GRCh37, GRCh38, GRCm38) [default: GRCh37]', default='GRCh37', action='store')
parser.add_option('-G', '--gatk', help='GATK JAR [by default, will check if gatk executable as available in PATH assuming it\'s GATK4 installed with bioconda. '
'Otherwise, will assume GATK JAR is in $GATK_JAR environment variable]', action='store')
parser.add_option('-J', '--java', help='PATH to JAVA [java by default]', default='java', action='store')
parser.add_option('-t', '--temp_dir_java', help='temporary directory to set -Djava.io.tmpdir', action='store')
parser.add_option('-m', '--xmx_java', help='Xmx java memory setting [default: 12g]', default='12g', action='store')
Expand All @@ -41,37 +44,39 @@
print('ERROR: Specified bamfile {0} cannot be found.'.format(opts.bam))
sys.exit(1)

if opts.conpair_dir:
CONPAIR_DIR = opts.conpair_dir
else:
CONPAIR_DIR = os.environ['CONPAIR_DIR']

GATK = None
GATK4 = None
if opts.gatk:
GATK = opts.gatk
else:
elif 'GATK_JAR' in os.environ:
GATK = os.environ['GATK_JAR']
elif which('gatk'):
GATK4 = 'gatk'
else:
print('ERROR: cannot find GATK. Try either installing GATK4 with conda (conda install -c bioconda gatk4), or provide GATK2 or GATK3 JAR with `--gatk` or GATK_JAR environment variable')
sys.exit(2)

if not os.path.exists(GATK):
if GATK and not os.path.exists(GATK):
print('ERROR: GATK jar {0} cannot be find.'.format(GATK))
sys.exit(2)

if opts.markers:
MARKER_FILE = opts.markers
else:
MARKER_FILE = os.path.join(CONPAIR_DIR, 'data', 'markers', 'GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.bed')

if not os.path.exists(MARKER_FILE):
print('ERROR: Marker file {0} cannot be find.'.format(MARKER_FILE))
sys.exit(2)
markers_file = find_markers_file(opts, '.bed')

if opts.reference:
REFERENCE = opts.reference
reference_fa = opts.reference
if not os.path.exists(reference_fa):
print('ERROR: Reference genome {0} cannot be find.'.format(reference_fa))
sys.exit(3)
else:
REFERENCE = os.path.join(CONPAIR_DIR, 'data', 'genomes', 'human_g1k_v37.fa')

if not os.path.exists(REFERENCE):
print('ERROR: Reference genome {0} cannot be find.'.format(REFERENCE))
sys.exit(3)
conpair_dir = os.environ.get('CONPAIR_DIR', opts.conpair_dir)
if conpair_dir:
reference_fa = os.path.join(conpair_dir, 'genomes', 'human_g1k_v37.fa')
if not os.path.exists(reference_fa):
print('ERROR: Please provide reference fasta file with --reference, or put it into ' + reference_fa)
sys.exit(3)
else:
print('ERROR: Please provide reference fasta file with --reference')
sys.exit(3)

if opts.temp_dir_java:
JAVA_TEMP = "-Djava.io.tmpdir={}".format(opts.temp_dir_java)
Expand All @@ -80,10 +85,14 @@
else:
JAVA_TEMP = ""

command_line = ("{7} {0} -Xmx{1} -jar {2} -T Pileup -R {3} -I {4} -L {5} -o {6} " +
"-verbose -rf DuplicateRead --filter_reads_with_N_cigar " +
"--filter_mismatching_base_and_quals").format(JAVA_TEMP, opts.xmx_java, GATK, REFERENCE, opts.bam, MARKER_FILE, opts.outfile, opts.java)

if GATK4:
command_line = ("{GATK4} --java-options '{JAVA_TEMP} -Xmx{opts.xmx_java}' Pileup -R {reference_fa} -I {opts.bam} -L {markers_file} -O {opts.outfile} " +
"-verbose -RF NotDuplicateReadFilter -RF CigarContainsNoNOperator " +
"-RF MatchingBasesAndQualsReadFilter").format(**locals())
else:
command_line = ("{opts.java} {JAVA_TEMP} -Xmx{opts.xmx_java} -jar {GATK} -T Pileup -R {reference_fa} -I {opts.bam} -L {markers_file} -o {opts.outfile} " +
"-verbose -rf DuplicateRead --filter_reads_with_N_cigar " +
"--filter_mismatching_base_and_quals").format(**locals())
os.system(command_line)


Expand Down
17 changes: 6 additions & 11 deletions scripts/verify_concordance.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,17 @@
import optparse
import math
from collections import defaultdict
from ContaminationModel import *
from ContaminationMarker import *
from conpair.ContaminationModel import *
from conpair.ContaminationMarker import *
from conpair import find_markers_file

CONPAIR_DIR = os.environ['CONPAIR_DIR']
MARKER_FILE = os.path.join(CONPAIR_DIR, 'data', 'markers', 'GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.txt')

desc = """Program to verify tumor-normal sample concordance"""
parser = optparse.OptionParser(version='%prog version 0.15 3/August/2016', description=desc)
parser.add_option('-T', '--tumor_pileup', help='TUMOR PILEUP FILE [mandatory field]', action='store')
parser.add_option('-N', '--normal_pileup', help='NORMAL PILEUP FILE [mandatory field]', action='store')
parser.add_option('-M', '--markers', help='MARKER FILE [Conpair-GRCh37-default]', action='store')
parser.add_option('-g', '--genome', help='Instead of the marker file path, you can specify the genome build name (GRCh37, GRCh38, GRCm38) [default: GRCh37]', default='GRCh37', action='store')
parser.add_option('-C', '--min_cov', help='MIN COVERAGE TO CALL GENOTYPE [default: 10]', default=10, type='int', action='store')
parser.add_option('-O', '--outfile', help='TXT OUTPUT FILE [stdout by default]', default="-", type='string', action='store')
parser.add_option('-Q', '--min_mapping_quality', help='MIN MAPPING QUALITY [default: 10]', default=10, type='int', action='store')
Expand All @@ -45,14 +45,9 @@
print('ERROR: Input normal file {0} cannot be found.'.format(opts.normal_pileup))
sys.exit(1)

if opts.markers:
MARKER_FILE = opts.markers
markers_file = find_markers_file(opts, '.txt')

if not os.path.exists(MARKER_FILE):
print('ERROR: Marker file {0} cannot be find.'.format(MARKER_FILE))
sys.exit(2)

Markers = get_markers(MARKER_FILE)
Markers = get_markers(markers_file)
COVERAGE_THRESHOLD = opts.min_cov
MMQ = opts.min_mapping_quality
MBQ = opts.min_base_quality
Expand Down
Loading