denovo_scoring.py extension - failure to read sample IDs in VCF #292

McClatcheyM · 2022-09-05T18:47:18Z

I am currently working with manta/1.5.0 for SV calling of short read human WGS data - accessible to me as Illumina Isaac-aligned BAMs - for family-based (trio) calling. I am working in a (remote) trusted research environment in which manta has been installed and implemented by root administrators. For data security reasons, I cannot provide screenshots or verbatim output.

I am keen to generate a benchmark set of denovo SVs in my data. I've been working on a single test trio in a cohort ~120
After performing the configuration and execution steps with the three family bams as input. Absolute filepaths for bams were given in the config step.

I have subsequently attempted to run denovo_scoring.py on the output diploidSV.vcf.gz VCF as:
python $MANTA_INSTALL_FOLDER/libexec/denovo_scoring.py <proband_id> <father_id> <mother_id> as per README.md

The job terminates spontaneously and with the .stderr output of:
The sample ID <proband_id>,<father_id>,<mother_id> does not exist in the vcf. Program exits

In diploidSV.vcf.gz, the sample IDs appear have been correctly parsed and appear in the final row of the VCF as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <PROBAND_ID> <FATHER_ID> <MOTHER_ID>

The IDs are correct and appear in the same order as the input. They are in a common format - "AB1234-CDE_F56", indicating fixed positions of alpha-numerical and special characters respectively. The parsed IDs in the header are free of leading file paths or file extension.

I'm not sure why the script is not detecting these as the correct sample ids, as they are clearly present.
Any advice from original engineers or community would be appreciated.

The text was updated successfully, but these errors were encountered:

Kmanjor · 2022-12-15T07:12:29Z

I encountered the same error, unzipped the diploidSV.vcf.gz file and ran successfully.

yr542 · 2023-03-29T19:58:21Z

Could you show the code you used to run denovo_scoring.py?

Kmanjor · 2023-03-29T19:58:43Z

您好，我是李桐，已收到您的邮件，祝您天天开心，工作愉快。This is an automatic reply, confirming that your e-mail was received. Thank you.

Kmanjor · 2023-03-30T17:27:34Z

#!/usr/bin/env python2

Manta - Structural Variant and Indel Caller

Copyright (c) 2013-2019 Illumina, Inc.

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU General Public License as published by

the Free Software Foundation, either version 3 of the License, or

at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU General Public License for more details.

You should have received a copy of the GNU General Public License

along with this program. If not, see http://www.gnu.org/licenses/.

import sys
from os import path
from os.path import exists, abspath, dirname, basename, splitext, join

def check_genotype(probandGT, fatherGT, motherGT):
isConsistent = False

fatherGTItems = fatherGT.split('/')
motherGTItems = motherGT.split('/')
for it1 in fatherGTItems:
    for it2 in motherGTItems:
        temp = [it1, it2]
        temp.sort()
        GT = temp[0]+'/'+temp[1]
        if GT == probandGT:
            isConsistent = True
            break

return isConsistent

def add_dq(tokens, probandIx, dq):
for ix in xrange(9, len(tokens)):
if (ix == probandIx):
tokens[ix] += ":%s" % dq
else:
tokens[ix] += ":."

def process_vcf(vcfFile, probandID,
fatherID, motherID):

vcfFile = abspath(vcfFile)
dataDir = dirname(vcfFile)
filePrefix = splitext(basename(vcfFile))[0]
outFile = join(dataDir, filePrefix+".de_novo.vcf")
statsFile = join(dataDir, filePrefix+".de_novo.stats.txt")

fpOut = open(outFile, 'wb')
fpStats = open(statsFile, 'wb')

countPassed = 0
countFiltered = 0
consistencyDict = {}

# parser
isFormatAdded = False
isIxFound = False
colNameLine = ""
probandIx = -1
fatherIx = -1
motherIx = -1

fpVcf = open(vcfFile, 'rb')
for line in fpVcf:
    if line[0] == '#':
        if not(isFormatAdded) and (line[:8] == "##FORMAT"):
            fpOut.write("##FORMAT=<ID=DQ,Number=1,Type=Integer,Description=\"De novo quality score\">\n")
            isFormatAdded = True
        fpOut.write(line)
        colNameLine = line
        continue
    elif not(isIxFound):
        # parse format line to get the columns of proband & parents
        tokens = colNameLine.split()
        for ix in xrange(len(tokens)):
            if tokens[ix] == probandID:
                probandIx = ix
            elif tokens[ix] == fatherID:
                fatherIx = ix
            elif tokens[ix] == motherID:
                motherIx = ix

        wrongID = ""
        if probandIx == -1:
            wrongID = probandID
        if fatherIx == -1:
            wrongID += (',%s' % fatherID)
        if motherIx == -1:
            wrongID += (',%s' % motherID)

        if wrongID:
            errMsg = ('The sample ID %s does not exist in the vcf.'
                      % wrongID)
            sys.stderr.write(errMsg + '\nProgram exits.')
            sys.exit(1)

    tokens = line.split()
    format = tokens[8]

    items = format.split(':')
    GTix = -1
    for ix in xrange(len(items)):
        if items[ix] == "GT":
            GTix = ix

    items = tokens[probandIx].split(':')
    probandGT = items[GTix]

    items = tokens[fatherIx].split(':')
    fatherGT = items[GTix]

    items = tokens[motherIx].split(':')
    motherGT = items[GTix]

    # add DQ to the format string
    format += ":DQ"
    isConsistent = check_genotype(probandGT, fatherGT, motherGT)
    if not(isConsistent):
        # DQ set to 60
        add_dq(tokens, probandIx, "60")

        # stats
        filter = tokens[6]
        if filter.upper() == "PASS":
            countPassed += 1
        else:
            countFiltered += 1

        GTstring = probandGT + '-' + fatherGT + '-' + motherGT
        if not(GTstring in consistencyDict):
            consistencyDict[GTstring] = 0
        consistencyDict[GTstring] += 1
    else:
        # DQ set to 0
        add_dq(tokens, probandIx, "0")

    newLine = ""
    for i in xrange(8):
        newLine += tokens[i] + "\t"
    newLine += format
    for i in xrange(9, len(tokens)):
        newLine += "\t" + tokens[i]
    fpOut.write(newLine+"\n")

fpVcf.close()
fpOut.close()

fpStats.write("# of passed SVs: %s\n" % (countPassed))
fpStats.write("# of filtered SVs: %s\n" % (countFiltered))
fpStats.write("probandGT-fatherGT-motherGT\tcounts\n")
genotypes = consistencyDict.keys()
genotypes.sort()
for gt in genotypes:
    fpStats.write("%s\t%s\n" % (gt, consistencyDict[gt]))
fpStats.close()

if name=='main':

usage = "denovo_scoring.py <vcf file> <proband sample ID> <father sample ID> <mother sample ID>\n"
if len(sys.argv) <= 4:
    sys.stderr.write(usage)
    sys.exit(1)

vcfFile = sys.argv[1]
probandID = sys.argv[2]
fatherID = sys.argv[3]
motherID = sys.argv[4]

if not(exists(vcfFile)):
    errMsg = ('The file %s does not exist.'
              % vcfFile)
    sys.stderr.write(errMsg + '\nProgram exits.')
    sys.exit(1)

process_vcf(vcfFile, probandID,
            fatherID, motherID)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

denovo_scoring.py extension - failure to read sample IDs in VCF #292

denovo_scoring.py extension - failure to read sample IDs in VCF #292

McClatcheyM commented Sep 5, 2022

Kmanjor commented Dec 15, 2022

yr542 commented Mar 29, 2023

Kmanjor commented Mar 29, 2023 via email

Kmanjor commented Mar 30, 2023

denovo_scoring.py extension - failure to read sample IDs in VCF #292

denovo_scoring.py extension - failure to read sample IDs in VCF #292

Comments

McClatcheyM commented Sep 5, 2022

Kmanjor commented Dec 15, 2022

yr542 commented Mar 29, 2023

Kmanjor commented Mar 29, 2023 via email

Kmanjor commented Mar 30, 2023

Manta - Structural Variant and Indel Caller

Copyright (c) 2013-2019 Illumina, Inc.

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU General Public License as published by

the Free Software Foundation, either version 3 of the License, or

at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU General Public License for more details.

You should have received a copy of the GNU General Public License

along with this program. If not, see http://www.gnu.org/licenses/.