Process SAM/BAM files according to the 1000 Genomes methods
The 1000 Genomes Project has posted its BAM processing methods online. This repository documents how I implement their methodology in our projects on our SGE computing facility.
The process has several steps.
- Read alignment, including SAM-to-BAM conversion, sorting, and indexing. This step now includes fixing mate-pair information, adding MD and NM tags, and marking PCR duplicates.
- gVCF creation
- Variant calling from the gVCFs
Read alignment is currently performed using bwa, although a number of other aligners could be substituted here.
Read alignment takes FASTQ (.fastq.gz) files as input and outputs a SAM (.sam) file.
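As a minimal sketch, an alignment command might look like the following, assuming bwa mem, paired-end reads, and a bwa-indexed reference; the file names and read group are placeholders:

```bash
# Align paired-end reads; read-group string and file names are hypothetical.
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    ref.fa sample1_R1.fastq.gz sample1_R2.fastq.gz > sample1.sam
```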
After read alignment, a few steps are performed with SAMtools.
First we fix mate information and add the MD tag; this generates an intermediate file with the _nsort suffix.
That file is re-sorted by coordinate, producing a _fixed.bam file.
The fixed file then has its PCR duplicates marked and is indexed, which completes this step.
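A sketch of how these SAMtools steps could be chained is below. The suffixes follow the _nsort/_fixed/_mrkdup convention used here, but the exact commands and flags are assumptions rather than this repository's scripts (in particular, this sketch adds the MD/NM tags after the coordinate sort, where calmd is most efficient):

```bash
# Name sort so fixmate can see read pairs together, then fix mate
# information (-m adds the mate-score tags markdup needs) -> *_nsort file.
samtools sort -n -T sample1_nsort_tmp sample1.bam \
    | samtools fixmate -m -O bam - sample1_nsort.bam

# Re-sort by coordinate and add MD/NM tags -> *_fixed.bam.
samtools sort -T sample1_csort_tmp sample1_nsort.bam \
    | samtools calmd -b - ref.fa > sample1_fixed.bam

# Mark PCR duplicates and index the result -> *_mrkdup.bam.
samtools markdup sample1_fixed.bam sample1_mrkdup.bam
samtools index sample1_mrkdup.bam
```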
The program samtools stats is called on the intermediate *.sam and *.bam files to sanity-check the processing.
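For example (file names are placeholders):

```bash
# Collect alignment statistics for a quick sanity check.
samtools stats sample1_mrkdup.bam > sample1_mrkdup.stats
# The Summary Numbers (SN) section holds read counts, error rates, etc.
grep ^SN sample1_mrkdup.stats | cut -f 2-
```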
Retained file: *_mrkdup.bam.
Removed files: *_nsort, *_nsort_tmp, *_csort_tmp, *_fixed.bam. (The _tmp files should be removed automatically.)
Variant calling is currently performed using several steps from the GATK.
First, a genomic variant call format (gVCF) file is created for each sample using the HaplotypeCaller with the --emitRefConfidence GVCF option, which results in a .g.vcf file per sample.
Processing each sample independently allows different ploidies to be specified for different samples.
Once we have a set of gVCF files, GenotypeGVCFs is called on them to produce a VCF file containing the final variants.
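A sketch of these two GATK steps, written in GATK 3 syntax (which matches the --emitRefConfidence spelling above; GATK 4 names these tools the same but spells its options differently). The file names and ploidy value are placeholders:

```bash
# Per-sample gVCF creation; --sample_ploidy can differ between samples.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fa -I sample1_mrkdup.bam \
    --emitRefConfidence GVCF --sample_ploidy 2 \
    -o sample1.g.vcf

# Joint genotyping across the per-sample gVCFs -> final VCF of variants.
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fa \
    --variant sample1.g.vcf --variant sample2.g.vcf \
    -o variants.vcf
```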
Retained files: a *.g.vcf file for each sample and a *.vcf file containing the final variants.
Use SLURM instead of SGE? @zkamvar has created his own solution, called read-processing, for his HPC. Even if you don't use SLURM, it's been my experience that his code typically provides solid solutions. Give it a look!