Extract additional metadata from VCF files #464

hammer · 2021-02-14T13:22:12Z

VCF 4.2 spec
Example VCF file: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
cyvcf2.pyx
- Header types: 'CONTIG', 'FILTER', 'FORMAT', 'GENERIC', 'INFO'
vcf_reader.py

##INFO

These fields are (usually?) per variant and can be stored as data variables with a (variants) dimension, named with a variant_ prefix. We can punt on cases where Number is not 1 for now.
We may want to consider specifically looking for some of the reserved keys enumerated in the spec.
In particular, we may want to parse VEP annotations nicely.
Filed INFO property for Variant class brentp/cyvcf2#192 to make it a bit easier to explore these fields with cyvcf2.
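
As a rough illustration of the Number=1 restriction, the sketch below picks out the ##INFO header lines that declare Number=1 and maps each one to a variant_-prefixed variable name. It works on raw header text only. The helper name `scalar_info_fields`, the sample header, and the dtype mapping are assumptions made for illustration, not sgkit or cyvcf2 API, and the regex assumes the conventional ID, Number, Type key order.

```python
import re

# Hypothetical header excerpt; a real header would come from the VCF file itself.
HEADER = """\
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
"""

# Assumes ID, Number and Type appear first and in this order.
INFO_RE = re.compile(r"^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),")

# Assumed mapping from VCF types to NumPy-style dtype names.
DTYPES = {"Integer": "int32", "Float": "float32", "String": "str", "Character": "str"}


def scalar_info_fields(header_text):
    """Return {variable_name: dtype} for INFO fields declared with Number=1."""
    fields = {}
    for line in header_text.splitlines():
        match = INFO_RE.match(line)
        if match is None:
            continue
        field_id, number, vcf_type = match.groups()
        # Punt on everything except scalar, non-flag fields.
        if number == "1" and vcf_type in DTYPES:
            fields[f"variant_{field_id}"] = DTYPES[vcf_type]
    return fields
```

With the sample header, only DP and MQ are kept; AF (Number=A) and DB (a flag with Number=0) are skipped.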

##FORMAT

These fields are per genotype call and can be stored as data variables with (variants, samples) and sometimes (ploidy) dimensions. Again, we can punt on cases where Number is not 1 for now.
We may want to consider specifically looking for some of the reserved keys enumerated in the spec.
cyvcf2 makes it easy to see which FORMAT fields are available for each variant with v.FORMAT
Some of the standard fields are also available as properties, with slight differences in representation, e.g. missing values in the VCF file are ., in v.format('DP') they are -2147483648, and in v.gt_depths they are -1.

# v.format('DP')
v.gt_depths

# v.format('AD')
list(zip(v.gt_ref_depths, v.gt_alt_depths))

# v.format('GQ')
v.gt_quals

# v.format('PL')
list(zip(v.gt_phred_ll_homref, v.gt_phred_ll_het, v.gt_phred_ll_homalt))
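
One way to reconcile the three missing-value conventions is to map them onto a single representation before storing the data. This is a minimal sketch, assuming the sentinel values quoted above. The helper `normalize_depth` is hypothetical and operates on plain Python values rather than on the arrays cyvcf2 returns.

```python
INT32_MIN = -2147483648  # reported by v.format('DP') for a missing call
GT_DEPTHS_MISSING = -1   # reported by v.gt_depths for a missing call


def normalize_depth(value, source):
    """Map a raw depth to an int, or to None when the call is missing.

    `source` says where the value came from: "vcf" (raw text field),
    "format" (v.format('DP')) or "gt_depths" (v.gt_depths).
    """
    if source == "vcf":
        return None if value == "." else int(value)
    if source == "format":
        return None if value == INT32_MIN else int(value)
    if source == "gt_depths":
        return None if value == GT_DEPTHS_MISSING else int(value)
    raise ValueError(f"unknown source: {source!r}")
```

Whichever sentinel is chosen for storage, doing the conversion in one place avoids mixing -1 and -2147483648 in the same variable.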

##CONTIG

It would be nice to get the assembly name and contig length too.
vcf.seqnames has contig names
vcf.seqlens has contig lengths
vcf['CONTIG'] should get contig header lines but it's giving me a KeyError; some kind of encoding issue, I guess.
Looking at the results of vcf.header_iter(), though, it appears that cyvcf2 is not parsing out the assembly field of the contig header lines, so we will have to use vcf.raw_header or patch cyvcf2.
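
If we go the raw-header route, the contig lines can be parsed with a small regex. The sketch below is one possible approach, not cyvcf2 functionality. `parse_contigs` is a hypothetical helper, the first sample line is adapted from the example in the VCF spec, and the second is made up. In practice the input string would be vcf.raw_header.

```python
import re

# Sample header text; the values are illustrative only.
RAW_HEADER = (
    '##fileformat=VCFv4.2\n'
    '##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">\n'
    '##contig=<ID=21,length=46944323>\n'
)

# key=value pairs; a value is either a quoted string or runs to the next comma.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,>]*)')


def parse_contigs(raw_header):
    """Return one dict per ##contig line, keeping every key (including assembly)."""
    contigs = []
    for line in raw_header.splitlines():
        if not line.startswith("##contig=<"):
            continue
        body = line[len("##contig=<"):].rstrip(">")
        contig = {key: value.strip('"') for key, value in PAIR_RE.findall(body)}
        if "length" in contig:
            contig["length"] = int(contig["length"])
        contigs.append(contig)
    return contigs
```

This does not handle escaped quotes inside quoted values, which should be rare in contig lines.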

##SAMPLE, ##INDIVIDUAL

These may be used in some larger projects?


timothymillar · 2021-03-03T23:10:48Z

Somewhat related to this, I've been working on code to convert between indices and genotype calls for VCF fields of length 'G'.
These functions can handle arbitrary allele counts and ploidy (including mixtures) up to the point of overflowing the index.

I don't currently have a specific feature to add with these (hence no PR) but I'm likely to be working with genotype posterior distributions in future (stored in the GP format field).
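
For readers unfamiliar with the Number=G ordering, here is a short sketch of the index calculation as I understand it from the VCF spec: for sorted alleles a_1 <= ... <= a_P, the index is the sum over k of C(a_k + k - 1, k). This is not the code referred to above, and the function names are hypothetical. The inverse uses a greedy search from the highest position down.

```python
from math import comb


def genotype_count(ploidy, n_alleles):
    """Number of unphased genotypes, i.e. the length of a Number=G field."""
    return comb(n_alleles + ploidy - 1, ploidy)


def genotype_index(alleles):
    """Index of an unphased genotype (sequence of allele numbers) in a G field."""
    # enumerate() is 0-based, so position k here corresponds to k + 1 in the formula.
    return sum(comb(allele + k, k + 1) for k, allele in enumerate(sorted(alleles)))


def index_to_genotype(index, ploidy):
    """Inverse of genotype_index: return the sorted alleles as a tuple."""
    alleles = []
    for k in range(ploidy, 0, -1):
        allele = 0
        # Find the largest allele whose term comb(allele + k - 1, k) fits in index.
        while comb(allele + k, k) <= index:
            allele += 1
        index -= comb(allele + k - 1, k)
        alleles.append(allele)
    return tuple(reversed(alleles))
```

For diploids this reduces to the familiar k*(k+1)/2 + j for genotype j/k, so 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 map to indices 0 to 5.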

tomwhite · 2021-03-04T09:40:52Z

Thanks @timothymillar, this would be a good addition in the future. What do you mean by "up to the point of overflowing the index"?

timothymillar · 2021-03-04T19:03:08Z

A large enough combination of ploidy and n_alleles will result in an index that is too large for an int64. But this shouldn't be a problem for realistic values.
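
To put rough numbers on this, the largest index for a given ploidy and allele count is C(n_alleles + ploidy - 1, ploidy) - 1, which can be compared against the int64 maximum. A minimal sketch using Python's arbitrary-precision integers; the function names are hypothetical.

```python
from math import comb

INT64_MAX = 2**63 - 1


def max_genotype_index(ploidy, n_alleles):
    """Largest index in a Number=G field for this ploidy and allele count."""
    return comb(n_alleles + ploidy - 1, ploidy) - 1


def fits_in_int64(ploidy, n_alleles):
    """True if every genotype index can be represented as an int64."""
    return max_genotype_index(ploidy, n_alleles) <= INT64_MAX
```

For example, an octoploid with 20 alleles fits comfortably, while ploidy 64 with 64 alleles does not.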

hammer added the labels IO (Issues related to reading and writing common third-party file formats) and data representation (Issues related to how data is represented: data types, data structures, indexes, access methods, etc.) on Feb 14, 2021

tomwhite mentioned this issue Feb 23, 2021

VCF info and format fields #471

Merged

tomwhite mentioned this issue Mar 16, 2021

Support Number=G VCF fields #493

Closed

hammer mentioned this issue Mar 23, 2021

Enable indexing on datasets by default #473

Open

hammer mentioned this issue May 27, 2021

Use of int8 for variant_contig results in integer overflow with fragmented reference genomes #584

Closed

tomwhite mentioned this issue Oct 4, 2021

Mean of windowed popgen stats #662

Open

This was referenced Nov 1, 2022

Minimum viable sgkit dataset tskit-dev/tsinfer#748

Merged

Add contig_lengths dataset attribute if found in the VCF file #946

Merged

benjeffery mentioned this issue Nov 3, 2022

sgkit: Use sequence length from dataset tskit-dev/tsinfer#763

Open

tomwhite mentioned this issue Feb 21, 2023

Generate VCF header when writing, if no header is explicitly supplied #1021

Merged

3 tasks
