Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract additional metadata from VCF files #464

Open
hammer opened this issue Feb 14, 2021 · 3 comments
Open

Extract additional metadata from VCF files #464

hammer opened this issue Feb 14, 2021 · 3 comments
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc IO Issues related to reading and writing common third-party file formats

Comments

@hammer
Copy link
Contributor

hammer commented Feb 14, 2021

##INFO

  • These fields are (usually?) per variant and can be stored as data variables with (variants) dimension prefixed with variant_. We can punt cases when Number is not 1 for now.
  • We may want to consider specifically looking for some of the reserved keys enumerated in the spec.
  • In particular, we may want to parse VEP annotations nicely.
  • Filed INFO property for Variant class brentp/cyvcf2#192 to make it a bit easier to explore these fields with cyvcf2.

##FORMAT

  • These fields are per genotype call and can be stored as data variables with (variants, samples) and sometimes (ploidy) dimensions. Again, we can punt cases when Number is not 1 for now.
  • We may want to consider specifically looking for some of the reserved keys enumerated in the spec.
  • cyvcf2 makes it easy to see which FORMAT fields are available for each variant with v.FORMAT
  • Some of the standard fields are also available as properties, with slight differences in representation, e.g. missing values in the VCF file are ., inv.format('DP') they are -2147483648, and in v.gt_depths they are -1.
# v.format('DP')
v.gt_depths

# v.format('AD')
list(zip(v.gt_ref_depths, v.gt_alt_depths))

# v.format('GQ')
v.gt_quals

# v.format('PL')
list(zip(v.gt_phred_ll_homref, v.gt_phred_ll_het, v.gt_phred_ll_homalt))

##CONTIG

  • It would be nice to get the assembly name and contig length too.
  • vcf.seqnames has contig names
  • vcf.seqlens has contig lengths
  • vcf['CONTIG'] should get contig header lines but it's giving me a KeyError; some kind of encoding issue, I guess.
  • Looking at the results of vcf.header_iter(), though, it appears that cyvcf2 is not parsing out the assembly field of the contig header lines, so we will have to use vcf.raw_header or patch cyvcf2.

##SAMPLE, ##INDIVIDUAL

  • These may be used in some larger projects?
@hammer hammer added IO Issues related to reading and writing common third-party file formats data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc labels Feb 14, 2021
@timothymillar
Copy link
Collaborator

Somewhat related to this, I've been working on code to convert between indices and genotype calls for VCF fields of length 'G'.
These functions can handle arbitrary allele counts and ploidy (including mixtures) up to the point of overflowing the index.

I don't currently have a specific feature to add with these (hence no PR) but I'm likely to be working with genotype posterior distributions in future (stored in the GP format field).

@tomwhite
Copy link
Collaborator

tomwhite commented Mar 4, 2021

Thanks @timothymillar, this would be a good addition in the future. What do you mean by "up to the point of overflowing the index"?

@timothymillar
Copy link
Collaborator

A large enough combination of ploidy and n_alleles will result in an index that is too large for an int64. But this shouldn't be a problem for realistic values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc IO Issues related to reading and writing common third-party file formats
Projects
None yet
Development

No branches or pull requests

3 participants