Minimum viable sgkit dataset #748
@jeromekelleher I'm thinking about how to cope with the |
Let's get rid of UUID, it's pointless. We've never used it and it's based on an unrealistic idea that datasets get made once and never changed. So, just hack around it in whatever way is simplest in the short term. |
Ok I've removed For ancestral allele - I'm thinking we provide a method on the dataset that takes either an array of ancestral allele values and converts it to the needed index. Could also accept a sequence that the alleles are extracted from at the site's positions? |
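A minimal sketch of the first option (an array of ancestral allele values converted to an index), assuming each site carries a list of allele strings as in sgkit's `variant_allele`, and assuming `-1` marks a site whose ancestral allele is unknown or not among that site's alleles. The name `ancestral_allele_index` is hypothetical, not an existing tsinfer or sgkit function.

```python
def ancestral_allele_index(variant_alleles, ancestral_alleles, missing=-1):
    """Map each site's ancestral allele string to its index in that site's alleles.

    Hypothetical helper: `missing` (-1 here) is an assumed sentinel for sites
    where the ancestral allele is not one of the listed alleles.
    """
    if len(variant_alleles) != len(ancestral_alleles):
        raise ValueError("Need exactly one ancestral allele per site")
    index = []
    for alleles, ancestral in zip(variant_alleles, ancestral_alleles):
        try:
            index.append(list(alleles).index(ancestral))
        except ValueError:
            # Ancestral allele is not among this site's alleles
            index.append(missing)
    return index


variant_alleles = [["A", "T"], ["G", "C"], ["T", "TA"]]
print(ancestral_allele_index(variant_alleles, ["T", "G", "N"]))  # [1, 0, -1]
```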
Both would be useful, as ancestral alleles are sometimes provided as a FASTA file by ensembl (e.g. http://ftp.ensembl.org/pub/release-108/fasta/ancestral_alleles/). I guess we would need to check that they were of the same length as the sequence length. The approach would be similar to the reference_sequence stuff, I guess, other than we don't have to store it. |
There is no sequence length stored in sgkit (makes sense, as a VCF has no such concept). We should check that the sequence value at each site is one of the alleles though, right? FASTA won't work for indels, right? We can't get alleles from the FASTA for them without knowing their length. |
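A sketch of the sequence-based option with the checks raised above. It assumes 1-based site positions (as in VCF), an ancestral sequence already loaded as a string, and empty strings as allele padding. Indel sites and sites whose ancestral base is not one of the alleles get an assumed `-1` sentinel. `ancestral_index_from_sequence` is a hypothetical helper.

```python
def ancestral_index_from_sequence(sequence, positions, variant_alleles, missing=-1):
    """Look up the ancestral base at each site and convert it to an allele index."""
    index = []
    for pos, alleles in zip(positions, variant_alleles):
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"Position {pos} is outside the sequence")
        # Drop padding entries (assumed to be empty strings)
        alleles = [a for a in alleles if a]
        # A single base from the sequence cannot describe an indel allele
        is_snp = all(len(a) == 1 for a in alleles)
        # Upper-case so that lower-case bases in the sequence still match
        base = sequence[pos - 1].upper()
        if is_snp and base in alleles:
            index.append(alleles.index(base))
        else:
            index.append(missing)
    return index


sequence = "ACGTN"
positions = [1, 3, 4, 5]
variant_alleles = [["A", "G"], ["T", "G"], ["T", "TA"], ["C", "T"]]
print(ancestral_index_from_sequence(sequence, positions, variant_alleles))
# [0, 1, -1, -1]
```

Returning a sentinel rather than raising keeps sites with no usable ancestral allele in the dataset, so the caller can decide whether to exclude them.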
There is sometimes a VCF sequence length definition, I think: you can get it using But yes, FASTA is no good for indels. In general, using the VCF |
At the moment the concept is read-only from sgkit. Any use case that requires changing the dataset will need to be thought through as where possible that should be an sgkit function/workflow unless it is very tsinfer specific. |
Right, so the question is whether SGkit should have
(https://samtools.github.io/hts-specs/VCFv4.2.pdf). I guess at the moment that data is not stored in an SGkit data file, although it is returned by htslib/cyvcf2 (annoyingly undocumented, though: brentp/cyvcf2@a138fad) |
We should change sgkit to store it if it's in the VCF. Some previous discussion here: https://github.com/pystatgen/sgkit/issues/464 |
We should be able to derive VCF-spec defined assembly and contig information (including length) from the sgkit dataset, so let's follow that up as an upstream improvement. |
I've opened https://github.com/pystatgen/sgkit/pull/946 to add contig lengths to the dataset. Assembly information is not exposed by cyvcf2 so I haven't added that yet. |
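For reference, the VCF spec carries contig lengths in `##contig=<ID=...,length=...>` header lines. The sketch below reads them with plain string handling. It is not how sgkit or cyvcf2 do it, and `contig_lengths` is a hypothetical name. It assumes no quoted field values containing commas.

```python
import re

# Matches header lines such as ##contig=<ID=20,length=62435964,assembly=B36>
CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def contig_lengths(header_lines):
    """Return {contig ID: length or None} from VCF header lines."""
    lengths = {}
    for line in header_lines:
        match = CONTIG_RE.match(line.strip())
        if match is None:
            continue
        # Naive split: assumes no commas inside quoted values
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "ID" in fields:
            # The length field is optional, so record None when it is absent
            lengths[fields["ID"]] = int(fields["length"]) if "length" in fields else None
    return lengths


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=20,length=62435964,assembly=B36>",
    "##contig=<ID=21>",
]
print(contig_lengths(header))  # {'20': 62435964, '21': None}
```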
ff30e85 to 0a0f60c
cf2e67d to 9821f20
Trying to sort out dependency issues here as sgkit has upper bounds on |
Yep, LGTM as a minimum viable start. Caught a few small things.
@@ -703,7 +694,6 @@ def __str__(self):
("format_name", self.format_name),
("format_version", self.format_version),
("finalised", self.finalised),
("uuid", self.uuid),
We should document this in the CHANGELOG I guess, but it's totally fine to say "this is gone, deal with it".
tsinfer/formats.py
Outdated
def __init__(self, path):
    self.path = path
    self.data = zarr.open(path, mode="r")
    self._num_sites, self._num_individuals, self.ploidy = self.data[
Maybe
genotypes_arr = self.data["call_genotype"]
self._num_sites, self._num_individuals, self.ploidy = genotypes_arr.shape
Currently line breaking is a bit eye hurty
tsinfer/formats.py
Outdated
@property
def sites_ancestral_allele(self):
    try:
        return self.data["sites/ancestral_allele"]
We should probably be more sgkit-like in our choice of variable names here, I guess variant_ancestral_allele?
I think it's likely we'll end up with variant_ancestral_allele_index, so I'll put that for now, but the final decision will be part of #764.
@property
def sites_genotypes(self):
    gt = self.data["call_genotype"]
Not obvious to me what this is doing, can we get a comment please? I.e., is it making a full copy and returning as a numpy array?
tsinfer/formats.py
Outdated
@property
def metadata(self):
    try:
        return self.data.attrs["metadata_schema"]
Typo, should be "metadata"
It's probably simplest to drop python 3.7 here from CI I'd say? |
It's not that simple: the CircleCI tests are also failing. I can recreate this locally, and it seems to be an issue with xarray on Py3.7 (which is unsupported). Using Py3.8 and doing an unpinned install of the CircleCI deps results in:
Along with failing tests. Pinning these versions makes the tests pass, so I need to upgrade the CircleCI tests to 3.8 and then be very careful about the pinning of sgkit's deps; this should fix CircleCI. Then remove the 3.7 tests on the GitHub workflow. |
Codecov Report
@@ Coverage Diff @@
## main #748 +/- ##
==========================================
+ Coverage 93.34% 93.46% +0.12%
==========================================
Files 17 17
Lines 5361 5691 +330
Branches 984 1008 +24
==========================================
+ Hits 5004 5319 +315
- Misses 235 246 +11
- Partials 122 126 +4
I'm going to unsubscribe here @benjeffery - can you ping me for final review when ready please? |
(Looks like a fun time trying to get the package version problem solved! 🤮 ) |
236ac2b to 61e1d70
61e1d70 to ae2b087
@jeromekelleher Have you seen this error before? "The process cannot access the file because it is being used by another process" It's in the CLI tests that I haven't touched! |
Yes, I think I'm having the same problem over in #769 ... |
ae2b087 to 770b3b1
The |
I'm not sure we need vcf support in sgkit here? |
It's nice to confirm that the version of sgkit creates a dataset we can read, we could store the dataset, but would need to update it when sgkit changed anything. |
Right, fair enough. The sgkit on conda should be "batteries included" anyway and have all the functionality. There's no advantage to getting it via pip because cyvcf still won't be available. |
@@ -10,7 +12,8 @@ pytest==7.2.0
pytest-xdist==3.0.2
python-lmdb==1.3.0
seaborn==0.12.1
sortedcontainers==2.4.0
sgkit[vcf]==0.5.0
delete the [vcf] here I think
7b04422 to db8b8ef
ff0b28c to fdebc53
@jeromekelleher Would be good to get this merged before #778 as it contains the sgkit testing infrastructure. We have issues filed for the follow-up work. |
Go for it! |
@Mergifyio rebase |
✅ Branch has been successfully rebased |
fdebc53 to 38cd717
Manual merge due to mergify config change. |
Round trips an sgkit dataset.