partition_into_regions fails to partition correctly when absent contigs are listed in the header. #1169
Comments
Hard coding the return value of
Could you reindex with tabix as a quick workaround, and see if the code works in that case?
Yes, I'd have to copy the VCF somewhere with write perms. Will try once this parse is complete; it's over halfway done.
Right - can't have an index anywhere else except in the same dir. Sigh.
I think this can be fixed by adding some functionality to cyvcf2. I made a start on something similar here: Alternatively, we are actually fully parsing the indexes locally, so could probably derive the counts per-contig there.
Can you dig in a bit more here @benjeffery, and maybe give us the output of
Ah, I'm just after hitting this issue now. CSI indexed VCFs with multiple contigs in the header are not being treated correctly.
Marking this as a bug, as there's a good chance of data loss as a result of this.
Closing in favour of #1201 (I think this is an instance of that bug)
I have a 56GB VCF which contains the variants for part of chr20; however, the header of the VCF lists over 2000 contigs! This confuses partition_into_regions, which returns a region for each of the absent contigs and leaves all the actual variants in the VCF in one huge part. It also somewhat splits up the first (absent) contig.

Digging into the code we have:

The VCF I am using is indexed with a csi file, so this is where the confusion is happening, as sgkit is using the header to determine what csi means by contig 0.

Here (samtools/bcftools#816 (comment)) they say that the tabix metadata is in the CSI aux field, and indeed I can see a string chr20 in that field (along with other things), so maybe we can try to read the contig names from the CSI? Will dig a bit further, but thought it worth reporting in case anyone has experience here.
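
For anyone wanting to check what their own CSI file records, here is a minimal sketch of reading the contig names out of the aux field. It is not sgkit's code. It assumes the index was written with tabix-style metadata (seven little-endian int32 values followed by NUL-terminated names), which is my reading of the CSI/tabix layout; the function name read_csi_contig_names is made up for illustration.

```python
import gzip
import struct
from typing import List


def read_csi_contig_names(data: bytes) -> List[str]:
    """Return contig names stored in the aux field of a CSI index.

    `data` is the *decompressed* content of a .csi file.
    Assumes tabix-style metadata in aux; returns [] if aux is too short.
    """
    # CSI header: magic "CSI\1", then min_shift, depth, l_aux (int32 each)
    magic, _min_shift, _depth, l_aux = struct.unpack_from("<4siii", data, 0)
    if magic != b"CSI\x01":
        raise ValueError("not a CSI index")
    aux = data[16:16 + l_aux]
    if len(aux) < 28:
        return []  # no tabix metadata present
    # Assumed tabix metadata: format, col_seq, col_beg, col_end, meta, skip, l_nm
    *_conf, l_nm = struct.unpack_from("<7i", aux, 0)
    names = aux[28:28 + l_nm]
    # Names are concatenated and NUL-terminated
    return [n.decode() for n in names.split(b"\x00") if n]


def load_csi(path: str) -> bytes:
    # A .csi file is BGZF-compressed, which the gzip module can read
    with gzip.open(path, "rb") as f:
        return f.read()
```

Calling read_csi_contig_names(load_csi("my.vcf.gz.csi")) on a real index (the path is a placeholder) should list only the contigs the indexer actually saw, which could then be used instead of the header order to interpret the index's contig numbers.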