Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windowing fails for datasets where contigs lack variant sites #1268

Open
percyfal opened this issue Oct 10, 2024 · 1 comment
Open

Windowing fails for datasets where contigs lack variant sites #1268

percyfal opened this issue Oct 10, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@percyfal
Copy link

The function sgkit.window_by_position fails if the input dataset has contigs that lack variant sites, throwing an IndexError.

Traceback
IndexError                                Traceback (most recent call last)
Cell In[36], line 1
----> 1 ds = sgkit.window_by_position(ds, size=10)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:218, in window_by_position(ds, size, step, offset, variant_contig, variant_position, window_start_position, merge)
    214 positions = ds[variant_position].values
    215 window_start_positions = (
    216     ds[window_start_position].values if window_start_position is not None else None
    217 )
--> 218 return _window_per_contig(
    219     ds,
    220     variant_contig,
    221     merge,
    222     _get_windows_by_position,
    223     size,
    224     step,
    225     offset,
    226     positions,
    227     window_start_positions,
    228 )

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:320, in _window_per_contig(ds, variant_contig, merge, windowing_fn, *args, **kwargs)
    318 contig_window_stops = []
    319 for i, contig in enumerate(get_contigs(ds)):
--> 320     starts, stops = windowing_fn(
    321         contig, contig_bounds[i], contig_bounds[i + 1], *args, **kwargs
    322     )
    323     contig_window_starts.append(starts)
    324     contig_window_stops.append(stops)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:429, in _get_windows_by_position(contig, start, stop, size, step, offset, positions, window_start_positions)
    426 contig_pos = positions[start:stop]
    427 if window_start_positions is None:
    428     # TODO: position is 1-based (or use offset?)
--> 429     pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
    430     window_starts = np.searchsorted(contig_pos, pos_starts) + start
    431     window_stops = np.searchsorted(contig_pos, pos_starts + size) + start

IndexError: index -1 is out of bounds for axis 0 with size 0

This will often be the case for fragmented references with many contigs. I provide a minimum working example for reference.

Minimum working example

The VCF below is a modified version of the file sgkit/tests/data/sample.vcf where I added the contig key to the header:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##contig=<ID=19,length=1000>
##contig=<ID=20,length=40000>
##contig=<ID=21,length=10000>
##contig=<ID=X,length=1000>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
19	111	.	A	C	9.6	.	.	GT:HQ	0|0:10,15	0|0:10,10	0/1:3,3
19	112	.	A	G	10	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
20	4370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	7330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3:.,.
20	10696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4:.,.
20	30237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:.:56,60	0|0:48:4:51,51	0/0:61:2:.,.
20	34567	microsat1	G	GA,GAC	50	PASS	NS=3;DP=9;AA=G;AN=6;AC=3,1	GT:GQ:DP	0/1:.:4	0/2:17:2	./.:40:3
20	35237	.	T	.	.	.	.	GT	0/0	0|0	./.
X	10	rsTest	AC	A,ATG,C	10	PASS	.	GT	0	0/1	0|2

Convert to Zarr

bgzip sample.extrachrom.vcf 
tabix sample.extrachrom.vcf.gz
vcf2zarr convert sample.extrachrom.vcf.gz sample.extrachrom.vcz

and run code below to reproduce

import sgkit
import numpy as np

ds = sgkit.load_dataset("sample.extrachrom.vcz")
ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")
ds = sgkit.window_by_position(ds, size=10)
@jeromekelleher jeromekelleher added the bug Something isn't working label Oct 11, 2024
@jeromekelleher
Copy link
Collaborator

Thanks for the bug report @percyfal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants