Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential problem with writing gz files #8

Open
jrm5100 opened this issue Jan 27, 2023 · 3 comments
Open

Potential problem with writing gz files #8

jrm5100 opened this issue Jan 27, 2023 · 3 comments

Comments

@jrm5100
Copy link

jrm5100 commented Jan 27, 2023

I'm using fgoxide to write a test involving .vcf.gz files.

I've defined some input lines like

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();

let input = tempdir.path().join("input.vcf.gz");
let input_data = vec![
        r#"##fileformat=VCFv4.0	"#,
        r#"##build_id=20201027095038"#,
        r#"##Population=https://www.ncbi.nlm.nih.gov/biosample/?term=GRAF-pop"#,
        r#"##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total allele count for the population, including REF">"#,
        r#"##FORMAT=<ID=AC,Number=A,Type=Integer,Description="Allele count for each ALT allele for the population">"#,
        r#"#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMN10492695	SAMN10492696	SAMN10492697	SAMN10492698	SAMN10492699	SAMN10492700	SAMN10492701	SAMN10492702	SAMN11605645	SAMN10492703	SAMN10492704	SAMN10492705"#,
        r#"NC_000001.9	144135212	rs1553120241	G	A	.	.	.	AN:AC	8560:5387	8:8	256:224	336:288	32:24	170:117	32:24	18:13	20:15	344:296	288:248	9432:6100"#,
        r#"NC_000001.9	144148243	rs2236566	G	T	.	.	.	AN:AC	5996:510	0:0	0:0	0:0	0:0	0:0	0:0	0:0	84:8	0:0	0:0	6080:518"#,
] ;

io.write_lines(&input, &input_data)?;

My test is failing with Error: bytes remaining on stream, but passing with files I've run manually.

If I extract the generated VCF file and re-compress it with bgzip

gunzip input.vcf.gz
bgzip -c input.vcf > input2.vcf.gz

My tool works. The same issue occurs when running the tool directly on the files (so the test itself isn't the problem, unless I've written the input file incorrectly).

The bgzip version is slightly larger.

There's no difference in the vcf version of each.

diff <(gzcat test_path/input.vcf.gz) <(gzcat test_path/input2.vcf.gz)

input.vcf.gz
input2.vcf.gz

@tfenne
Copy link
Member

tfenne commented Jan 27, 2023

@jrm5100 I wonder - how is your tool reading the VCF? I'm fairly sure that io::write_lines() will write a vanilla gzipped file if the extension is .gz - not a bgzipped file. Perhaps the tool you're running expects gzipped files to be bgzipped?

If that is the case I'd also be open to making io::write_lines() write bgzipped instead of plain gzipped files as that's probably more generally useful.

@jrm5100
Copy link
Author

jrm5100 commented Jan 28, 2023

Yeah, it just occurred to me the incompatibility might be with noodles-bgzf, and I guess the name might confirm that your hunch is correct.

That's what I get for trying to write up a strange issue on a Friday evening without thinking it through.

Not sure what the best approach is. Perhaps .vcf.gz should write bgzip files (since that is the norm for vcf), but I don't know that I'd want to encode too much logic into how file extensions are written.

@nh13
Copy link
Member

nh13 commented Jan 30, 2023

What about the following simple rust psuedo-code that could be refactored into its own function. If it's only for testing purposes, that should work. We tend to use this library for compression: https://github.com/sstadick/gzp

let tempdir = TempDir::new().unwrap();
let io = fgoxide::io::Io::default();
let input = tempdir.path().join("input.vcf.gz");
let writer = BufWriter::new(File::create(&input).unwrap());
let mut bgzf_writer = BgzfSyncWriter::new(writer, Compression::new(3));
bgzf_writer.write_all(b"@NAME\nGATTACA\n+\nIIIIIII\n").unwrap();
bgzf_writer.flush().unwrap();
drop(gz_writer);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants