Draft bed2zarr code #281
Draft code to convert Bed3-format to Zarr.
Thanks for this @percyfal, and welcome to sgkit-dev 👋! Some high-level thoughts:
Does that clarify at all?
As a follow-up, we would probably want to write a spec somewhere that lays out exactly the array names etc., but I don't think there's any need initially. Let's get something working first and spec it out later.
Thanks for the feedback. Let me address your points one by one.
Once this tool works, my long-term wish is to have functionality in sgkit that corrects summary statistics for accessible sites (be it in windows or not), but that also summarizes statistics across features from annotations. I would be happy to draft that too if it is a good fit, but I think in this case opening a discussion on sgkit prior to development would be beneficial?
To clarify: by chromosomes as different datasets I mean having a group for a BED feature (say mask) where one dataset per chromosome is created within the group, instead of the current draft, where I make one long array equal to the total genome length. Some organisms I work on have very fragmented genomes (+100k scaffolds/contigs).
I'm hoping that BedZarr can be as lightweight as possible and purely just take the input text file and convert it to a Zarr with the corresponding arrays. I think we can then place the burden of checking the validity of these intervals on annotating a VCF Zarr with the things we're interested in. So, suppose we have a BED that specifies an accessibility mask, accessibility.bed. The workflow might look like:
then in sgkit we have something like

    access_zarr = zarr.open("accessibility.zarr")
    # NB: These are names from the top of my head, not an actual proposal!
    ds = sgkit.add_accessibility_mask(ds, start=access_zarr["start"], end=access_zarr["end"])

That is, sgkit just takes the start and end coordinates of the intervals as arrays, and is decoupled from the BedZarr format. Sgkit then adds a new boolean
Does that help clarify?
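For illustration only, here is a minimal sketch of what marking variants from start/end arrays could look like. The function name `accessible_variants` is hypothetical and not part of sgkit; it assumes a single contig, 0-based positions, and sorted, non-overlapping, half-open [start, end) intervals as in BED.

```python
from bisect import bisect_right


def accessible_variants(positions, start, end):
    """Return one boolean per position: True if it lies in any interval.

    Hypothetical helper. Assumes 0-based positions and half-open
    [start, end) intervals that are sorted by start and do not overlap.
    """
    flags = []
    for pos in positions:
        # Index of the last interval whose start is <= pos, or -1 if none.
        i = bisect_right(start, pos) - 1
        flags.append(i >= 0 and pos < end[i])
    return flags


start = [0, 20, 50]
end = [10, 30, 60]
print(accessible_variants([5, 10, 25, 49, 59], start, end))
# [True, False, True, False, True]
```

Position 10 is reported as not accessible because the interval end is exclusive.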
Ok, I see what you mean; basically convert the BED file columns to separate arrays:
would translate to
in Zarr. However, I don't think it is sufficient to just mark variants as accessible or not, as we also need to keep track of the accessibility of non-variant sites. If you want a genome-wide summary of pi, you need to know the accessibility up to the end of the chromosome.
If *=variant site, 0=accessible, 1=masked, and chopping up in 10bp-windows, the first window has 1 variant site but the actual window size is 5bp, not 10. I'll focus on the BED conversion for now, provided we agree on the output format, and deal with the other stuff later.
I see - well that's a different problem. Let's just convert the file to Zarr first and worry about how to use it later!
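The window-size point can be sketched with a small made-up example. The mask and variant positions below are illustrative values chosen to match the description (first 10 bp window: 1 variant, 5 accessible bases); they are not data from the PR, and `window_summary` is a hypothetical helper.

```python
def window_summary(mask, variant_sites, window_size):
    """Per window: (variants at accessible sites, number of accessible bases).

    mask[i] is 0 for an accessible site and 1 for a masked one.
    """
    variants = set(variant_sites)
    out = []
    for lo in range(0, len(mask), window_size):
        hi = min(lo + window_size, len(mask))
        accessible = [i for i in range(lo, hi) if mask[i] == 0]
        n_variants = sum(1 for i in accessible if i in variants)
        out.append((n_variants, len(accessible)))
    return out


# Made-up 20 bp contig: sites 5-9 and 18-19 are masked.
mask = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
variant_sites = [2, 12, 15]
print(window_summary(mask, variant_sites, 10))
# [(1, 5), (2, 8)]
```

A per-site statistic for the first window would then be divided by 5 rather than by the nominal window size of 10.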
- use pandas for reading
- write to isolated zarr archive
- map BED columns to arrays named after BED specification (hts-specs)
Ok, I updated the code to translate the three mandatory fields
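As a rough sketch of the column-to-array mapping for the three mandatory fields: the code below uses only the standard library rather than pandas and Zarr, so it is not the PR's implementation. The array names `chrom`, `chromStart`, and `chromEnd` follow the BED specification field names; the helper name `read_bed3` is hypothetical.

```python
import csv
import io


def read_bed3(handle):
    """Parse tab-delimited BED3 text into one list per mandatory column."""
    columns = {"chrom": [], "chromStart": [], "chromEnd": []}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        columns["chrom"].append(row[0])
        columns["chromStart"].append(int(row[1]))
        columns["chromEnd"].append(int(row[2]))
    return columns


bed_text = "chr1\t0\t10\n# a comment\nchr2\t5\t8\n"
print(read_bed3(io.StringIO(bed_text)))
# {'chrom': ['chr1', 'chr2'], 'chromStart': [0, 5], 'chromEnd': [10, 8]}
```

Each list would then be written out as its own array in the Zarr archive.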
This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so
Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?
In that case I guess we could adopt the VCZ approach and generate an ID mapping to the
Sure. The remaining columns are well-defined (table 2 in spec) and would make drafting a spec pretty straight-forward:
Would you have the spec end up in a schema or be stored separately, as you do with the VCFZarr spec? This raises the question: I haven't figured out where you set the schema, such as
- guess BED file type
- add draft schema
- add tests
I went ahead and implemented most needed functionality based on code in I have added:
Remaining issues:
Thanks @percyfal! We have an sgkit developers call every second Monday at 1600 UK time, and the next one is on Monday (30th). If you're available, it would be great to discuss this there? I can send you the details if you're interested.
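One possible way to guess the BED file type is to count the columns of the first record, sketched below. This is an assumption about the approach, not the PR's code: `guess_bed_type` is a hypothetical name, the input is assumed to be tab-delimited with no custom extra columns, and 10 or 11 columns are rejected based on my reading of the hts-specs BED document.

```python
def guess_bed_type(line):
    """Return n for a BEDn record, judged by its number of tab-separated fields."""
    n = len(line.rstrip("\n").split("\t"))
    # Assumption: at least the 3 mandatory fields, at most 12 standard fields,
    # and no 10- or 11-column files.
    if n < 3 or n > 12 or n in (10, 11):
        raise ValueError(f"unsupported number of BED columns: {n}")
    return n


print(guess_bed_type("chr1\t0\t10\n"))
# 3
print(guess_bed_type("chr1\t0\t10\tfeature1\t0\t+\n"))
# 6
```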
- simplification of main bed2zarr function
- encode features, contigs as Categories
- add specs for BED9-BED12
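A minimal sketch of what encoding contigs (or feature names) as categories can mean: store the unique names once and replace each record's string by an integer code. The helper `encode_categories` is hypothetical, and this standard-library version only illustrates the idea, not the code in the commit.

```python
def encode_categories(values):
    """Return (ids, codes): unique values in first-seen order, plus one
    integer code per input value that indexes into ids."""
    index = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(index.setdefault(value, len(index)))
    return list(index), codes


ids, codes = encode_categories(["chr1", "chr1", "chr2", "chr1"])
print(ids)
# ['chr1', 'chr2']
print(codes)
# [0, 0, 1, 0]
```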
@jeromekelleher I have refactored and simplified the code somewhat and added specs for most cases. I have left out support for the BED9 and BED12 formats for now since the encoding of columns 9-12 requires some thought. To reiterate, the BED specifications for the last 4 columns are
The tricky(?) thing about columns 11 and 12 is that every record (line) consists of arrays of potentially different lengths - I haven't added any documentation yet but could do so if required before merging the PR. I added test data files and I have modified
This looks great @percyfal! I agree we shouldn't bother with the 9 and 12 col formats, they seem deeply obscure to me. I'm not sure what's happened with du here, but I think this test is more trouble than it's worth and we should think about changing it to work on some known pre-written files or something. I'll take a close look over the coming days, but I think we can merge this pretty much as-is.
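To make the ragged-array issue concrete, the sketch below parses comma-separated block lists of different lengths and pads them to a rectangular shape. Padding is only one possible encoding and is not something the PR settled on; the `FILL` sentinel and the helper name are hypothetical, and the input values are made up.

```python
FILL = -2  # hypothetical padding sentinel, not taken from any spec


def pad_block_column(fields):
    """Parse comma-separated integer lists and pad them to equal length."""
    # A trailing comma is tolerated, hence the `if x` filter.
    rows = [[int(x) for x in field.split(",") if x] for field in fields]
    width = max((len(row) for row in rows), default=0)
    return [row + [FILL] * (width - len(row)) for row in rows]


print(pad_block_column(["567,488,1622,", "354,109", "10"]))
# [[567, 488, 1622], [354, 109, -2], [10, -2, -2]]
```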
Draft code to convert Bed3-format to Zarr. Addresses sgkit-dev/sgkit#1219 where we briefly discuss the need for a tool to convert bed to Zarr. As suggested, I started a draft (at least that was my interpretation, or would you have preferred an issue to start with @jeromekelleher?).
Before going too far in development, I submit this draft to discuss some of the issues I have at the moment:
bed2zarr [OPTIONS] BED_PATH ZARR_PATH BED_ARRAY
where BED_ARRAY is the name of the Zarr dataset. BED_ARRAY is bed_mask, where the BED file is stored as a 0/1-array, along with bed_mask_contig, which contains the contig for each site, modeled after variant_contig. Both these arrays have a length equal to the total genome length.
I guess one could add support for other BED formats later on. Here, I focus on the more specific task of generating 0/1-based sequence masks to indicate missing data / genome accessibility.
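A rough standard-library sketch of the two genome-length arrays described here. The contig lengths, the `build_mask` helper, and the convention that 1 marks sites covered by a BED interval are all assumptions for illustration; the draft itself writes Zarr arrays.

```python
def build_mask(contig_lengths, intervals):
    """Build a 0/1 mask and a per-site contig index over the whole genome.

    contig_lengths: dict mapping contig name to length, in genome order.
    intervals: iterable of (chrom, start, end), 0-based and half-open.
    """
    offsets = {}
    contig_index = []
    total = 0
    for i, (name, length) in enumerate(contig_lengths.items()):
        offsets[name] = total
        contig_index.extend([i] * length)
        total += length
    mask = [0] * total
    for chrom, start, end in intervals:
        offset = offsets[chrom]
        for pos in range(start, end):
            mask[offset + pos] = 1  # assumption: 1 = covered by a BED interval
    return mask, contig_index


bed_mask, bed_mask_contig = build_mask(
    {"chr1": 5, "chr2": 3}, [("chr1", 1, 3), ("chr2", 0, 2)]
)
print(bed_mask)
# [0, 1, 1, 0, 0, 1, 1, 0]
print(bed_mask_contig)
# [0, 0, 0, 0, 0, 1, 1, 1]
```

Both lists have one entry per site, which is why this layout grows with total genome length rather than with the number of BED records.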