[Feature]: Chunk all TimeSeries data and timestamps by default #1945
Comments
This is tougher than it sounds. We tried the default compression settings for a few large TimeSeries datasets early on and found that they produced chunks that were very long in time and narrow in channels, which was problematic for a number of use cases, including visualization with Neurosift. These default settings just don't work well for us, and we need to be more thoughtful about our chunk shapes. We have implemented code that automatically wraps all TimeSeries with DataIOs here: The chunk shapes are determined with these kinds of considerations in mind. I'd be fine with thinking about migrating some of this into HDMF, especially since these tools are useful on their own outside of the rest of NeuroConv. Thoughts on this, @CodyCBakerPhD?
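For context on the chunk-shape concern above, here is one quick way to see what shape h5py's auto-guesser actually picks when you only pass `chunks=True` (illustrative only; the dataset shape and dtype below are arbitrary example values, not taken from this thread):

```python
# Illustrative sketch: inspect the chunk shape h5py's auto-guesser picks.
# The dataset shape/dtype here are arbitrary example values.
import h5py

with h5py.File("chunk_inspect.h5", "w") as f:
    dset = f.create_dataset(
        "eseries",
        shape=(30_000 * 600, 384),  # e.g. 10 min at 30 kHz on 384 channels
        dtype="int16",
        chunks=True,                # let h5py guess a chunk shape
        compression="gzip",
    )
    print(dset.chunks)  # check whether the guess is long in time and narrow in channels
```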
Came searching to see if y'all had talked about this, as I am working on doing default chunking and compression. Curious what the blockers would be to applying some sensible defaults for both chunking and compression? Most neuro data is extremely compressible, and the format gives pretty decent hints about the sizes and shapes to expect. I think your average neuroscientist is probably not aware of and likely doesn't care about chunking/compression, but they probably do care if it takes an order of magnitude more time and space to use their data. Seems benchmarkable/optimizable? Like write a simple chunk size guesser that e.g. targets 512 KiB chunks, and measure compression ratio and I/O speed? Would be happy to help if this is something we're interested in :)
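For what it's worth, a rough sketch of the kind of chunk-size guesser described above might look like the following (the ~512 KiB target and the keep-channels-whole heuristic are assumptions for illustration, not an agreed-upon design):

```python
# Rough sketch of a chunk-size guesser targeting ~512 KiB chunks
# (the target size and the keep-channels-whole heuristic are assumptions).
import math
import numpy as np

def guess_chunk_shape(shape, dtype, target_bytes=512 * 1024):
    """Split along the time (first) axis, keeping the other dimensions whole."""
    itemsize = np.dtype(dtype).itemsize
    frame_bytes = itemsize * math.prod(shape[1:])  # bytes per time step
    time_len = max(1, min(shape[0], target_bytes // frame_bytes))
    return (time_len, *shape[1:])

print(guess_chunk_shape((18_000_000, 384), "int16"))  # -> (682, 384)
```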
@sneakers-the-rat yes, we have done a lot of research on this. Here's the summary:

- compression ratio: we rarely ever see an order of magnitude. It's usually a savings of 20-50%, which is great, but I don't want to over-promise.
- compressor: Of the HDF5-compatible compressors, zstd is a bit better than gzip all-around (read speed, write speed, and compression ratio). However, it does not come as default with HDF5 and requires a bit of extra installation. You can do better with other compressors that you can use in Zarr but not easily in HDF5.
- size: The official HDF Group recommendation used to be 10 KiB, which works well on disk but does not work well for streaming applications. 10 MiB is much better if you want to stream chunks from the cloud.
- shape: This one is tricky. In h5py, setting

We have implemented all of this in NeuroConv, and all of it is automatically applied by
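As a concrete, hedged example of the "extra installation" point for zstd: the third-party `hdf5plugin` package is one way to make the zstd filter available to h5py (the chunk shape and compression level below are arbitrary example values):

```python
# Illustration: zstd in HDF5 via the third-party hdf5plugin package
# (chunk shape and clevel are arbitrary example values).
import numpy as np
import h5py
import hdf5plugin  # registers the zstd filter with HDF5 on import

data = np.random.randint(-1000, 1000, size=(100_000, 64), dtype=np.int16)
with h5py.File("example_zstd.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        chunks=(8192, 64),            # 8192 * 64 * 2 bytes = 1 MiB per chunk
        **hdf5plugin.Zstd(clevel=3),  # expands to compression/compression_opts kwargs
    )
```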
thx for the info :) yes you're right, i was testing with gzip earlier and got ~50% across a half dozen files (small sample); i was remembering the results i got w/ lossless and lossy video codecs on video data, my bad. makes sense! sry for butting in, onwards to the glorious future where we untether from hdf5 <3
What would you like to see added to PyNWB?
Chunking generally improves read/write performance and is more cloud-friendly (and LINDI-friendly). (Related to NeurodataWithoutBorders/lindi#84.)
I suggest that `TimeSeries.__init__` wraps data and timestamps with an `H5DataIO` or `ZarrDataIO`, depending on backend, with `chunks=True`, if the input data/timestamps are not already wrapped. We can add flags, e.g. `chunk_data=True` and `chunk_timestamps=True`, that users can set to `False` to turn off this behavior. A challenge will be figuring out the backend within `TimeSeries`...

We could use the h5py defaults for now, and more targeted defaults for `ElectricalSeries` data / `TwoPhotonSeries` data later. I believe all Zarr data are already chunked by default.
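A minimal sketch of what that wrapping could look like, assuming the hypothetical `chunk_data` / `chunk_timestamps` flags proposed above (the helper name is made up; this is not the actual PyNWB implementation):

```python
# Sketch only: wrap raw arrays in H5DataIO(chunks=True) unless already wrapped.
# The _maybe_wrap helper and the chunk_data/chunk_timestamps flags are
# hypothetical names from this proposal, not existing PyNWB API.
from hdmf.data_utils import DataIO
from hdmf.backends.hdf5 import H5DataIO

def _maybe_wrap(array, enable_chunking=True):
    """Wrap a raw array for chunked HDF5 storage if it is not already a DataIO."""
    if array is None or not enable_chunking or isinstance(array, DataIO):
        return array
    return H5DataIO(data=array, chunks=True)  # chunks=True lets h5py guess a chunk shape

# Inside a hypothetical TimeSeries.__init__(..., chunk_data=True, chunk_timestamps=True):
#     data = _maybe_wrap(data, chunk_data)
#     timestamps = _maybe_wrap(timestamps, chunk_timestamps)
```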
Is your feature request related to a problem?
Contiguous HDF5 datasets have slow read performance along their non-contiguous dimensions and are difficult to stream or to use with Zarr/LINDI.
What solution would you like?
Chunking of time series data by default.
Do you have any interest in helping implement the feature?
Yes.