A minimal package for saving and reading large HDF5-based chunked arrays.
This package has been developed in the Portugues lab for volumetric calcium imaging data. split_dataset is extensively used in the calcium imaging analysis package fimpy; the microscope control libraries sashimi and brunoise save files as split datasets.
napari-split-dataset supports the visualization of SplitDatasets in napari.
Split datasets are numpy-like arrays saved over multiple h5 files. The concept of split datasets is not different from e.g. zarr arrays; however, relying on h5 files allows for partial reading even within the same file, which is crucial for visualizing volumetric time series, the main application split_dataset has been developed for (see this discussion on the limitations of zarr arrays).
A split dataset is stored in a folder containing multiple numbered h5 files (one file per chunk) and a metadata json file with information on the shape of the full dataset and of its chunks.
The h5 files are saved using the flammkuchen library (formerly deepdish). Each file contains a dictionary with the data stored under the "stack" key.
SplitDataset objects can then be instantiated from the dataset path, and numpy-style indexing can be used to load data as numpy arrays. Any number of dimensions and any block size are supported in principle; the package has been used mainly with 3D and 4D arrays.
# Load a SplitDataset via a SplitDataset object:
from split_dataset import SplitDataset
ds = SplitDataset(path_to_dataset)
# Retrieve data in an interval:
data_array = ds[n_start:n_end, :, :, :]
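To illustrate why an interval request only needs to touch a few files, the sketch below maps a slice along the first axis onto the blocks it overlaps. It is a simplified illustration, not the package's internal code: it assumes the data is split along the first axis only, and the function name ``blocks_for_interval`` and the parameter ``block_len`` are hypothetical.

```python
def blocks_for_interval(n_start, n_end, block_len):
    """Map the half-open interval [n_start, n_end) along the first axis to
    (block_index, local_start, local_stop) triples, one per block touched.

    Hypothetical helper: assumes equally sized blocks along the first axis.
    """
    if block_len <= 0 or n_start < 0 or n_end < n_start:
        raise ValueError("invalid interval or block length")

    pieces = []
    pos = n_start
    while pos < n_end:
        block = pos // block_len            # index of the block holding `pos`
        block_start = block * block_len     # global index of that block's first entry
        stop = min(n_end, block_start + block_len)
        # Store indices relative to the start of the block (i.e. of the file).
        pieces.append((block, pos - block_start, stop - block_start))
        pos = stop
    return pieces


# Frames 15..41 of a dataset stored in blocks of 20 frames each:
pieces = blocks_for_interval(15, 42, block_len=20)
for block, local_start, local_stop in pieces:
    print(f"block {block}: read entries [{local_start}:{local_stop}]")
```

Only the blocks listed in ``pieces`` would have to be opened, and within each one only the local sub-interval needs to be read.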
New split datasets can be created with the split_dataset.save_to_split_dataset function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time acquisitions, a split dataset can be saved one chunk at a time: it is enough to save correctly formatted .h5 files with flammkuchen, together with the corresponding json metadata file describing the full split dataset shape (this is what happens in sashimi).
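The bookkeeping side of chunk-at-a-time saving can be sketched as follows. This is only an illustration under stated assumptions: the metadata file name (``metadata.json``), its keys (``shape_full``, ``shape_block``), the zero-padded file naming and the helper ``plan_chunks`` are all hypothetical and may not match the layout that split_dataset or sashimi actually use; check an existing dataset before relying on them. Writing the array data itself is left out.

```python
import json
import math
import tempfile
from pathlib import Path


def plan_chunks(folder, shape_full, shape_block):
    """Write a metadata file and return the chunk file names to be filled.

    Hypothetical sketch: assumes chunking along the first axis only, and
    uses made-up metadata keys and file names.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # One numbered file per chunk along the first (e.g. time) axis.
    n_chunks = math.ceil(shape_full[0] / shape_block[0])
    names = [f"{i:04d}.h5" for i in range(n_chunks)]

    # Metadata describing the full dataset shape and the chunk shape.
    metadata = {"shape_full": list(shape_full), "shape_block": list(shape_block)}
    with open(folder / "metadata.json", "w") as f:
        json.dump(metadata, f)

    # Each file in `names` would then be written with flammkuchen as it is
    # acquired, as a dictionary holding the chunk under the "stack" key.
    return names


with tempfile.TemporaryDirectory() as tmp:
    names = plan_chunks(tmp, shape_full=(100, 8, 64, 64), shape_block=(20, 8, 64, 64))
    with open(Path(tmp) / "metadata.json") as f:
        saved = json.load(f)
    print(names)
    print(saved)
```

Writing the metadata up front means that a reader can interpret the folder even while chunks are still being added, provided the final shape is known when the acquisition starts.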
- provide utilities for partial saving of split datasets
- support for more advanced indexing (step and vector indexing)
- support for cropping a SplitDataset
- support for resolution and frequency metadata
- Added support for using a SplitDataset as data in a napari layer.
...
- First release on PyPI.
Part of this package was inspired by Cookiecutter and this template.
.. _Portugues lab:
.. _Cookiecutter:
.. _this: