split_dataset


A minimal package for saving and reading large HDF5-based chunked arrays.

This package was developed in the Portugues lab for volumetric calcium imaging data. split_dataset is used extensively in the calcium imaging analysis package fimpy, and the microscope control libraries sashimi and brunoise save their output as split datasets.

napari-split-dataset supports the visualization of SplitDatasets in napari.

Why use split datasets?

Split datasets are numpy-like arrays saved over multiple h5 files. The concept is similar to e.g. zarr arrays; however, relying on h5 files allows for partial reading even within a single file, which is crucial for visualizing volumetric time series, the main application split_dataset was developed for (see this discussion on the limitations of zarr arrays).

Structure of a split dataset

A split dataset is a folder containing multiple numbered h5 files (one file per chunk) and a JSON metadata file with information on the shape of the full dataset and of its chunks. The h5 files are saved using the flammkuchen library (formerly deepdish). Each file contains a dictionary with the data stored under the stack key.
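For illustration, one chunk and the metadata of an existing split dataset can be inspected by hand as follows. This is a minimal sketch: the metadata filename stack_metadata.json and its content are assumptions here, and in practice SplitDataset handles all of this for you.

import json
from pathlib import Path

import flammkuchen as fl

dataset_path = Path("path/to/dataset")

# The JSON file describes the shape of the full dataset and of its
# chunks (the filename here is an assumption of this sketch):
metadata = json.loads((dataset_path / "stack_metadata.json").read_text())

# Each numbered h5 file holds one chunk, as a dictionary with the
# data under the "stack" key:
first_file = sorted(dataset_path.glob("*.h5"))[0]
chunk = fl.load(str(first_file))["stack"]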

SplitDataset objects can then be instantiated from the dataset path, and numpy-style indexing used to load data as numpy arrays. In principle, any number of dimensions and block sizes is supported; the package has been used mainly with 3D and 4D arrays.

Minimal example

# Open an existing split dataset from its folder path:
from split_dataset import SplitDataset
ds = SplitDataset(path_to_dataset)

# Retrieve data over an interval with numpy-style indexing:
data_array = ds[n_start:n_end, :, :, :]

Creating split datasets

New split datasets can be created with the split_dataset.save_to_split_dataset function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time-lapse acquisitions, a split dataset can be saved one chunk at a time: it is enough to save correctly formatted .h5 files with flammkuchen, together with the corresponding JSON metadata file describing the full split dataset shape (this is what happens in sashimi).
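A minimal sketch of the chunk-by-chunk route follows; the file naming scheme, the metadata filename, and the metadata keys are assumptions for illustration, not a documented format (sashimi is the reference for the real layout).

import json
from pathlib import Path

import flammkuchen as fl
import numpy as np

dest = Path("path/to/new_dataset")
dest.mkdir(parents=True, exist_ok=True)

full_shape = (100, 30, 256, 256)   # e.g. a (t, z, y, x) time series
block_shape = (10, 30, 256, 256)   # one chunk every 10 timepoints

# Save each chunk as a numbered h5 file, with the data under "stack":
for i in range(full_shape[0] // block_shape[0]):
    chunk = np.zeros(block_shape, dtype=np.float32)  # placeholder data
    fl.save(str(dest / f"{i:04d}.h5"), {"stack": chunk})

# Write the JSON metadata describing the full dataset; the key names
# here are assumptions of this sketch:
metadata = {"shape_full": full_shape, "shape_block": block_shape}
(dest / "stack_metadata.json").write_text(json.dumps(metadata))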

TODO

  • provide utilities for partial saving of split datasets
  • support for more advanced indexing (step and vector indexing)
  • support for cropping a SplitDataset
  • support for resolution and frequency metadata

History

0.4.0 (2021-03-23)

  • Added support to use a SplitDataset as data in a napari layer.

...

0.1.0 (2020-05-06)

  • First release on PyPI.

Credits

Part of this package was inspired by Cookiecutter and this template.

