For incremental writing, Parquet would probably be best: the data would be written in batches, and each batch could be a row group. Unfortunately, we don't have an interface set up to keep an output file handle open and write one row group at a time, though it wouldn't be a big modification of what we have. On the other hand, I'm reluctant to add that interface because we're in the process of rewriting the file-writing for Awkward 2.0, so such a modification would have a short life.

Instead of writing row groups, what about writing a Parquet dataset as a collection of files? That's a standard, well-recognized format (see the Arrow docs), especially if the collection of files has a `_metadata` file; the ak.to_parquet docs describe how to do this. Would that work for you: one chunk (as much as will fit in memory) per file, and then treating those files as a dataset?
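If it helps, here is a minimal sketch of that workflow, assuming Awkward 1.x with pyarrow installed; `compute_feature_chunks` is a hypothetical stand-in for your feature-extraction loop, and the directory name is made up:

```python
import os

import awkward as ak
import pyarrow.parquet as pq

os.makedirs("features", exist_ok=True)

# Write each chunk to its own file as soon as it is computed; the
# directory as a whole is the Parquet dataset.
for i, chunk in enumerate(compute_feature_chunks()):  # hypothetical generator
    ak.to_parquet(chunk, f"features/part-{i:05d}.parquet")

# Read the whole directory back as one logical array: pyarrow treats the
# directory as a single dataset, and ak.from_arrow converts the result.
table = pq.read_table("features")
array = ak.from_arrow(table)
```

The Arrow documentation also shows how to write a `_metadata` file for such a dataset (e.g. with `pyarrow.parquet.write_metadata`), which makes opening it later cheaper.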
Hi there!
I'm building a library for speech data representation and processing for deep learning called lhotse. One aspect of speech data is its variable length, so features describing utterances will always have one dynamic dimension. It is very difficult to find a project that supports storing ragged arrays like these efficiently, and I was intrigued to find out about Awkward Array.
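For concreteness, here is the kind of ragged structure I mean (a made-up example):

```python
import awkward as ak

# Two utterances with different numbers of frames (ragged first axis)
# but a fixed feature dimension.
features = ak.Array([
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # utterance 1: 3 frames
    [[0.7, 0.8]],                          # utterance 2: 1 frame
])
print(ak.num(features, axis=1))  # [3, 1] frames per utterance
```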
My question is: what is the preferred format for incremental writing of awkward arrays? I saw mentions of HDF5, Arrow, Feather, and Parquet in the docs, but the examples all seemed to assume that the full array is known ahead of time. That is not the case with iterative feature extraction: we might have 10,000 hours of speech and want to store the arrays as we compute them, for later model training.
I appreciate your help and suggestions.