For incremental writing, Parquet would probably be best: the data would be written in batches, and each batch could be a row group. Unfortunately, we don't have an interface set up to keep an output file handle open and write one row group at a time, though it wouldn't be a big modification of what we have. On the other hand, I'm reluctant to add that interface because we're in the process of rewriting the file-writing for Awkward 2.0, so such a modification would have a short life.

Instead of writing row groups, what about writing a Parquet dataset as a collection of files? That's a standard, well-recognized format (see the Arrow docs), especially if the collection of files has a `_metadata` file; the ak.to_parquet docs describe how to do this. Would that work for you: one chunk (as much as will fit in memory) per file, and then treating those files as a dataset?
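If it helps, here is a minimal sketch of that workflow, assuming Awkward 1.x with pyarrow installed; `compute_feature_chunks` is a hypothetical stand-in for your feature-extraction loop, and the directory name is made up:

```python
import os

import awkward as ak
import pyarrow.parquet as pq

os.makedirs("features", exist_ok=True)

# Write each chunk to its own file as soon as it is computed; the
# directory as a whole is the Parquet dataset.
for i, chunk in enumerate(compute_feature_chunks()):  # hypothetical generator
    ak.to_parquet(chunk, f"features/part-{i:05d}.parquet")

# Read the whole directory back as one logical array: pyarrow treats the
# directory as a single dataset, and ak.from_arrow converts the result.
table = pq.read_table("features")
array = ak.from_arrow(table)
```

The Arrow documentation also shows how to write a `_metadata` file for such a dataset (e.g. with `pyarrow.parquet.write_metadata`), which makes opening it later cheaper.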
Hi there!
I'm building a library for speech data representation and processing for deep learning called lhotse. One aspect of speech data is its variable length, so features describing utterances will always have one dynamic dimension. It is very difficult to find a project that supports storing ragged arrays like these efficiently, and I was intrigued to find out about Awkward Array.
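For concreteness, here is the kind of ragged structure I mean (a made-up example):

```python
import awkward as ak

# Two utterances with different numbers of frames (ragged first axis)
# but a fixed feature dimension.
features = ak.Array([
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # utterance 1: 3 frames
    [[0.7, 0.8]],                          # utterance 2: 1 frame
])
print(ak.num(features, axis=1))  # [3, 1] frames per utterance
```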
My question is: what is the preferred format for incremental writing of awkward arrays? I saw mentions of HDF5, Arrow, Feather, and Parquet in the docs, but the examples all seemed to assume that the full array is known ahead of time. That is not the case with iterative feature extraction: we might have 10,000 hours of speech and want to store the arrays as we compute them, for later model training.
I appreciate your help and suggestions.