Reading and writing in batches #381
RAM is not as limitless as file storage, so I was wondering whether it's possible to write and also read data to/from a Parquet file in batches/chunks. Presumably, the following is the way to write in batches:

```csharp
using var writer = ParquetFile.CreateRowWriter<MyDataType>(fileFullPath, properties);
writer.WriteRows(batch1); // demo purposes only
writer.WriteRows(batch2);
writer.WriteRows(batch3);
writer.Close();
```

But how do you read a file that would exceed the available RAM if (when) fully loaded? One way of dealing with this problem is splitting the output into multiple files/parts that each take up a reasonable amount of RAM when loaded individually. The problem is that on my PC it's perfectly fine to have 15 GB loaded into memory, while for others the limit might be 1 GB, and so on. So should I always plan for the worst case, or is there actually a way of reading the data partially?
Replies: 1 comment
Hi @pavlexander, the way this is generally dealt with in Parquet is by splitting data up into row groups. When using the row-oriented API as in your example code, ParquetSharp will internally buffer the data until you create a new row group, and we don't automatically create new row groups between WriteRows calls. So what you probably want instead is something more like:

```csharp
using var writer = ParquetFile.CreateRowWriter<MyDataType>(fileFullPath, properties);
writer.WriteRows(batch1);
writer.StartNewRowGroup();
writer.WriteRows(batch2);
writer.StartNewRowGroup();
writer.WriteRows(batch3);
writer.Close();
```

Then when reading, you can read one row group at a time, eg.:

```csharp
using var reader = ParquetFile.CreateRowReader<MyDataType>(fileFullPath);
for (var rowGroup = 0; rowGroup < reader.FileMetaData.NumRowGroups; ++rowGroup)
{
    var values = reader.ReadRows(rowGroup);
    // Use batch of values
}
```

As you note, the amount of available RAM is going to vary, so generally you would want your row groups to be small enough to easily fit in RAM on most machines. There is some trade-off there though: the smaller your row groups are, the more overhead you incur due to extra metadata and extra reading and writing time. And larger row groups can generally be compressed better, although this is also limited by the data page size. It is possible to read less than a full row group at a time if you use the lower-level API.
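As a rough illustration of that last point, here is a minimal sketch of reading a row group in fixed-size batches with ParquetSharp's lower-level `ParquetFileReader` / `LogicalColumnReader` API. The column index, the `double` value type and the buffer size are placeholder assumptions for illustration, not something prescribed by the answer above:

```csharp
using ParquetSharp;

// Sketch: stream one column of each row group in 4096-value chunks,
// so only a small buffer needs to be resident in memory at a time.
using var fileReader = new ParquetFileReader(fileFullPath);

for (var rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    // Column 0 and double are assumed for illustration; use the types from your schema.
    using var columnReader = rowGroupReader.Column(0).LogicalReader<double>();

    var buffer = new double[4096];
    while (columnReader.HasNext)
    {
        // ReadBatch fills as much of the buffer as it can and returns the count read.
        var valuesRead = columnReader.ReadBatch(buffer);
        // Process buffer[0 .. valuesRead - 1] here.
    }
}

fileReader.Close();
```

Reading column by column like this also means whole rows never need to be materialised, which can help further when tables are wide.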