Reading and writing in batches #381
RAM is not as limitless as file storage, so I was wondering whether it's possible to write and also read data to/from a Parquet file in batches/chunks. Presumably, the following is the way to write in batches:

```csharp
using var writer = ParquetFile.CreateRowWriter<MyDataType>(fileFullPath, properties);
writer.WriteRows(batch1); // demo purposes only
writer.WriteRows(batch2);
writer.WriteRows(batch3);
writer.Close();
```

But how do you read a file that would exceed the available RAM if (when) fully loaded? One way of dealing with this problem is splitting the output into multiple files/parts that each take up a reasonable amount of RAM when loaded individually. The problem is that on my PC it's perfectly fine to have 15 GB loaded into memory, while for others the limit might be 1 GB, and so on. So should I always plan for the worst case, or is there actually a way of reading the data partially?
Replies: 1 comment
Hi @pavlexander, the way this is generally dealt with in Parquet is by splitting data up into row groups. When using the row-oriented API as in your example code, ParquetSharp will internally buffer the data until you create a new row group, and we don't automatically create new row groups between WriteRows calls. So what you probably want instead is something more like:

```csharp
using var writer = ParquetFile.CreateRowWriter<MyDataType>(fileFullPath, properties);
writer.WriteRows(batch1);
writer.StartNewRowGroup();
writer.WriteRows(batch2);
writer.StartNewRowGroup();
writer.WriteRows(batch3);
writer.Close();
```

Then when reading, you can read one row group at a time, eg.:

```csharp
using var reader = ParquetFile.CreateRowReader<MyDataType>(fileFullPath);
for (var rowGroup = 0; rowGroup < reader.FileMetaData.NumRowGroups; ++rowGroup)
{
    var values = reader.ReadRows(rowGroup);
    // Use batch of values
}
```

As you note, the amount of available RAM is going to vary, so generally you would want your row groups to be small enough to easily fit in RAM on most machines. There is some trade-off there though: the smaller your row groups are, the more overhead you incur due to extra metadata and extra reading and writing time. And larger row groups can generally be compressed better, although this is also limited by the data page size. It is possible to read less than a full row group at a time if you use the lower-level API.
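As a rough illustration of that last point, here is a minimal sketch of reading a row group in fixed-size batches with ParquetSharp's lower-level `ParquetFileReader` / `LogicalColumnReader` API. The column index, the `double` value type and the buffer size are placeholder assumptions for illustration, not something prescribed by the answer above:

```csharp
using ParquetSharp;

// Sketch: stream one column of each row group in 4096-value chunks,
// so only a small buffer needs to be resident in memory at a time.
using var fileReader = new ParquetFileReader(fileFullPath);

for (var rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    // Column 0 and double are assumed for illustration; use the types from your schema.
    using var columnReader = rowGroupReader.Column(0).LogicalReader<double>();

    var buffer = new double[4096];
    while (columnReader.HasNext)
    {
        // ReadBatch fills as much of the buffer as it can and returns the count read.
        var valuesRead = columnReader.ReadBatch(buffer);
        // Process buffer[0 .. valuesRead - 1] here.
    }
}

fileReader.Close();
```

Reading column by column like this also means whole rows never need to be materialised, which can help further when tables are wide.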