performance, file size issue #377
Unanswered · pavlexander asked this question in Q&A · 1 comment · 11 replies
Comment:
Hi @pavlexander, I think the factor to consider is the default column encodings used by each library.
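(A minimal sketch, not from the thread, of what the reply is pointing at: ParquetSharp forwards parquet-cpp's writer properties, so the default encodings can be overridden per column when writing. The column name "Price" is hypothetical.)

```csharp
using ParquetSharp;

// Override the defaults the comment refers to: opt one column out of
// dictionary encoding while keeping Snappy compression.
using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Snappy)
    .DisableDictionary("Price") // hypothetical column name
    .Build();
```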
Original question:
Hi!
Just started using this library and am trying to compare its performance to its main competitor, Parquet.Net (which is also referenced on the main page)!
the data
2_673_685 rows
parquetSharp data save approaches
I have so far managed to save the data using 2 approaches that seemingly yield the same results.
approach 1:
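(The original snippet did not survive the page capture. Below is a minimal sketch of ParquetSharp's Column-based API, which is one plausible reading of approach 1; the three-column schema, file name, and sample values are hypothetical.)

```csharp
using System;
using ParquetSharp;

// Hypothetical schema; the actual columns from the question are unknown.
var columns = new Column[]
{
    new Column<DateTime>("Timestamp"),
    new Column<double>("Price"),
    new Column<double>("Quantity"),
};

// Sample data standing in for the real 2_673_685 rows.
var timestamps = new[] { DateTime.UtcNow };
var prices = new[] { 100.0 };
var quantities = new[] { 1.0 };

// Snappy compression, as stated in the question.
using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Snappy)
    .Build();

using var file = new ParquetFileWriter("data_sharp.parquet", columns, properties);
using var rowGroup = file.AppendRowGroup();

// One WriteBatch call per column, in schema order.
using (var writer = rowGroup.NextColumn().LogicalWriter<DateTime>()) writer.WriteBatch(timestamps);
using (var writer = rowGroup.NextColumn().LogicalWriter<double>()) writer.WriteBatch(prices);
using (var writer = rowGroup.NextColumn().LogicalWriter<double>()) writer.WriteBatch(quantities);

file.Close();
```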
approach 2:
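(Also reconstructed, under the assumption that approach 2 means ParquetSharp's low-level schema API, where the schema is declared node by node; the columns mirror the sketch above.)

```csharp
using ParquetSharp;
using ParquetSharp.Schema;

// The same hypothetical columns, declared explicitly. Logical and physical
// types must be matched by hand here, which is where mistakes creep in.
var schema = new GroupNode("schema", Repetition.Required, new Node[]
{
    new PrimitiveNode("Timestamp", Repetition.Required,
        LogicalType.Timestamp(true, TimeUnit.Micros), PhysicalType.Int64),
    new PrimitiveNode("Price", Repetition.Required, LogicalType.None(), PhysicalType.Double),
    new PrimitiveNode("Quantity", Repetition.Required, LogicalType.None(), PhysicalType.Double),
});

using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Snappy)
    .Build();

using var file = new ParquetFileWriter("data_sharp2.parquet", schema, properties);
using var rowGroup = file.AppendRowGroup();
// Column writing then proceeds exactly as in approach 1.
file.Close();
```

If this reading is right, both constructors end up building the same underlying schema nodes, which would be consistent with the two approaches yielding identical files.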
quick summary and a side question
Personally, I don't see much use for approach 2, since it's more complex and more error-prone. Is there anything I gain from it compared to approach 1?
parquet.net approach
Regardless, moving back to the main topic. Here is the code I am using to save the data with Parquet.Net:
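(This block was also lost in the capture; the following is a minimal sketch of the Parquet.Net 4.x column-writing API, reusing the same hypothetical schema as above.)

```csharp
using System;
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Hypothetical schema mirroring the ParquetSharp sketches above.
var schema = new ParquetSchema(
    new DataField<DateTime>("Timestamp"),
    new DataField<double>("Price"),
    new DataField<double>("Quantity"));

var timestamps = new[] { DateTime.UtcNow };
var prices = new[] { 100.0 };
var quantities = new[] { 1.0 };

using var stream = File.Create("data_net.parquet");
using var writer = await ParquetWriter.CreateAsync(schema, stream);
writer.CompressionMethod = CompressionMethod.Snappy; // same codec as above

using var rowGroup = writer.CreateRowGroup();
await rowGroup.WriteColumnAsync(new DataColumn(schema.DataFields[0], timestamps));
await rowGroup.WriteColumnAsync(new DataColumn(schema.DataFields[1], prices));
await rowGroup.WriteColumnAsync(new DataColumn(schema.DataFields[2], quantities));
```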
Note that in all three approaches the Snappy compression method is used.
results
ParquetSharp: ~800ms (817). File size: 81gb (approach 1).
Parquet.net: ~800ms (788). File size: 31gb.

results interpretation
As you can see, both libraries take almost the same amount of time to save the file. This is OK.
But the main issue is the file size. Is there a reason why the ParquetSharp result takes that much space?

I have extracted the metadata with ParquetViewer and compared both files. It seems like there are quite a lot of differences. Some of them are (left: parquet.net, right: parquetSharp):

schema version:
encodings:
something else
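(For completeness, a sketch, not from the thread, of checking the same metadata programmatically with ParquetSharp instead of ParquetViewer; the file name is hypothetical.)

```csharp
using System;
using ParquetSharp;

// Inspect what actually ended up in a file: per column chunk, the encodings,
// codec, and compressed vs uncompressed sizes.
using var reader = new ParquetFileReader("data_sharp.parquet");
var meta = reader.FileMetaData;
Console.WriteLine($"created by: {meta.CreatedBy}");

using var rowGroup = reader.RowGroup(0);
for (int i = 0; i < meta.NumColumns; ++i)
{
    var chunk = rowGroup.MetaData.GetColumnChunkMetaData(i);
    Console.WriteLine(
        $"column {i}: encodings=[{string.Join(", ", chunk.Encodings)}] " +
        $"codec={chunk.Compression} " +
        $"compressed={chunk.TotalCompressedSize} uncompressed={chunk.TotalUncompressedSize}");
}
```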
package versions
<PackageReference Include="Parquet.Net" Version="4.16.4" />
(Parquet.Net version 4.16.4 (build 42339e08d7520ef1301b27689e1d1c02d91b058e))
<PackageReference Include="ParquetSharp" Version="12.0.1" />
(parquet-cpp-arrow version 12.0.1)