scan_csv().sink_parquet() is significantly slower than using collect().write_parquet(), and the resulting file sizes are different #20815
Labels
A-io-parquet
Area: reading/writing Parquet files
bug
Something isn't working
needs triage
Awaiting prioritization by a maintainer
performance
Performance issues or improvements
python
Related to Python Polars
Checks
Reproducible example
output
scan_csv().sink_parquet() (slow):
collect().write_parquet() (fast):
Additionally, the resulting file sizes are different:
Issue description
I am trying to process multiple CSV files using Polars and save them as Parquet files. The method scan_csv().sink_parquet() is much slower than using collect().write_parquet(), and the resulting Parquet file sizes are different. Here's the code I used:
Expected behavior
The scan_csv().sink_parquet() method should perform similarly to collect().write_parquet() and produce files of similar size.
Installed versions