I'm encountering an issue when using Polars to read a CSV file with a non-UTF-8 encoding (ISO-8859-1) and write the result to a Parquet file. Since scan_csv doesn't support ISO-8859-1 directly, I wrap the file in a codecs.EncodedFile recoder and pass that buffer to scan_csv. Calling .sink_parquet() on the resulting LazyFrame then fails with ComputeError: not yet implemented: Streaming scanning of in-memory buffers.
Steps to Reproduce:
1. Create a CSV file with non-UTF-8 encoding (ISO-8859-1).
2. Use Polars' scan_csv to read the CSV file lazily, via a codecs.EncodedFile buffer.
3. Attempt to write the resulting LazyFrame to a Parquet file using .sink_parquet().
4. Observe the error: ComputeError: not yet implemented: Streaming scanning of in-memory buffers.
import polars as pl
import codecs

# Writing a CSV file with ISO-8859-1 encoding
with open("arquivo.csv", mode="w", encoding="iso-8859-1", newline="") as f:
    f.write(";\ncol1;col2\n1;pássaro\n2;você\n3;maçã")

# Attempting to process the CSV file with scan_csv and write to Parquet
with open("arquivo.csv", "rb") as f:
    reader = codecs.EncodedFile(f, data_encoding="utf-8", file_encoding="iso-8859-1")
    lf = pl.scan_csv(source=reader, separator=";")
    lf.sink_parquet("arquivo.parquet")  # This raises the error
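One possible workaround (a sketch of my own, assuming the error really is specific to in-memory sources and not to the encoding itself): transcode the whole file to a temporary UTF-8 file on disk first, then hand scan_csv a real path so sink_parquet can stream as usual. The utf8_path name and delete=False temp-file handling are illustrative choices, not anything confirmed in this thread.

import codecs
import shutil
import tempfile

import polars as pl

# Sketch: recode the ISO-8859-1 file to a real UTF-8 file on disk, then scan it
# by path so the streaming engine is never handed an in-memory buffer.
with open("arquivo.csv", "rb") as src, tempfile.NamedTemporaryFile(
    "wb", suffix=".csv", delete=False
) as dst:
    shutil.copyfileobj(
        codecs.EncodedFile(src, data_encoding="utf-8", file_encoding="iso-8859-1"),
        dst,
    )
    utf8_path = dst.name  # file persists (delete=False) so Polars can read it by path

pl.scan_csv(utf8_path, separator=";").sink_parquet("arquivo.parquet")

This trades an extra on-disk copy of the input for keeping the rest of the pipeline lazy and streamed.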
While you're waiting for that to be implemented, pyarrow's ParquetWriter lets you write Parquet files in batches. Unfortunately, even Polars' batched CSV reader won't take a buffer, only a real file, so you'd have to do the batching manually, like this:
import polars as pl
from pyarrow import parquet as pq
import codecs
from tempfile import NamedTemporaryFile

BATCH_SIZE = 1000  # This should be able to be bigger
INPUT_FILE = "arquivo.csv"
OUTPUT_FILE = "arquivo.parquet"
SEPARATOR = ";"

with open(INPUT_FILE, "rb") as f, codecs.EncodedFile(
    f, data_encoding="utf-8", file_encoding="iso-8859-1"
) as reader:
    header = None
    writer = None
    while True:
        lines = reader.readlines(BATCH_SIZE)
        if len(lines) == 0:
            break
        # read_csv needs a real file, so spill the recoded batch to a temp file
        with NamedTemporaryFile("wb+") as ff:
            ff.write(b"".join(lines))
            ff.seek(0)
            # Only the first batch carries a header; reuse its names afterwards
            has_header = True if header is None else False
            df = pl.read_csv(
                ff, has_header=has_header, new_columns=header, separator=SEPARATOR
            )
        # Create the writer lazily, from the first batch's Arrow schema
        if writer is None:
            writer = pq.ParquetWriter(OUTPUT_FILE, schema=df.to_arrow().schema)
        writer.write(df.to_arrow())
        if header is None:
            header = df.columns

if writer is not None:
    writer.close()
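A couple of notes on the sketch above (my reading, not the commenter's words): Python's readlines() treats its argument as an approximate size hint in bytes, not a line count, so BATCH_SIZE = 1000 yields roughly 1 KB batches, which is why the comment says it can be bigger. Also, because the ParquetWriter is created from the first batch's Arrow schema, every later batch must parse to that same schema; small batches can cause read_csv to infer different dtypes, in which case writer.write() will raise, so larger batches or explicit dtypes are safer.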