
Error when writing Parquet from scan_csv with encoded files (iso-8859-1 to utf-8) #20834

Open
alvaroDatlo opened this issue Jan 21, 2025 · 1 comment
Labels: enhancement (New feature or an improvement of an existing feature)

@alvaroDatlo

Description

I'm encountering an issue when using Polars to process a CSV file with a non-UTF-8 encoding (iso-8859-1) and then writing the result to a Parquet file. Specifically, I use scan_csv to read the CSV file through a codecs.EncodedFile wrapper and then try to save it as Parquet. This fails with ComputeError: not yet implemented: Streaming scanning of in-memory buffers when writing via sink_parquet.

Steps to Reproduce:

1. Create a CSV file with non-UTF-8 encoding (iso-8859-1).
2. Use Polars' scan_csv to read the CSV file lazily.
3. Attempt to write the resulting LazyFrame to a Parquet file using .sink_parquet().
4. Encounter the error: ComputeError: not yet implemented: Streaming scanning of in-memory buffers.

import polars as pl
import codecs

# Writing a CSV file with ISO-8859-1 encoding
with open('arquivo.csv', mode="w", encoding="iso-8859-1", newline='') as f:
    f.write(";\ncol1;col2\n1;pássaro\n2;você\n3;maçã")

# Attempting to process the CSV file with scan_csv and write to Parquet
with open('arquivo.csv', "rb") as f:
    reader = codecs.EncodedFile(f, data_encoding="utf-8", file_encoding="iso-8859-1")
    lf = pl.scan_csv(source=reader, separator=";")
    lf.sink_parquet("arquivo.parquet")  # This raises the error
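
One point of comparison: eagerly transcoding the file to UTF-8 on disk first sidesteps the in-memory buffer entirely, since scan_csv can stream from a real file path. A minimal sketch, assuming a temporary-file approach with an illustrative 1 MiB chunk size:

import polars as pl
from tempfile import NamedTemporaryFile

# Transcode iso-8859-1 -> utf-8 into a real file on disk so that
# scan_csv can stream from a path instead of an in-memory buffer.
with open("arquivo.csv", "r", encoding="iso-8859-1") as src, NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False
) as dst:
    for chunk in iter(lambda: src.read(1 << 20), ""):  # 1 MiB text chunks
        dst.write(chunk)
    tmp_path = dst.name

# Streaming scan + sink works once the source is a path.
pl.scan_csv(tmp_path, separator=";").sink_parquet("arquivo.parquet")

(With delete=False the temporary file has to be removed manually afterwards.)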
alvaroDatlo added the enhancement label on Jan 21, 2025
@deanm0000 (Collaborator)

While you're waiting for that to be implemented, pyarrow's ParquetWriter lets you write Parquet files in batches. Unfortunately, even the batched CSV reader won't take a buffer, only a real file, so you'd have to do the batching manually, like this:

import polars as pl
from pyarrow import parquet as pq
import codecs
from tempfile import NamedTemporaryFile


BATCH_SIZE = 1000  # lines per batch; this can likely be much bigger
INPUT_FILE = "arquivo.csv"
OUTPUT_FILE = "arquivo.parquet"
SEPARATOR = ";"

with open(INPUT_FILE, "rb") as f, codecs.EncodedFile(
    f, data_encoding="utf-8", file_encoding="iso-8859-1"
) as reader:
    header = None
    writer = None

    while True:
        # Read line by line: the codecs stream readers document the
        # sizehint argument of readlines() as ignored, so an explicit
        # loop is needed to keep the batches bounded.
        lines = []
        for _ in range(BATCH_SIZE):
            line = reader.readline()
            if not line:
                break
            lines.append(line)
        if not lines:
            break
        # read_csv won't take this buffer directly either, so round-trip
        # each batch through a real temporary file.
        with NamedTemporaryFile("wb+") as ff:
            ff.write(b"".join(lines))
            ff.seek(0)
            df = pl.read_csv(
                ff,
                has_header=header is None,  # only the first batch carries the header row
                new_columns=header,
                separator=SEPARATOR,
            )
            if writer is None:
                # Lock in the schema from the first batch.
                writer = pq.ParquetWriter(OUTPUT_FILE, schema=df.to_arrow().schema)
            writer.write(df.to_arrow())
            if header is None:
                header = df.columns

    if writer is not None:
        writer.close()
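
As a quick sanity check after running this (or the transcoding sketch above), the output can be read back to confirm the re-encoded text survived the round trip:

import polars as pl

# Read the written Parquet back; accented characters should be intact.
print(pl.read_parquet("arquivo.parquet"))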
