
Error when writing Parquet from scan_csv with encoded files (iso-8859-1 to utf-8) #20834

Open
alvaroDatlo opened this issue Jan 21, 2025 · 1 comment
Labels: enhancement (New feature or an improvement of an existing feature)

@alvaroDatlo

Description

I'm encountering an issue when using Polars to process a CSV file with a non-UTF-8 encoding (iso-8859-1) and then writing the result to a Parquet file. Specifically, I use scan_csv to read the CSV file through a codecs.EncodedFile wrapper and then try to save it as Parquet. This fails with ComputeError: not yet implemented: Streaming scanning of in-memory buffers when writing via sink_parquet.

Steps to Reproduce:

1. Create a CSV file with non-UTF-8 encoding (iso-8859-1).
2. Use Polars' scan_csv to read the CSV file lazily.
3. Attempt to write the resulting LazyFrame to a Parquet file using .sink_parquet().
4. Encounter the error: ComputeError: not yet implemented: Streaming scanning of in-memory buffers.

import polars as pl
import codecs

# Writing a CSV file with ISO-8859-1 encoding
with open('arquivo.csv', mode="w", encoding="iso-8859-1", newline='') as f:
    f.write(";\ncol1;col2\n1;pássaro\n2;você\n3;maçã")

# Attempting to process the CSV file with scan_csv and write to Parquet
with open('arquivo.csv', "rb") as f:
    reader = codecs.EncodedFile(f, data_encoding="utf-8", file_encoding="iso-8859-1")
    lf = pl.scan_csv(source=reader, separator=";")
    lf.sink_parquet("arquivo.parquet")  # This raises the error
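
One point of comparison: eagerly transcoding the file to UTF-8 on disk first sidesteps the in-memory buffer entirely, since scan_csv can stream from a real file path. A minimal sketch, assuming a temporary-file approach with an illustrative 1 MiB chunk size:

import polars as pl
from tempfile import NamedTemporaryFile

# Transcode iso-8859-1 -> utf-8 into a real file on disk so that
# scan_csv can stream from a path instead of an in-memory buffer.
with open("arquivo.csv", "r", encoding="iso-8859-1") as src, NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False
) as dst:
    for chunk in iter(lambda: src.read(1 << 20), ""):  # 1 MiB text chunks
        dst.write(chunk)
    tmp_path = dst.name

# Streaming scan + sink works once the source is a path.
pl.scan_csv(tmp_path, separator=";").sink_parquet("arquivo.parquet")

(With delete=False the temporary file has to be removed manually afterwards.)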
alvaroDatlo added the enhancement label on Jan 21, 2025
@deanm0000 (Collaborator)

While you're waiting for that to be implemented, pyarrow's ParquetWriter lets you write Parquet files in batches. Unfortunately, even the batched CSV reader won't take a buffer, only a real file, so you'd have to do the batching manually, like this:

import polars as pl
from pyarrow import parquet as pq
import codecs
from tempfile import NamedTemporaryFile


BATCH_SIZE = 1000  # lines per batch; this can likely be much bigger
INPUT_FILE = "arquivo.csv"
OUTPUT_FILE = "arquivo.parquet"
SEPARATOR = ";"

with open(INPUT_FILE, "rb") as f, codecs.EncodedFile(
    f, data_encoding="utf-8", file_encoding="iso-8859-1"
) as reader:
    header = None
    writer = None

    while True:
        # Read line by line: the codecs stream readers document the
        # sizehint argument of readlines() as ignored, so an explicit
        # loop is needed to keep the batches bounded.
        lines = []
        for _ in range(BATCH_SIZE):
            line = reader.readline()
            if not line:
                break
            lines.append(line)
        if not lines:
            break
        # read_csv won't take this buffer directly either, so round-trip
        # each batch through a real temporary file.
        with NamedTemporaryFile("wb+") as ff:
            ff.write(b"".join(lines))
            ff.seek(0)
            df = pl.read_csv(
                ff,
                has_header=header is None,  # only the first batch carries the header row
                new_columns=header,
                separator=SEPARATOR,
            )
            if writer is None:
                # Lock in the schema from the first batch.
                writer = pq.ParquetWriter(OUTPUT_FILE, schema=df.to_arrow().schema)
            writer.write(df.to_arrow())
            if header is None:
                header = df.columns

    if writer is not None:
        writer.close()
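
As a quick sanity check after running this (or the transcoding sketch above), the output can be read back to confirm the re-encoded text survived the round trip:

import polars as pl

# Read the written Parquet back; accented characters should be intact.
print(pl.read_parquet("arquivo.parquet"))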
