
parquet format support #106

Open
weibingo opened this issue Nov 17, 2018 · 3 comments

Comments

@weibingo

Hi @mtth, is there any plan to support parquet data format?

Parquet files are self-describing (the schema is stored in the file itself), so they can be read directly into pandas and written back the same way.
Python Parquet modules: fastparquet, pyarrow

@mtth
Owner

mtth commented Nov 20, 2018

Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.
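A minimal sketch of what that manual serialization could look like on the write side. This is an illustration under assumptions, not code from the library: `write_with` and `serialize` are hypothetical names, and the `data=` and `overwrite=` keywords of `client.write` are assumed to behave as described in the HdfsCLI documentation.

```python
from io import BytesIO


def write_with(client, hdfs_path, serialize, overwrite=False):
    """Serialize into an in-memory buffer, then upload the bytes.

    `serialize` is any callable that writes to a binary file-like object,
    for example a bound `dataframe.to_parquet`.
    """
    buffer = BytesIO()
    serialize(buffer)
    # Assumption: client.write accepts the payload via `data=`.
    client.write(hdfs_path, data=buffer.getvalue(), overwrite=overwrite)


# Hypothetical usage with pandas (needs pyarrow or fastparquet installed):
#   write_with(client, "/data/example.parquet", dataframe.to_parquet, overwrite=True)
```

The whole file is held in memory before upload, so this only suits dataframes that fit comfortably in RAM.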

@wilberh

wilberh commented Aug 7, 2020

> Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.

@mtth - do you have sample code for this approach?

@ghost

ghost commented Nov 30, 2021

To read a parquet file from HDFS into a pandas dataframe, I currently read the whole file into a BytesIO buffer first and then pass that buffer to pandas.

from io import BytesIO
import pandas as pd

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    buffer = BytesIO(hdfs_reader.read())  # loads the entire file into memory
    dataframe = pd.read_parquet(buffer)

If I try to pass the hdfs_reader to pandas directly, like this:

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    dataframe = pd.read_parquet(hdfs_reader)

I get the following error:

Traceback (most recent call last):
  File "...", line 940, in pandas_from_parquet
    dataframe = pd.read_parquet(hdfs_reader)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 1076, in read_table
    pf = ParquetFile(source, metadata=metadata)
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 102, in __init__
    self.reader.open(source, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 639, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: seek

Is there a way to read the parquet file into Pandas directly without reading it completely to a BytesIO object first?
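Some context on the error: a parquet file keeps its metadata in a footer at the end of the file, so the reader has to seek, and the streaming reader returned by `hdfs_client.read` apparently cannot. One way to avoid holding the whole file in memory is to copy it to a local temporary file in chunks and let pandas open that path. The following is only a sketch under assumptions: `read_via_tempfile` and `loader` are hypothetical names, and it assumes that `client.read(..., chunk_size=N)` yields byte chunks, as described in the HdfsCLI documentation.

```python
import os
import tempfile


def read_via_tempfile(client, hdfs_path, loader, chunk_size=1 << 20):
    """Copy `hdfs_path` to a local temp file in chunks, then call `loader(path)`.

    `loader` is any callable that takes a local file path,
    for example `pd.read_parquet`.
    """
    fd, local_path = tempfile.mkstemp(suffix=".parquet")
    try:
        with os.fdopen(fd, "wb") as local_file:
            # Assumption: a positive chunk_size makes read() yield byte chunks.
            with client.read(hdfs_path, chunk_size=chunk_size) as chunks:
                for chunk in chunks:
                    local_file.write(chunk)
        return loader(local_path)
    finally:
        # Remove the local copy whether or not loading succeeded.
        os.remove(local_path)


# Hypothetical usage:
#   dataframe = read_via_tempfile(hdfs_client, hdfs_path_file, pd.read_parquet)
```

This trades memory for local disk space. The file is still transferred in full, so it does not give true random access to the remote file.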
