parquet format support #106

weibingo · 2018-11-17T09:16:13Z

Hi @mtth, is there any plan to support parquet data format?

Parquet files carry their own schema, so they can be read directly into pandas, and written the same way.
Python Parquet modules: fastparquet, pyarrow

mtth · 2018-11-20T09:12:16Z

Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.

wilberh · 2020-08-07T16:22:51Z

> Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.

@mtth - do you have sample code for this approach?
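
A minimal sketch of what the manual approach could look like on the write side. It assumes `client` is an HdfsCLI client (for example `hdfs.InsecureClient`) and that the installed pandas accepts a file-like target in `DataFrame.to_parquet`. The helper name `write_parquet` is hypothetical and not part of the library:

```python
import io


def write_parquet(client, hdfs_path, dataframe):
    """Serialize `dataframe` to Parquet in memory, then upload the bytes.

    Hypothetical helper, not part of HdfsCLI. Assumes `client.write`
    accepts `data` and `overwrite` keyword arguments, as the HdfsCLI
    client does, and that `dataframe.to_parquet` can write to a
    file-like object.
    """
    buffer = io.BytesIO()
    dataframe.to_parquet(buffer)      # manual serialization step
    payload = buffer.getvalue()
    client.write(hdfs_path, data=payload, overwrite=True)
    return len(payload)               # number of bytes uploaded


# Possible usage (placeholder URL, user and path):
#   from hdfs import InsecureClient
#   client = InsecureClient("http://namenode:50070", user="someuser")
#   write_parquet(client, "/data/example.parquet", df)
```

The whole file is held in memory before upload, so this is only suitable for data that fits in RAM. Reading would mirror it: fetch the bytes with `client.read` and hand a `BytesIO` to `pd.read_parquet`.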

ghost · 2021-11-30T08:49:59Z

To read a Parquet file from HDFS into a pandas DataFrame, I currently read the whole file into a BytesIO buffer first and then pass that buffer to pandas:

from io import BytesIO
import pandas as pd

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    buffer = BytesIO(hdfs_reader.read())  # pull the entire file into memory
    dataframe = pd.read_parquet(buffer)

If I try to pass the hdfs_reader to pandas directly, like

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    dataframe = pd.read_parquet(hdfs_reader)

I get the following error:

Traceback (most recent call last):
  File "...", line 940, in pandas_from_parquet
    dataframe = pd.read_parquet(hdfs_reader)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 1076, in read_table
    pf = ParquetFile(source, metadata=metadata)
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 102, in __init__
    self.reader.open(source, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 639, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: seek

Is there a way to read the Parquet file into pandas directly, without first reading it completely into a BytesIO object?
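
Some context on the `seek` error: Parquet keeps its metadata in a footer at the end of the file, so a reader has to jump around in the file, while the object yielded by `hdfs_client.read` appears to be a forward-only HTTP stream. One possible workaround, sketched below, is to spool the stream to a local temporary file instead of a `BytesIO`. It still downloads the whole file, but keeps it on disk rather than in memory. The helper name `spool_to_seekable` is hypothetical:

```python
import shutil
import tempfile


def spool_to_seekable(reader, chunk_size=1024 * 1024):
    """Copy a forward-only stream into an anonymous temp file and rewind it.

    Hypothetical helper. `reader` only needs a `read(n)` method. The
    returned file object is seekable and is deleted when it is closed.
    """
    spooled = tempfile.TemporaryFile()
    try:
        shutil.copyfileobj(reader, spooled, chunk_size)  # chunked copy
        spooled.seek(0)                                  # rewind for the consumer
    except Exception:
        spooled.close()
        raise
    return spooled


# Possible usage with the names from the snippet above:
#   with hdfs_client.read(hdfs_path_file) as hdfs_reader:
#       with spool_to_seekable(hdfs_reader) as local_copy:
#           dataframe = pd.read_parquet(local_copy)
```

This trades memory for local disk space and does not give true random access over HDFS. Reading only parts of a remote file would need a seekable file-like wrapper, which this sketch does not attempt.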

mtth added the enhancement label Nov 20, 2018

mtth added the help wanted label Apr 6, 2019
