Enable compression for pandas tables #7

Open
GWW opened this issue Apr 27, 2022 · 1 comment

GWW commented Apr 27, 2022

Hi,

I have noticed that compression is not applied to pandas DataFrames when they are stored with flammkuchen.

Below is a test example in which I store a pandas DataFrame and a NumPy array. According to ddls, the NumPy array ends up compressed, while the DataFrame's array data does not:

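import numpy as np
import pandas as pd
import flammkuchen as fl

# DataFrame with one string column and two integer columns
df = pd.DataFrame({'a': ['B'] * 100000, 'b': np.repeat(1, 100000), 'c': np.repeat(1, 100000)})

# Store the DataFrame and a plain NumPy array in the same file
fl.save('test.h5', {'df': df, 'npa': np.repeat(1, 100000)})

ddls -c --raw test.h5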

/df                       dict
/df/axis0                 array (3,) [bytes8] none
/df/axis0_variety         'regular' (7) [unicode]
/df/axis1                 array (100000,) [int64] none
/df/axis1_variety         'regular' (7) [unicode]
/df/block0_items          array (2,) [bytes8] none
/df/block0_items_variety  'regular' (7) [unicode]
/df/block0_values         array (100000, 2) [int64] none
/df/block1_items          array (1,) [bytes8] none
/df/block1_items_variety  'regular' (7) [unicode]
/df/block1_values         pickled [object]
/df/encoding              'UTF-8' (5) [unicode]
/df/errors                'strict' (6) [unicode]
/df/nblocks               2 [int64]
/df/ndim                  2 [int64]
/df/pandas_type           'frame' (5) [unicode]
/df/pandas_version        '0.15.2' (6) [unicode]
/npa                      array (100000,) [int64] zlib lvl9

I believe the issue lies in this class:

    class _HDFStoreWithHandle(pd.io.pytables.HDFStore):
        def __init__(self, handle):
            self._path = None
            self._complevel = None
            self._complib = None
            self._fletcher32 = False
            self._filters = None

            self._handle = handle

I think pandas does not respect the handle's compression settings. According to the pandas documentation, leaving complevel and complib set to None disables compression. I am not sure of the best way to extract the compression settings from the handle and apply them to this class.
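One possible way to read the settings off the handle is sketched below. It assumes the handle is a PyTables `File` whose `filters` attribute carries the desired compression; if flammkuchen applies its filters per array rather than at file level, that attribute would not reflect them and the filters would have to be passed in explicitly. The helper name `compression_settings` is hypothetical, and the returned keys simply mirror the private attributes set in the class above.

```python
from types import SimpleNamespace


def compression_settings(handle):
    """Derive pandas-style compression settings from handle.filters.

    Hypothetical helper: it only relies on the attributes `complevel`,
    `complib` and `fletcher32`, which a PyTables Filters object exposes.
    """
    filters = getattr(handle, "filters", None)
    complevel = getattr(filters, "complevel", 0) or 0
    if complevel <= 0:
        # No compression configured on the handle: keep the current defaults.
        return {"complevel": None, "complib": None,
                "fletcher32": False, "filters": None}
    return {
        "complevel": complevel,
        "complib": getattr(filters, "complib", None),
        "fletcher32": bool(getattr(filters, "fletcher32", False)),
        "filters": filters,
    }


# Stand-in objects instead of a real PyTables file, for illustration only.
fake_filters = SimpleNamespace(complevel=9, complib="zlib", fletcher32=False)
compressed = compression_settings(SimpleNamespace(filters=fake_filters))
uncompressed = compression_settings(SimpleNamespace(filters=None))
print(compressed["complevel"], compressed["complib"])      # 9 zlib
print(uncompressed["complevel"], uncompressed["complib"])  # None None
```

The resulting values could then be assigned to `_complevel`, `_complib`, `_fletcher32` and `_filters` in `__init__` instead of the hard-coded defaults.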

Thanks in advance

GWW commented Apr 27, 2022

I figured out how to enable compression for pandas by passing the filters parameter. I have created a branch here that appears to solve the issue:

ddls -c --raw test.h5
/df                       dict
/df/axis0                 array (3,) [bytes8] zlib lvl9
/df/axis0_variety         'regular' (7) [unicode]
/df/axis1                 array (100000,) [int64] zlib lvl9
/df/axis1_variety         'regular' (7) [unicode]
/df/block0_items          array (2,) [bytes8] zlib lvl9
/df/block0_items_variety  'regular' (7) [unicode]
/df/block0_values         array (100000, 2) [int64] zlib lvl9
/df/block1_items          array (1,) [bytes8] zlib lvl9
/df/block1_items_variety  'regular' (7) [unicode]
/df/block1_values         pickled [object]
/df/encoding              'UTF-8' (5) [unicode]
/df/errors                'strict' (6) [unicode]
/df/nblocks               2 [int64]
/df/ndim                  2 [int64]
/df/pandas_type           'frame' (5) [unicode]
/df/pandas_version        '0.15.2' (6) [unicode]
/npa                      array (100000,) [int64] zlib lvl9
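For readers without access to the branch, the idea of passing the filters through to the pandas store might look roughly like this. It is a sketch, not the code on the branch: the class name `HDFStoreWithFilters` and the `filters` argument are assumptions, and it relies on pandas consulting the store's private `_filters` attribute when writing fixed-format arrays, which may differ between pandas versions.

```python
import pandas as pd


class HDFStoreWithFilters(pd.HDFStore):
    """Hypothetical variant of _HDFStoreWithHandle that accepts filters."""

    def __init__(self, handle, filters=None):
        # Like the original class, skip HDFStore.__init__ so that no new
        # file is opened; the already open PyTables handle is reused.
        self._path = None
        self._handle = handle
        # `filters` is expected to be a tables.Filters instance (or None).
        self._filters = filters
        self._complevel = getattr(filters, "complevel", None)
        self._complib = getattr(filters, "complib", None)
        self._fletcher32 = bool(getattr(filters, "fletcher32", False))


# Intended use (assumes PyTables is installed and `handle` is an open file):
#   import tables
#   filters = tables.Filters(complevel=9, complib="zlib")
#   store = HDFStoreWithFilters(handle, filters=filters)
```

With `filters=None` the class behaves like the original one, so existing callers would not be affected.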
