serialization of file-like objects #830

Open
3 tasks done
moradology opened this issue Jul 30, 2024 · 5 comments

Comments

@moradology

Problem description

I'd be curious to get opinions on whether serialization/deserialization should be supported for the file-like objects at the core of this library. This would be useful for distributed processing workflows that pass around either the file-like objects themselves or objects constructed from them. The latter is the case for xarray, which is the use case I'm specifically interested in. If an xarray dataset holds on to a file-like object that is not serializable, the dataset is not serializable either.

Steps/code to reproduce the problem

  1. the file-like object itself
import pickle
import smart_open

http_file = smart_open.open('http://example.com/index.html')
pickle.dumps(http_file)

The above throws NotImplementedError: object proxy must define __reduce_ex__()

  2. the file-like object blowing up downstream object serialization
import pickle
import smart_open
import xarray as xr

netcdf_path = "https://some/netcdf/path.nc"
sf = smart_open.open(netcdf_path, 'rb')
ds = xr.open_dataset(sf)

pickle.dumps(ds)

This one throws TypeError: cannot pickle '_io.BufferedReader' object
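
The second failure does not appear to need xarray or a remote URL to reproduce. A minimal sketch, assuming only the standard library and a throwaway temporary file (the helper name `is_picklable` is hypothetical), suggests that a plain buffered binary reader already refuses to pickle:

```python
import os
import pickle
import tempfile


def is_picklable(obj):
    """Return (True, None) if obj pickles, else (False, the exception raised)."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # pickle may raise TypeError, PicklingError, ...
        return False, exc
    return True, None


# Create a small local file so no network access is needed.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    local_path = tmp.name

with open(local_path, "rb") as reader:  # reader is an io.BufferedReader
    ok, error = is_picklable(reader)

os.remove(local_path)
print(ok, type(error).__name__)
```

If that holds, any object that keeps such a reader as an attribute would inherit the same limitation under the default pickling behavior.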

Versions

macOS-14.4.1-arm64-arm-64bit
Python 3.11.9 (main, May 22 2024, 12:34:58) [Clang 15.0.0 (clang-1500.3.9.4)]
smart_open 7.0.4

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@piskvorky
Owner

piskvorky commented Jul 31, 2024

I don't think serializing streams is even theoretically possible in general. Or rather, where it is possible, it is the business of the file-like object itself to support Python's pickle protocol, serializing its internal stream state somehow.

But open to ideas, CC @mpenkov :)

@ddelange
Contributor

BufferedReader (only used in the smart_open.compression module) is thread-safe (ref), but thread-safe != fork-safe, so I don't think the io classes are made for multiprocessing.

I would suggest reading into a tempfile (or shared_memory if filesize allows), and sharing the filename/mem-pointer across processes.
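
The tempfile suggestion could look roughly like the sketch below. It assumes the payload fits on local disk and that all workers share that filesystem, so it suits multiprocessing on one machine rather than a cluster. `spool_to_tempfile` and `read_range` are hypothetical helper names, and an in-memory `io.BytesIO` stands in for the remote stream.

```python
import io
import os
import shutil
import tempfile


def spool_to_tempfile(stream, chunk_size=1024 * 1024):
    """Copy a readable binary stream to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        shutil.copyfileobj(stream, tmp, chunk_size)
        return tmp.name


def read_range(path, offset, length):
    """Worker-side helper: open the shared path and read one byte range."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a stream returned by smart_open.open(..., 'rb').
remote_like = io.BytesIO(b"0123456789abcdef")
path = spool_to_tempfile(remote_like)

# Only the path (a plain str) needs to cross the process boundary,
# e.g. as an argument to multiprocessing.Pool.starmap(read_range, ...).
chunk = read_range(path, 4, 6)
os.remove(path)
print(chunk)
```

Because each worker opens its own handle, no io object is ever shared or pickled.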

@moradology
Author

Good points, to be sure. I'm not proposing storage of the bytes so much as passing around the file-like objects as references (perhaps keeping seek information, but not even necessarily). This would enable objects that are opened and then potentially passed to xarray to be moved between machines inside Dask/Spark/etc. clusters nicely. Obviously this wouldn't work for disk-local file access. For cloud providers, things online, etc., though, serializing the appropriate configs should be sufficient to recreate the file-like objects on the other side, which can then seek and read byte ranges or what have you.
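
One way to read this proposal is as a small, picklable reference that records how to open the resource rather than the open handle itself. The sketch below is an assumption about what that could look like, not smart_open's API: `FileRef` and its fields are hypothetical, and the opener is supplied at reopen time. The built-in `open` on a local temp file is used here so the example runs offline; `smart_open.open` with a remote URI would be the intended substitute.

```python
import os
import pickle
import tempfile
from dataclasses import dataclass, field


@dataclass
class FileRef:
    """Hypothetical picklable description of how to reopen a resource."""
    uri: str
    mode: str = "rb"
    position: int = 0
    # Extra keyword arguments for the opener (e.g. backend configuration).
    open_kwargs: dict = field(default_factory=dict)

    def reopen(self, opener=open):
        """Open the resource again and restore the recorded offset."""
        handle = opener(self.uri, self.mode, **self.open_kwargs)
        if self.position:
            handle.seek(self.position)
        return handle


# Local stand-in for a remote object.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header|payload")
    local_path = tmp.name

ref = FileRef(uri=local_path, position=7)

# The reference survives a pickle round trip; no open handle is involved.
restored = pickle.loads(pickle.dumps(ref))

with restored.reopen() as handle:
    data = handle.read()

os.remove(local_path)
print(data)
```

Keeping the reference separate from the handle means nothing stateful has to be serialized, at the cost of one extra open call on the receiving side.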

@ddelange
Contributor

you could try serialising with dill. afaik dask uses/used it. maybe you can adopt it in xarray?

@moradology
Author

For sure, dill can solve the issue in some instances, but it doesn't seem to work in this case. I was thinking it might be possible to manually specify the ser/de behavior via a couple of magic methods (example properties here, but they would likely be specific to each backend):

    def __getstate__(self):
        # Called when pickling
        return {'url': self.url, 'position': self._position}

    def __setstate__(self, state):
        # Called when unpickling
        self.__init__(state['url'])
        self.seek(state['position'])
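
To make the idea concrete without touching smart_open internals, here is a self-contained sketch. `ReopenableReader` is a hypothetical wrapper, not an existing class. It assumes the resource can be reopened from its URL alone and that the backend is seekable, and it uses the built-in `open` on a temp file so it runs offline. Unlike the snippet above, `__setstate__` reopens the handle directly instead of calling `__init__`.

```python
import os
import pickle
import tempfile


class ReopenableReader:
    """Hypothetical read-only wrapper that pickles as (url, position)."""

    def __init__(self, url):
        self.url = url
        self._handle = self._open(url)

    @staticmethod
    def _open(url):
        # Assumption: built-in open() stands in for the real backend opener.
        return open(url, "rb")

    def read(self, size=-1):
        return self._handle.read(size)

    def tell(self):
        return self._handle.tell()

    def close(self):
        self._handle.close()

    def __getstate__(self):
        # Called when pickling: keep only what is needed to reopen.
        return {"url": self.url, "position": self.tell()}

    def __setstate__(self, state):
        # Called when unpickling: reopen, then restore the offset.
        self.url = state["url"]
        self._handle = self._open(self.url)
        self._handle.seek(state["position"])


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    local_path = tmp.name

original = ReopenableReader(local_path)
original.read(4)  # advance to offset 4 before pickling

clone = pickle.loads(pickle.dumps(original))
rest = clone.read()

original.close()
clone.close()
os.remove(local_path)
print(rest)
```

The clone holds its own handle, so reads on one side do not move the other's position.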
