Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different return types for S3Path.stat() and WindowsUPath.stat() #145

Closed
theogaraj opened this issue Sep 22, 2023 · 6 comments · Fixed by #179
Closed

Different return types for S3Path.stat() and WindowsUPath.stat() #145

theogaraj opened this issue Sep 22, 2023 · 6 comments · Fixed by #179
Labels
bug 🐛 Something isn't working compatibility 🤝 Compatibility with stdlib pathlib
Milestone

Comments

@theogaraj
Copy link

theogaraj commented Sep 22, 2023

Which operating system and Python version are you using?
Windows 11, Python 3.9.6

Which version of this project are you using?
0.1.3

What did you do?
I am attempting to use universal_pathlib in order to have a unified way of handling files whether they are local or in S3.
One of the things I need to do is get all the files in a folder (yes I know S3 doesn't have actual folders, but hopefully you understand what I mean) and get their sizes.

>>> from upath import UPath
>>> lpath = UPath('local_data/files/')
>>>spath = UPath('s3://test_bucket/files/')
>>> lfiles = lpath.iterdir()
>>> lf = next(lfiles)
>>> type(lf.stat())
<class 'os.stat_result'>
>>> sfiles = spath.iterdir()
>>> sf = next(sfiles)
>>> type(sf.stat())
<class 'dict'>
>>> lf.stat().st_size
1657
>>> sf.stat().st_size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'st_size'

What did you expect to see?
I expected to be able to use same code to get file sizes whether they are in local directory or in S3

What did you see instead?
The return types of stat() are different for S3Path vs WindowsUPath and I can't get file size in the same way from each.

Would this difference in behavior be something you would consider reconciling? Alternatively, do you have suggestions on another approach to achieving what I'm trying to do?

@ap--
Copy link
Collaborator

ap-- commented Sep 23, 2023

Hi @theogaraj

This is definitely an issue, and we should create a UPathStatResult class (name up for discussion) that provides an interface compatible with dict (fsspec) and os.stat_result

We need to ensure that the os.stat_result attribute names map to the equivalent type for each of the fsspec filesystems, too.

relevant fsspec issues:


For now I don't have a good recommendation other than
_stat = pth.stat()
size = _stat["size"] if isinstance(_stat, dict) else _stat.st_size

Cheers,
Andreas

@ap-- ap-- added bug 🐛 Something isn't working compatibility 🤝 Compatibility with stdlib pathlib labels Sep 23, 2023
@ap--
Copy link
Collaborator

ap-- commented Sep 24, 2023

I vaguely remembered that I implemented something related to this already...

getmtime on fsspec

https://github.com/mdshw5/pyfaidx/blob/cac82f24e9c4e334cf87a92e477b92d4615d260f/pyfaidx/__init__.py#L1318-L1345

@theogaraj
Copy link
Author

@ap-- thank you for checking and responding so promptly. Because of another problem I had with iterdir (logged as #146) I ended up going with the slightly clunkier spath.fs.ls(str(spath)) and then accessing the ['size] attribute common to both local and S3.

So this is by no means a showstopper. I'll track this issue and can update my code whenever someone is able to resolve these two issues.

@bolkedebruin
Copy link

Just ran across this. See https://github.com/apache/airflow/blob/main/airflow/io/store/stat.py for a stat compatible version.

@asford
Copy link

asford commented Feb 4, 2024

I just ran across the same issue here and would be happy to fire a PR with a fix, though it looks like @ap-- may already be investigating?

@bolkedebruin make several updates to the airflow io provider while inheriting from UPath in apache/airflow#35612 which are great prior art here.

I would propose that we just port the changes to support stat into UPath, which is now under:

@ap-- ap-- added this to the v0.2 milestone Feb 8, 2024
@ap--
Copy link
Collaborator

ap-- commented Feb 9, 2024

Collected Info

For links to info() dicts of a lot of FileSystem implementations check fsspec/filesystem_spec#526 (comment)

All filesystems have "name", "size" and "type".

  • "name" is a string
  • "size" is an int or None
  • "type" is a string that is almost always "file" or "directory" but it can be "other", or (something else)

For translating to os.stat_result.st_* attributes, these are the keys that could be checked:

Attribute Possible Info Keys
st_mode mode, unix.mode, writable, isLink, nlink, permission, isexec
st_ino ino, name, id, sha, hex, Digest?
st_dev
st_nlink nlink, isLink
st_uid uid, owner, uname, unix.owner
st_gid gid, group, gname, unix.group
st_size size
st_atime time, last_accessed_on, accessTime
st_mtime mtime, last_modified, last_modification_time_ms, timeModified, modify, modificationTime, LastModified, modified_at
st_ctime
st_birthtime created, creation_time, timeCreated, created_at

Additional info to be considered comes from specific filesystems:

All data types need to be normalized to int

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Something isn't working compatibility 🤝 Compatibility with stdlib pathlib
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants