Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose os.DirEntry objects from pathlib #125413

Open
barneygale opened this issue Oct 13, 2024 · 2 comments
Open

Expose os.DirEntry objects from pathlib #125413

barneygale opened this issue Oct 13, 2024 · 2 comments
Labels
performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented Oct 13, 2024

Feature or enhancement

Path.iterdir() uses os.scandir() under-the-hood, but it throws away the resulting os.DirEntry objects, despite their numerous useful features.

I propose we add a new Path.dir_entry attribute that stores an os.DirEntry object or None. This attribute will be set to a directory entry in paths yielded from Path.iterdir().

This would allow users to call methods such as child.dir_entry.is_symlink() to check for symlinks without incurring a mandatory system call. It will help speed up the implementation of Path.copy() too.

This description needs a rework. I've found the above suggestion is a whole can of worms!

See discussion: https://discuss.python.org/t/is-there-a-pathlib-equivalent-of-os-scandir/46626

Linked PRs

@barneygale barneygale added type-feature A feature request or enhancement performance Performance or resource usage topic-pathlib labels Oct 13, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 13, 2024
Add a `Path.dir_entry` attribute. In any path object generated by
`Path.iterdir()`, it stores an `os.DirEntry` object corresponding to the
path; in other cases it is `None`.

This can be used to retrieve the file type and attributes of directory
children without necessarily incurring further system calls.

Under the hood, we use `dir_entry` in our implementations of
`PathBase.glob()`, `PathBase.walk()` and `PathBase.copy()`, the last of
which also provides the implementation of `Path.copy()`, resulting in a
modest speedup when copying local directory trees.
barneygale added a commit to barneygale/cpython that referenced this issue Oct 13, 2024
Add a `Path.dir_entry` attribute. In any path object generated by
`Path.iterdir()`, it stores an `os.DirEntry` object corresponding to the
path; in other cases it is `None`.

This can be used to retrieve the file type and attributes of directory
children without necessarily incurring further system calls.

Under the hood, we use `dir_entry` in our implementations of
`PathBase.glob()`, `PathBase.walk()` and `PathBase.copy()`, the last of
which also provides the implementation of `Path.copy()`, resulting in a
modest speedup when copying local directory trees.
@ncoghlan
Copy link
Contributor

I put this feedback on the PR, but it's probably better placed here: while I like the general idea, I don't think this specific API is the right way to do it.

  • dir_entry potentially being None based on how the instance was created is inconvenient
  • the docs having to excuse dir_entry existing on PurePath objects is awkward

I think we can eliminate both of those bits of awkwardness:

  • define a new alternative construction method on os.DirEntry objects that allows one to be created from arbitrary os.PathLike objects
  • make the slot on PurePath private rather than public (presumably as _dir_entry)
  • define PathBase.dir_entry as a read-only property that returns the cached entry if it is already set, otherwise it uses the new constructor API to create a cached DirEntry instance for itself

If it's impractical to add os.DirEntry.from_path, then a pathlib._DirEntry class that just emulated the os.DirEntry API based on the real underlying Path object would also be fine

barneygale added a commit to barneygale/cpython that referenced this issue Oct 25, 2024
… once

Improve `pathlib._abc.PathBase.copy()` (which provides `Path.copy()`) by
fetching operands' supported metadata keys up-front, rather than once for
each path in the tree.

This prepares the way for using `os.DirEntry` objects in `copy()`.
@barneygale barneygale changed the title Add pathlib.Path.dir_entry Expose os.DirEntry objects from pathlib Oct 28, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 28, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`.

In the private `pathlib._abc.PathBase` class, we can rework the
`iterdir()`, `glob()`, `walk()` and `copy()` methods to call `scandir()`
and make use of cached directory entry information, and thereby improve
performance. Because the `Path.copy()` method is provided by `PathBase`,
this also speeds up traversal when copying local files and directories.
barneygale added a commit that referenced this issue Nov 1, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`. This
will be used to implement several `PathBase` methods more efficiently,
including methods that provide `Path.copy()`.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
@barneygale
Copy link
Contributor Author

To tie up the above loose ends, we went with a Path.scandir() method in the end.

barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly
reduces the number of `PathBase.stat()` calls needed when copying. This
also speeds up `Path.copy()`, which inherits the superclass implementation.

Under the hood, we use directory entries to distinguish between files,
directories and symlinks, and to retrieve a `stat_result` when reading
metadata. This logic is extracted into a new `pathlib._abc.CopierBase`
class, which helps reduce the number of underscore-prefixed support
methods in the path interface.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly
reduces the number of `PathBase.stat()` calls needed when copying. This
also speeds up `Path.copy()`, which inherits the superclass implementation.

Under the hood, we use directory entries to distinguish between files,
directories and symlinks, and to retrieve a `stat_result` when reading
metadata. This logic is extracted into a new `pathlib._abc.CopierBase`
class, which helps reduce the number of underscore-prefixed support
methods in the path interface.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants