gh-121267: Improve performance of tarfile (#121267) #121269
Conversation
Force-pushed fb0b6c3 to 41397ce
Can the cache get outdated during execution?
@nineteendo Yes. Suppose the passwd database changes during processing of a tar file. We have two options: (1) re-read the passwd/group database for every file, so the mapping is always current, or (2) cache the mapping and accept that a concurrent change may not be reflected.
Python currently does (1), which comes at a steep (several hundred percent) performance cost, as we need to re-read and parse the passwd/group database for every single file. Note also that we first stat the file and then read passwd, so there is already a race condition in the code, if we want to view it that way. Doing (2) is much faster, as I've shown. It's also what GNU tar does, so it's not a new idea.
But you are doing the caching at class level, right? Which means it might not be a "single" file that has an inconsistent mapping. If it's a long-running process, the user could generate a tar file, then change the mapping, and try to generate another one, but the name is cached and the change won't be reflected.
Yeah, that's what I was thinking too. Would a new parameter be better?

```diff
 def __init__(self, name=None, mode="r", fileobj=None, format=None,
              tarinfo=None, dereference=None, ignore_zeros=None, encoding=None,
              errors="surrogateescape", pax_headers=None, debug=None,
-             errorlevel=None, copybufsize=None, stream=False):
+             errorlevel=None, copybufsize=None, stream=False, uname_cache=None,
+             gname_cache=None):
```
I'm sorry, that was a mistake. I intended the cache to be per TarFile instance, to make the lifetime of the cache clearly limited. I'd be happy to correct it. I agree that a process-global cache isn't a good idea.

@nineteendo I'm not sure I follow you. Do you mean to make the caching opt-in? I was hoping that this could be turned on by default, so more code can benefit from the speedup.
Never mind, just put it on the instance (there's already a cache for inodes): Lines 1724 to 1732 in 41397ce
Force-pushed 41397ce to 8d2f912
Tarfile in the default write mode spends much of its time resolving UIDs into usernames and GIDs into group names. By caching these mappings, a significant speedup can be achieved. In my simple benchmark[1], this extra caching speeds up tarfile by 8x. [1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
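A minimal version of such a benchmark (a sketch, not the linked gist; absolute timings and the speedup factor will vary by system) could look like:

```python
import os
import pwd
import timeit


def resolve_uncached(uids):
    # Resolve every uid directly against the passwd database,
    # as tarfile effectively did before the patch.
    return [pwd.getpwuid(u).pw_name for u in uids]


def resolve_cached(uids):
    # Resolve each distinct uid once and reuse the result.
    cache = {}
    names = []
    for u in uids:
        if u not in cache:
            cache[u] = pwd.getpwuid(u).pw_name
        names.append(cache[u])
    return names


if __name__ == "__main__":
    uids = [os.getuid()] * 10_000  # stand-in for a 10k-member archive
    for fn in (resolve_uncached, resolve_cached):
        t = timeit.timeit(lambda: fn(uids), number=10)
        print(f"{fn.__name__}: {t:.3f}s")
```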
Force-pushed 8d2f912 to c5eee91
@nineteendo @gaogaotiantian Thanks for your feedback! I have pushed a fixed version now.
Next time please do not force-push - it would be easier for us to review the code and the history.
`a not in b` is preferred over `not a in b` as it's easier to read. Also, if `pwd.getpwuid` raises a `KeyError`, your current code does not cache it. I made some suggestions; feel free to just apply them.
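The point about `KeyError` is that a failed lookup should be cached too; otherwise every file owned by an unknown uid pays the full database lookup again. A sketch of the suggested shape (hypothetical class and method names, not the merged code):

```python
import pwd


class UnameCache:
    """Cache uid -> name lookups, including failed ones."""

    def __init__(self):
        self._cache = {}

    def lookup(self, uid):
        if uid not in self._cache:
            try:
                name = pwd.getpwuid(uid).pw_name
            except KeyError:
                # Cache the miss as well, so an unknown uid is only
                # looked up once; tarfile falls back to the numeric id.
                name = ""
            self._cache[uid] = name
        return self._cache[uid]
```

Negative caching like this matters for archives created on another machine, where many uids may have no entry in the local passwd database.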
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase: `I have made the requested changes; please review again`. And if you don't make the requested changes, you will be put in the comfy chair!
Co-authored-by: Tian Gao <[email protected]>
@gaogaotiantian: I have made the requested changes; please review again

Thanks for making the requested changes! @gaogaotiantian: please review the changes made to this pull request.

@ethanfurman is the component owner of tar-related stuff. I'll leave the decision to him. The code looks good to me, but I'm not sure if there is any concern specifically for the optimization.
Misc/NEWS.d/next/Library/2024-07-02-15-56-42.gh-issue-121267.yFBWkh.rst
Co-authored-by: Bénédikt Tran <[email protected]>
Thanks @picnixz, I have applied your suggestions.

@ethanfurman, would you care to have a look at the patch? I think this performance gain would be quite beneficial for users of "tarfile".

@picnixz @gaogaotiantian @ethanfurman My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along? Thanks for the help.

I personally have no way to commit, and patches may be pending for a long time until a core dev accepts them. I'm not a tarfile expert, so I think you'll need to wait until Ethan has time to review it.

@picnixz Thanks. I wasn't sure if people were waiting for me to do something at this point.
Thanks for the patch!
Thanks everyone for moving this along, and sorry I missed it -- my wife was in the hospital when this first started.

Thanks for your help everyone! Ethan, I hope your wife is feeling better.
@hauntsaninja Should we backport the changes? (it may be nice and I don't think it breaks compatibility here)
I hope so too!
While it would be nice, it feels more like an enhancement and not a bug. (And she is, thank you.) |
We typically don't backport performance improvements. And hope things are well, Ethan! |
It would be great to backport. It's a very small patch that makes tarring twice as fast. I wouldn't be surprised if this patch alone could save a whole power plant's worth of power from computers that are wasting time archiving stuff inefficiently. Do we really have to wait for Python 3.14 to land for the fix to be available?
I've had enough "simple" patches with unintended consequences that I am not willing to backport this one. If you would like to open a Discourse thread about it and get consensus that this patch is fine to backport, I'll be happy to do so.