
gh-121267: Improve performance of tarfile (#121267) #121269

Merged
merged 9 commits into from
Oct 30, 2024

Conversation

jforberg
Contributor

@jforberg jforberg commented Jul 2, 2024

Tarfile in the default write mode spends much of its time resolving UIDs into usernames and GIDs into group names. By caching these mappings, a significant speedup can be achieved.

In my simple benchmark[1], this extra caching speeds up tarfile by 8x.

[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
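The idea can be sketched as follows. This is a minimal illustration, not the merged patch: the cache dicts here are module-level and their names are made up for the example, whereas the actual change stores them per `TarFile` instance.

```python
import grp
import pwd

# Illustrative caches: remember each UID -> user name and GID -> group
# name resolution so the passwd/group database is consulted only once
# per distinct ID for the lifetime of the cache.
uname_cache = {}
gname_cache = {}

def cached_uname(uid):
    if uid not in uname_cache:
        try:
            uname_cache[uid] = pwd.getpwuid(uid).pw_name
        except KeyError:
            uname_cache[uid] = ""   # no passwd entry for this UID
    return uname_cache[uid]

def cached_gname(gid):
    if gid not in gname_cache:
        try:
            gname_cache[gid] = grp.getgrgid(gid).gr_name
        except KeyError:
            gname_cache[gid] = ""   # no group entry for this GID
    return gname_cache[gid]
```

With many files owned by a handful of users, almost every lookup becomes a dict hit instead of a database read.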


cpython-cla-bot bot commented Jul 2, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@nineteendo
Contributor

Can the cache get outdated during execution?

@jforberg
Contributor Author

jforberg commented Jul 2, 2024

@nineteendo Yes. Suppose the passwd database changes during processing of a tar file. We have two options:

  1. Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
  2. Keep the same uname as when the UID was first encountered, ignore the new change.

Python currently does (1), which comes at a steep (several hundred percent) performance cost, as we need to re-read and parse the passwd/group database for every single file. Note also that we first stat the file and then read passwd, so there is already a race condition in the code if we want to view it that way.

Doing (2) is much faster as I've shown. It's also what GNU tar does, so it's not a new idea.
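Option (2) means the first resolution wins for the lifetime of the cache. A toy illustration, using a stand-in lookup function in place of the real passwd database:

```python
cache = {}

def resolve(uid, lookup):
    # Perform the lookup only on first encounter; afterwards the cached
    # name is returned even if the underlying database has changed.
    if uid not in cache:
        cache[uid] = lookup(uid)
    return cache[uid]

assert resolve(1000, lambda uid: "alice") == "alice"   # first lookup wins
assert resolve(1000, lambda uid: "bob") == "alice"     # later change ignored
```

The archive stays internally consistent, at the cost of not picking up mid-run database edits.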

@gaogaotiantian
Member

Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.

But you are doing the cache at class level, right? That means it might not be a "single" file that has an inconsistent mapping. In a long-running process, the user could generate a tar file, then change the mapping and try to generate another one - but the name is cached and the change won't be reflected.

@nineteendo
Contributor

Yeah, that's what I was thinking too. Would a new parameter be better?

```diff
 def __init__(self, name=None, mode="r", fileobj=None, format=None,
         tarinfo=None, dereference=None, ignore_zeros=None, encoding=None,
         errors="surrogateescape", pax_headers=None, debug=None,
-        errorlevel=None, copybufsize=None, stream=False):
+        errorlevel=None, copybufsize=None, stream=False, uname_cache=None,
+        gname_cache=None):
```

@jforberg
Contributor Author

jforberg commented Jul 2, 2024

But you are doing the cache at class level right?

I'm sorry, that was a mistake. I intended the cache to be per TarFile instance, to make the lifetime of the cache clearly limited. I'd be happy to correct it. I agree that a process-global cache isn't a good idea.

@nineteendo I'm not sure I follow you. Do you mean to make the caching opt-in? I was hoping that this could be turned on by default, so more code can benefit from the speedup.

@nineteendo
Contributor

Nevermind, just put it on the instance (there's already a cache for inodes):

cpython/Lib/tarfile.py

Lines 1724 to 1732 in 41397ce

```python
# Init datastructures.
self.copybufsize = copybufsize
self.closed = False
self.members = []       # list of members as TarInfo objects
self._loaded = False    # flag if all members have been read
self.offset = self.fileobj.tell()
                        # current position in the archive file
self.inodes = {}        # dictionary caching the inodes of
                        # archive members already added
```
@jforberg jforberg force-pushed the improve_tarfile_performance branch from 41397ce to 8d2f912 on July 3, 2024 20:01
@jforberg jforberg force-pushed the improve_tarfile_performance branch from 8d2f912 to c5eee91 on July 3, 2024 20:02
@jforberg
Contributor Author

jforberg commented Jul 3, 2024

@nineteendo @gaogaotiantian Thanks for your feedback! I have pushed a fixed version now.

@gaogaotiantian
Member

Next time please do not force push - it would be easier for us to review the code and the history.

Member

@gaogaotiantian gaogaotiantian left a comment


a not in b is preferred over not a in b as it's easier to read. Also, if pwd.getpwuid raises a KeyError, your current code does not cache it. I made some suggestions; feel free to just apply them.
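Both review points can be sketched together: spell the membership test `uid not in cache`, and cache failed lookups too, so an ID with no passwd entry doesn't trigger a database read for every file. (The `resolver` callable below is a stand-in for `pwd.getpwuid`; the helper names are made up for the example.)

```python
cache = {}

def lookup(uid, resolver):
    # Preferred spelling: `uid not in cache`, not `not uid in cache`.
    if uid not in cache:
        try:
            name = resolver(uid)
        except KeyError:
            # Cache the failure too, so an unknown UID is resolved once.
            name = ""
        cache[uid] = name
    return cache[uid]

calls = []
def missing(uid):
    calls.append(uid)
    raise KeyError(uid)

assert lookup(4242, missing) == ""
assert lookup(4242, missing) == ""   # served from the cache
assert len(calls) == 1               # resolver was hit only once
```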

(Review suggestions on Lib/tarfile.py, both marked resolved.)
@bedevere-app

bedevere-app bot commented Jul 3, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

@jforberg
Contributor Author

jforberg commented Jul 3, 2024

@gaogaotiantian: I have made the requested changes; please review again

@bedevere-app

bedevere-app bot commented Jul 3, 2024

Thanks for making the requested changes!

@gaogaotiantian: please review the changes made to this pull request.

@gaogaotiantian
Member

@ethanfurman is the component owner of tar-related stuff. I'll leave the decision to him. The code looks good to me, but I'm not sure if there is any concern specific to the optimization.

@jforberg
Contributor Author

jforberg commented Jul 8, 2024

Thanks @picnixz, I have applied your suggestions.

@jforberg
Contributor Author

@ethanfurman, would you care to have a look at the patch? I think this performance gain would be quite beneficial for users of "tarfile".

@jforberg
Contributor Author

@picnixz @gaogaotiantian @ethanfurman My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along? Thanks for the help.

@picnixz
Contributor

picnixz commented Aug 24, 2024

My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along

I personally have no way to commit, and patches may be pending for a long time until a core dev accepts them. I'm not a tarfile expert, so I think you'll need to wait until Ethan has time to review it.

@jforberg
Contributor Author

@picnixz Thanks. I wasn't sure if people were waiting for me to do something at this point.

@morotti
Contributor

morotti commented Oct 30, 2024

Hello, not a core dev.

I just noticed too that the performance of tarfile is horrible and found this ticket.

Commercial profiler run, using tarfile to tar the venv: 96 seconds to do the tarfile.add(), with one third of that spent looking up user/group info.

[profiler screenshot]

Contributor

@hauntsaninja hauntsaninja left a comment


Thanks for the patch!

@hauntsaninja hauntsaninja merged commit 2b2d607 into python:main Oct 30, 2024
36 checks passed
@ethanfurman
Member

Thanks everyone for moving this along and sorry I missed it -- my wife was in the hospital when this first started.

@jforberg
Contributor Author

Thanks for your help everyone! Ethan, I hope your wife is feeling better.

@picnixz
Contributor

picnixz commented Oct 30, 2024

@hauntsaninja Should we backport the changes? (it may be nice and I don't think it breaks compatibility here)

Ethan, I hope your wife is feeling better

I hope too!

@ethanfurman
Member

While it would be nice, it feels more like an enhancement and not a bug.

(And she is, thank you.)

@hauntsaninja
Contributor

We typically don't backport performance improvements. And hope things are well, Ethan!

@morotti
Contributor

morotti commented Oct 31, 2024

It would be great to backport. It's a very small patch that makes tarring twice as fast.

I wouldn't be surprised if this patch alone could save a whole power plant's worth of energy from computers wasting time archiving stuff inefficiently. Do we really have to wait for Python 3.14 for the fix to be available?

@ethanfurman
Member

I've had enough "simple" patches with unintended consequences that I am not willing to backport this one. If you would like to open a discourse thread about it and get consensus that this patch is fine to backport, I'll be happy to do so.
