
gh-121267: Improve performance of tarfile (#121267) #121269

Merged
merged 9 commits into from
Oct 30, 2024

Conversation

jforberg
Contributor

@jforberg jforberg commented Jul 2, 2024

Tarfile in the default write mode spends much of its time resolving UIDs into usernames and GIDs into group names. By caching these mappings, a significant speedup can be achieved.

In my simple benchmark[1], this extra caching speeds up tarfile by 8x.

[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
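The idea can be sketched as follows. This is a minimal illustration, not the merged patch: the cache dicts here are module-level and their names are made up for the example, whereas the actual change stores them per `TarFile` instance.

```python
import grp
import pwd

# Illustrative caches: remember each UID -> user name and GID -> group
# name resolution so the passwd/group database is consulted only once
# per distinct ID for the lifetime of the cache.
uname_cache = {}
gname_cache = {}

def cached_uname(uid):
    if uid not in uname_cache:
        try:
            uname_cache[uid] = pwd.getpwuid(uid).pw_name
        except KeyError:
            uname_cache[uid] = ""   # no passwd entry for this UID
    return uname_cache[uid]

def cached_gname(gid):
    if gid not in gname_cache:
        try:
            gname_cache[gid] = grp.getgrgid(gid).gr_name
        except KeyError:
            gname_cache[gid] = ""   # no group entry for this GID
    return gname_cache[gid]
```

With many files owned by a handful of users, almost every lookup becomes a dict hit instead of a database read.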


cpython-cla-bot bot commented Jul 2, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@nineteendo
Contributor

Can the cache get outdated during execution?

@jforberg
Contributor Author

jforberg commented Jul 2, 2024

@nineteendo Yes. Suppose the passwd database changes during processing of a tar file. We have two options:

  1. Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.
  2. Keep the same uname as when the UID was first encountered, ignore the new change.

Python currently does (1), which comes at a steep (several hundred percent) performance cost, as we need to re-read and parse the passwd/group database for every single file. Note also that we first stat the file and then read passwd, so there is already a race condition in the code if we want to view it that way.

Doing (2) is much faster as I've shown. It's also what GNU tar does, so it's not a new idea.
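Option (2) means the first resolution wins for the lifetime of the cache. A toy illustration, using a stand-in lookup function in place of the real passwd database:

```python
cache = {}

def resolve(uid, lookup):
    # Perform the lookup only on first encounter; afterwards the cached
    # name is returned even if the underlying database has changed.
    if uid not in cache:
        cache[uid] = lookup(uid)
    return cache[uid]

assert resolve(1000, lambda uid: "alice") == "alice"   # first lookup wins
assert resolve(1000, lambda uid: "bob") == "alice"     # later change ignored
```

The archive stays internally consistent, at the cost of not picking up mid-run database edits.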

@gaogaotiantian
Member

Generate a tar file with inconsistent UID->uname mapping. Files owned by the same user will have different uname.

But you are doing the cache at class level, right? That means it might not be a "single" file that has an inconsistent mapping. In a long-running process, the user could generate a tar file, then change the mapping and try to generate another one - but the name is cached and the change won't be reflected.

@nineteendo
Contributor

Yeah, that's what I was thinking too. Would a new parameter be better?

```diff
 def __init__(self, name=None, mode="r", fileobj=None, format=None,
         tarinfo=None, dereference=None, ignore_zeros=None, encoding=None,
         errors="surrogateescape", pax_headers=None, debug=None,
-        errorlevel=None, copybufsize=None, stream=False):
+        errorlevel=None, copybufsize=None, stream=False, uname_cache=None,
+        gname_cache=None):
```

@jforberg
Contributor Author

jforberg commented Jul 2, 2024

But you are doing the cache at class level right?

I'm sorry, that was a mistake. I intended the cache to be per TarFile instance, to make the lifetime of the cache clearly limited. I'd be happy to correct it. I agree that a process-global cache isn't a good idea.

@nineteendo I'm not sure I follow you. Do you mean to make the caching opt-in? I was hoping that this could be turned on by default, so more code can benefit from the speedup.

@nineteendo
Contributor

Nevermind, just put it on the instance (there's already a cache for inodes):

cpython/Lib/tarfile.py

Lines 1724 to 1732 in 41397ce

```python
# Init datastructures.
self.copybufsize = copybufsize
self.closed = False
self.members = []       # list of members as TarInfo objects
self._loaded = False    # flag if all members have been read
self.offset = self.fileobj.tell()
                        # current position in the archive file
self.inodes = {}        # dictionary caching the inodes of
                        # archive members already added
```
@jforberg jforberg force-pushed the improve_tarfile_performance branch from 41397ce to 8d2f912 on July 3, 2024 20:01
@jforberg jforberg force-pushed the improve_tarfile_performance branch from 8d2f912 to c5eee91 on July 3, 2024 20:02
@jforberg
Contributor Author

jforberg commented Jul 3, 2024

@nineteendo @gaogaotiantian Thanks for your feedback! I have pushed a fixed version now.

@gaogaotiantian
Member

Next time please do not force push - it would be easier for us to review the code and the history.

Member

@gaogaotiantian gaogaotiantian left a comment


a not in b is preferred over not a in b as it's easier to read. Also, if pwd.getpwuid raises a KeyError, your current code does not cache it. I made some suggestions; feel free to just apply them.
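Both review points can be sketched together: spell the membership test `uid not in cache`, and cache failed lookups too, so an ID with no passwd entry doesn't trigger a database read for every file. (The `resolver` callable below is a stand-in for `pwd.getpwuid`; the helper names are made up for the example.)

```python
cache = {}

def lookup(uid, resolver):
    # Preferred spelling: `uid not in cache`, not `not uid in cache`.
    if uid not in cache:
        try:
            name = resolver(uid)
        except KeyError:
            # Cache the failure too, so an unknown UID is resolved once.
            name = ""
        cache[uid] = name
    return cache[uid]

calls = []
def missing(uid):
    calls.append(uid)
    raise KeyError(uid)

assert lookup(4242, missing) == ""
assert lookup(4242, missing) == ""   # served from the cache
assert len(calls) == 1               # resolver was hit only once
```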

(Review suggestions on Lib/tarfile.py, both marked resolved.)
@bedevere-app

bedevere-app bot commented Jul 3, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

@jforberg
Contributor Author

jforberg commented Jul 3, 2024

@gaogaotiantian: I have made the requested changes; please review again

@bedevere-app

bedevere-app bot commented Jul 3, 2024

Thanks for making the requested changes!

@gaogaotiantian: please review the changes made to this pull request.

@gaogaotiantian
Member

@ethanfurman is the component owner of tar-related stuff. I'll leave the decision to him. The code looks good to me, but I'm not sure if there is any concern specific to the optimization.

@jforberg
Contributor Author

jforberg commented Jul 8, 2024

Thanks @picnixz, I have applied your suggestions.

@jforberg
Contributor Author

@ethanfurman, would you care to have a look at the patch? I think this performance gain would be quite beneficial for users of "tarfile".

@jforberg
Contributor Author

@picnixz @gaogaotiantian @ethanfurman My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along? Thanks for the help.

@picnixz
Contributor

picnixz commented Aug 24, 2024

My patch has been pending for a month or so and doesn't seem to be moving forward at the moment. Is there something further that I should be doing to help it along

I personally have no way to commit, and patches may be pending for a long time until a core dev accepts them. I'm not a tarfile expert, so I think you'll need to wait until Ethan has time to review it.

@jforberg
Contributor Author

@picnixz Thanks. I wasn't sure if people were waiting for me to do something at this point.

@morotti
Contributor

morotti commented Oct 30, 2024

Hello, not a core dev.

I just noticed too that the performance of tarfile is horrible and found this ticket.

Commercial profiler run, using tarfile to tar the venv: 96 seconds to do the tarfile.add(), with one third of that spent looking up user/group info.

[profiler screenshot]

Contributor

@hauntsaninja hauntsaninja left a comment


Thanks for the patch!

@hauntsaninja hauntsaninja merged commit 2b2d607 into python:main Oct 30, 2024
36 checks passed
@ethanfurman
Member

Thanks everyone for moving this along and sorry I missed it -- my wife was in the hospital when this first started.

@jforberg
Contributor Author

Thanks for your help everyone! Ethan, I hope your wife is feeling better.

@picnixz
Contributor

picnixz commented Oct 30, 2024

@hauntsaninja Should we backport the changes? (it may be nice and I don't think it breaks compatibility here)

Ethan, I hope your wife is feeling better

I hope too!

@ethanfurman
Member

While it would be nice, it feels more like an enhancement and not a bug.

(And she is, thank you.)

@hauntsaninja
Contributor

We typically don't backport performance improvements. And hope things are well, Ethan!

@morotti
Contributor

morotti commented Oct 31, 2024

It would be great to backport. It's a very small patch that makes tarring twice as fast.

I wouldn't be surprised if this patch alone could save a whole power plant's worth of energy from computers wasting time archiving stuff inefficiently. Do we really have to wait for Python 3.14 for the fix to be available?

@ethanfurman
Member

I've had enough "simple" patches with unintended consequences that I am not willing to backport this one. If you would like to open a discourse thread about it and get consensus that this patch is fine to backport, I'll be happy to do so.
