borg2: enhance compact stats #8410

ThomasWaldmann · 2024-09-23T23:44:22Z

When building a ChunkIndex it currently starts from refcount=0 and then sets refcount=MAX_VALUE if a chunk is used.

That's how most of borg2 works now: it doesn't do refcounting anymore, just a boolean "do we have chunk X".

For better deduplication stats in borg compact, we could deviate from that in just borg compact and do precise refcounting without any additional effort.

Before persisting the ChunkIndex, we then need to reset the refcounts to MAX_VALUE, similar to how we clean up the size values.
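A minimal sketch of that idea, not borg's actual code: the index is modelled as a plain dict, and `Entry`, `build_index`, `prepare_for_persisting` and the concrete `MAX_VALUE` are hypothetical names and values chosen for illustration. It counts references precisely while building, then collapses them back to the boolean-style representation (and zeroes the sizes) before persisting.

```python
from typing import Dict, Iterable, NamedTuple, Tuple

MAX_VALUE = 2**32 - 1  # assumption: sentinel refcount meaning "chunk is used"


class Entry(NamedTuple):
    refcount: int
    size: int


def build_index(
    repo_chunk_ids: Iterable[bytes],
    archives: Iterable[Iterable[Tuple[bytes, int]]],
) -> Dict[bytes, Entry]:
    """Start every chunk at refcount=0, then count each reference precisely."""
    index = {cid: Entry(0, 0) for cid in repo_chunk_ids}
    for chunk_refs in archives:
        for cid, size in chunk_refs:
            # A referenced chunk that is missing from the repo is simply
            # added here; real code would likely want to report it.
            old = index.get(cid, Entry(0, 0))
            index[cid] = Entry(old.refcount + 1, size)
    return index


def prepare_for_persisting(index: Dict[bytes, Entry]) -> Dict[bytes, Entry]:
    """Collapse precise refcounts to 0 / MAX_VALUE and clear the sizes."""
    return {
        cid: Entry(MAX_VALUE if e.refcount > 0 else 0, 0)
        for cid, e in index.items()
    }
```

The precise counts only live in memory for the duration of the compact run, so the rest of borg2 would still see the boolean "do we have chunk X" form.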

To consider:

what do we win?
we have the total size also in the archive metadata and can sum up all of these.
do we want to add per-directory stats / analytics? maybe even 2-pass stuff?

ThomasWaldmann · 2024-09-24T09:55:01Z

Comment about what's interesting for practical usage: #122 (comment)

awgcooper · 2024-09-26T02:13:48Z

This would definitely be useful: #122 (comment)

Question: if compression and/or obfuscation is enabled, would the size stats be given for the native file, i.e. pre-compression, etc.?

awgcooper · 2024-09-26T02:19:25Z

Something else, not sure if it relates specifically to this: let's say I have a specific file backed up. I know this because it appears in a listing of the contents of the most recent archive. Let's say I wanted to eliminate this file from the whole repo: how would I do that? Would I simply delete the first instance of it being backed up, and would that automatically eliminate all dedups? If so, how would I find it? fusermount?

ThomasWaldmann · 2024-09-26T18:09:58Z

@awgcooper No, it does not work like that.

But you can use borg recreate to rewrite all the archives that contain the unwanted file (or the directory). Just be very careful with that and first use --dry-run --list to see if it does what you want.

ThomasWaldmann · 2024-10-12T18:07:44Z

About "what do we win?" (see top post):

I guess the only thing would be the "deduplication factor", computed as:

DF = total_deduplicated_size_uncompressed / total_undeduplicated_size_uncompressed

The first value is just the sum of all plaintext chunk sizes.
The second value is the sum of the total archive sizes of all archives.

To do that in a memory efficient way together with the already present stats (which need the compressed chunk sizes), we need to store the plaintext size AND the compressed size into the in-memory ChunkIndex we build.

So, in the end, we could show deduplication and compression factors.
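As a rough sketch of the computation (not borg's implementation): `SizeEntry`, `compact_factors` and the field names are hypothetical, and the index is again modelled as a dict. DF follows the definition above; the compression factor is not defined in this thread, so compressed size divided by plaintext size is an assumption here.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class SizeEntry(NamedTuple):
    size: int   # plaintext (uncompressed) chunk size
    csize: int  # compressed chunk size


def compact_factors(
    index: Dict[bytes, SizeEntry],
    archive_sizes: Iterable[int],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (deduplication factor, compression factor).

    A factor is None when its denominator would be zero.
    """
    # Each chunk is stored once in the index, so these sums are deduplicated.
    dedup_plain = sum(e.size for e in index.values())
    dedup_compressed = sum(e.csize for e in index.values())
    # Total archive sizes count every reference, so this is undeduplicated.
    undedup_plain = sum(archive_sizes)

    df = dedup_plain / undedup_plain if undedup_plain else None
    # Assumed definition: compressed size relative to plaintext size.
    cf = dedup_compressed / dedup_plain if dedup_plain else None
    return df, cf
```

With both sizes kept per index entry, a single pass over the in-memory index is enough, plus one sum over the per-archive totals from the archive metadata.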

ThomasWaldmann added the cmd: compact label Sep 26, 2024

ThomasWaldmann changed the title ~~enhance borg2 compact stats~~ borg2: enhance compact stats Oct 2, 2024
