Repozo incremental recover #403

Open: wants to merge 12 commits into master

@Sebatyne (Contributor) commented Oct 22, 2024

As with repozo's --backup mode, this adds an incremental mode for recovering a backup into a Data.fs: it appends only the latest backup increment(s) (.deltafs files) to a previously recovered Data.fs, instead of discarding it and restarting the recovery from scratch.

This becomes the new default behavior; the new --full flag falls back to the previous behavior.

A few checks are performed before recovering incrementally (on the file size and on the checksum of the latest increment), and the code automatically falls back to full-recovery mode if any of them fails. This would happen, for example, if the production data has been packed since the previous recovery.
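The size check described above can be sketched as a pure function over the backup index. This is a minimal sketch, not repozo's code: it assumes each index entry is a `(filename, startpos, endpos, md5)` tuple in backup order, and `plan_incremental_recover` is a hypothetical helper. The existing output is only extended if its size lands exactly on a chunk boundary; otherwise the function returns `None`, meaning "fall back to a full recover".

```python
from typing import List, Optional, Tuple

# Assumed index entry layout: (filename, startpos, endpos, md5 hex digest),
# in the order the backups were taken (full backup first).
Chunk = Tuple[str, int, int, str]


def plan_incremental_recover(
    chunks: List[Chunk], target_size: int
) -> Optional[Tuple[Chunk, List[Chunk]]]:
    """Hypothetical helper: decide whether an incremental recover is possible.

    Returns (last chunk already in the target, chunks still to append),
    or None when the caller should fall back to a full recover.
    """
    if not chunks:
        return None
    # A target smaller than the full backup cannot be a previous
    # recovery of this backup chain.
    if target_size < chunks[0][2]:
        return None
    for i, chunk in enumerate(chunks):
        if chunk[2] == target_size:
            # The target ends exactly where this chunk ends, so only the
            # later chunks (possibly none) still have to be appended.
            return chunk, chunks[i + 1:]
    # The size matches no chunk boundary, e.g. the storage was packed
    # and a new full backup started a new chain: recover from scratch.
    return None
```

A size match alone does not prove that the existing file is a prefix of this backup chain, so the returned last chunk's checksum should still be compared with the corresponding bytes on disk before anything is appended.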

The need for this feature arose from our own production use: we create delta backups of a FileStorage every day, send them to a stand-by server, and rebuild the ZODB there (also every day). When the ZODB is larger than 1 TB, a full recovery can take several hours, whereas an incremental recovery would take only a few minutes (often even less).

…on to the implementation of the incremental recover
This allows recovering a ZODB FileStorage by appending only the chunks missing from the previously recovered file, instead of always recovering from zero.

Based on the work of @vpelletier (incpozo).
@Sebatyne (Contributor Author)

@vpelletier, since you did the original work, maybe you would like to review this?

@perrinjerome, you're the most recent contributor to repozo; could you review this PR?

Thanks,

@perrinjerome (Contributor) left a comment

Can you also add a changelog entry?

I only looked at the diff this time; I will try actually running repozo with this patch later.

Two review threads on src/ZODB/scripts/repozo.py (outdated, resolved).
@Sebatyne (Contributor Author)

Hello,

I would like to explain my reasoning behind the new behavior in more detail, since a change of default can be surprising.

A good practice in any backup/recovery plan is to check that the backed-up data can actually be restored at a remote site. That's why we recover the Delta.fs every day: to check that the latest .deltafs increment is valid (it is the only new backed-up file each day, since the other .deltafs files and the original .fs are already synchronised to the remote site).

Following from this: when we import the new increment, we already have the Delta.fs recovered on the previous day, so deleting it and rebuilding it from zero seems a waste of resources. If we could simply recover the new increment onto the existing Delta.fs, its checksum would be verified, proving its validity once and for all. After that, there is no need to check it again every day, since data corruption is most likely to occur during the write or the network copy.
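Recovering the new increment onto the existing file while checking its sum can be done in a single pass. This is a minimal sketch, not repozo's implementation: it assumes the backup index records an MD5 digest of each (uncompressed) increment, and `append_increment` is a hypothetical helper. It records the file length with the seek-to-end-then-tell idiom before writing, so that an increment whose digest does not match can be truncated away, leaving the previously recovered data intact.

```python
import hashlib


def append_increment(target_path: str, increment_path: str,
                     expected_md5: str, blocksize: int = 1 << 16) -> int:
    """Hypothetical helper: append one increment and verify its MD5.

    Returns the number of bytes appended. On a checksum mismatch the
    target is truncated back to its previous length and ValueError is
    raised.
    """
    h = hashlib.md5()
    with open(target_path, "r+b") as outfp:
        outfp.seek(0, 2)                # move to the end of the existing file
        initial_length = outfp.tell()
        with open(increment_path, "rb") as src:
            while True:
                data = src.read(blocksize)
                if not data:
                    break
                h.update(data)          # hash exactly the bytes we append
                outfp.write(data)
        if h.hexdigest() != expected_md5:
            outfp.truncate(initial_length)  # undo the partial append
            raise ValueError("checksum mismatch for %s" % increment_path)
        return outfp.tell() - initial_length
```

Hashing while copying means each increment is read only once, and the verification cost is proportional to the new increment rather than to the whole recovered file.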

Also, I believe the time saved by not restoring a full Data.fs is welcome: it reduces the time-to-recovery if the disaster recovery plan is activated, or simply makes it possible to create backups more often, reducing the amount of data lost in a production incident.

Please feel free to ask me more questions.

Regards,

Nicolas

@Sebatyne (Contributor Author)

> Can you also add a changelog entry?
>
> I only looked at the diff this time; I will try actually running repozo with this patch later.

I have added an entry, but I'm not sure about the wording.

@mgedmin (Member) left a comment

Overall I like this a lot. I have two small fixes to suggest.

Three review threads on src/ZODB/scripts/repozo.py (outdated, resolved).
    log('Target file smaller than full backup, '
        'falling back to a full recover.')
    return do_full_recover(options, repofiles)
check_startpos = int(previous_chunk[1])
A contributor commented:

About checking the already-restored file, which is a new concept in this tool (and the cornerstone of this new feature): should it be under the control of --with-verify? Should only the last chunk be checked when --quick is also provided?

IMHO, repozo should always check the MD5, but only of the last chunk, except when the main action is --verify (which would then only be needed for full-output checks).

These are the kind of implementation decisions I was free to make while my implementation was separate, but they become harder to settle once both tools are merged.
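For reference, checking "only the last chunk" amounts to re-hashing one byte range of the existing output and comparing it with the checksum recorded for that chunk. A minimal sketch using Python's hashlib; `md5_of_range` and `last_chunk_matches` are hypothetical names, and the chunk boundaries and expected digest are assumed to come from the backup index.

```python
import hashlib


def md5_of_range(path: str, start: int, end: int, blocksize: int = 1 << 16) -> str:
    """Return the hex MD5 of bytes [start, end) of the file at `path`."""
    h = hashlib.md5()
    with open(path, "rb") as fp:
        fp.seek(start)
        remaining = end - start
        while remaining > 0:
            data = fp.read(min(blocksize, remaining))
            if not data:
                # The file is shorter than expected: the range cannot match.
                raise ValueError("file ends before byte %d" % end)
            h.update(data)
            remaining -= len(data)
    return h.hexdigest()


def last_chunk_matches(path: str, start: int, end: int, expected_md5: str) -> bool:
    """Quick check: only the most recently recovered chunk is re-hashed."""
    try:
        return md5_of_range(path, start, end) == expected_md5
    except ValueError:
        return False
```

The cost of this check is proportional to the size of the last chunk, not to the size of the whole recovered file, which is what makes it cheap enough to run on every incremental recover.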

with open(options.output, 'r+b') as outfp:
    outfp.seek(0, 2)
    initial_length = outfp.tell()
with open(datfile) as fp:
A contributor commented:

Doesn't reopening the index risk bypassing the find_files logic, especially the --date option?
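To illustrate the concern: if the incremental path re-reads every entry of the .dat index, it can append increments newer than the requested --date, which find_files would otherwise have excluded. A hedged sketch of keeping the two consistent, assuming backup filenames start with a yyyy-mm-dd-hh-mm-ss UTC timestamp and that index entries are `(filename, startpos, endpos, md5)` tuples in chronological order; `backup_timestamp` and `chunks_up_to` are hypothetical helpers, not repozo functions.

```python
import os
from datetime import datetime
from typing import List, Tuple

# Assumed index entry layout: (filename, startpos, endpos, md5 hex digest).
Chunk = Tuple[str, int, int, str]


def backup_timestamp(name: str) -> datetime:
    """Parse the leading yyyy-mm-dd-hh-mm-ss (UTC) part of a backup name."""
    root = os.path.splitext(os.path.basename(name))[0]
    return datetime.strptime(root[:19], "%Y-%m-%d-%H-%M-%S")


def chunks_up_to(chunks: List[Chunk], cutoff: datetime) -> List[Chunk]:
    """Keep only the index entries taken at or before `cutoff`.

    Entries are assumed to be in chronological order, so the scan stops
    at the first one that is newer than the cutoff.
    """
    kept = []
    for chunk in chunks:
        if backup_timestamp(chunk[0]) > cutoff:
            break
        kept.append(chunk)
    return kept
```

Applying the same cutoff to the index entries before planning the incremental recover would keep its result identical to what a full recover with the same --date would produce.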

@mgedmin self-requested a review on October 22, 2024, 11:51
@Sebatyne (Contributor Author)

Sorry for the long list of "fixup!" commits; I didn't expect it to get that long...

To address the feedback received in this PR, and to prevent an error on Windows caused by the same file being opened twice, I had to substantially rework do_incremental_recover. I hope this is not too much of a problem for the review.

I have added more assertions in acadc7a, as well as a new step that deletes an old, already recovered .deltafs, to prove that the code is correct and that it does not silently fall back to full-recovery mode. I hope this helps you trust the rewrite of do_incremental_recover in the latest commits.

Labels: none yet
Projects: none yet
4 participants