-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve repairkit sweeper performance #62
Conversation
this prepares piecewise switch to a streaming/generator approach to updates
…mplementation of write_updated_docs()
@tloubrieu-jpl @nutjob4life any recent changes to docs generation I should know about? (re the failing branch test) |
Hi @alexdunnjpl given that adding a "column" to a document is relatively inexpensive I think it'd be certainly fine to have a repairkit/sweeper version tag that tells what release of the software touched it most recently. I'd make it a plain integer though, mostly since open search doesn't have a "version" field type, and indexing and comparisons would be cheap. I hate to hearken back to Plone but those folks did a few things right—and one was separating the notion of software version from data version (what they called a "profile" version). You might change software in ways that doesn't affect stored data at all (fixing a typo in a help message, for example). But then you might change both. They kept the profile version in a separate file (but still part of the Python package). So we've got |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow this is pretty good; just had one question on generator consumption.
e092ca8
to
0e5361c
Compare
@nutjob4life this sounds good - kinda reminds me of incremental db migrations. We still have the issue of "dev forgets to increment the value" but I don't see us getting away from that and the performance benefits are pretty persuasive. We could also throw a nice big obvious log message (with the sweepers profile version in use) at execution start to aid in detecting that situation. Resolution is a once-per-execution thing, yes, so anything up to a full minute is viable. |
this prevents (among other things) failure of branch tests on the runner due to incompatible sphinx version
6d0be16
to
e3ce550
Compare
Good to merge as soon as @sjoshi-jpl and @alexdunnjpl have tested in-situ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alexdunnjpl
@tloubrieu-jpl did you intend to merge this before we'd tested it in-situ? |
🗒️ Summary
Improves performance of repairkit by utilizing _bulk write endpoint properly, allowing it to be feasibly run on large nodes.
Implements lazy/generator-compatible handling of updates, which should either improve memory performance outright or facilitate subsequent memory optimisation efforts.
Adds query ids to log messages, allowing disambiguation of progress updates during nested/long-running queries.
Tweaks sweepers-driver, as I've confused the execution order with the factory declaration order like five times.
⚙️ Test Data and/or Report
provenance/ancestry tested heuristically - they take slightly-less time compared to pre-change sweepers (minor speedup combined with retention of no-op updates, which are faster than if data were being changed), and small-volume spot tests of documents show presence of correct metadata
ancestry tested with functional tests - passing
repairkit currently untested - need to test against a db with unrepaired data or an ad-hoc local db with unrepaired data to be absolutely certain, though the potential for error is very low and will be easily detectable by comparing initial vs subsequent runs (whether or not writes stop happening after data is fixed)
♻️ Related Issues
required for completion of #61
@jordanpadams @tloubrieu-jpl @sjoshi-jpl I strongly suggest consideration of a metadata flag containing the most-recent version of repairkit or registry-sweepers which has touched a document. Just iterating over pds-en (resulting in no updates) takes half an hour every time sweepers executes, which could be avoided entirely by filtering the resultset db-side to just those documents which haven't been hit with a current version of repairkit. I know we've talked about that previously but I don't know if we have a ticket for it yet.
@nutjob4life any smart ideas for how to resolve the version? Candidates off the top of my head are: