Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve repairkit sweeper performance #62

Merged
merged 11 commits into from
Aug 30, 2023
Merged

Conversation

alexdunnjpl
Copy link
Contributor

@alexdunnjpl alexdunnjpl commented Aug 30, 2023

🗒️ Summary

Improves performance of repairkit by utilizing _bulk write endpoint properly, allowing it to be feasibly run on large nodes.

Implements lazy/generator-compatible handling of updates, which should either improve memory performance outright or facilitate subsequent memory optimisation efforts.

Adds query ids to log messages, allowing disambiguation of progress updates during nested/long-running queries.

Tweaks sweepers-driver, as I've confused the execution order with the factory declaration order like five times.

⚙️ Test Data and/or Report

provenance/ancestry tested heuristically - they take slightly-less time compared to pre-change sweepers (minor speedup combined with retention of no-op updates, which are faster than if data were being changed), and small-volume spot tests of documents show presence of correct metadata

ancestry tested with functional tests - passing

repairkit currently untested - need to test against a db with unrepaired data or an ad-hoc local db with unrepaired data to be absolutely certain, though the potential for error is very low and will be easily detectable by comparing initial vs subsequent runs (whether or not writes stop happening after data is fixed)

♻️ Related Issues

required for completion of #61

@jordanpadams @tloubrieu-jpl @sjoshi-jpl I strongly suggest consideration of a metadata flag containing the most-recent version of repairkit or registry-sweepers which has touched a document. Just iterating over pds-en (resulting in no updates) takes half an hour every time sweepers executes, which could be avoided entirely by filtering the resultset db-side to just those documents which haven't been hit with a current version of repairkit. I know we've talked about that previously but I don't know if we have a ticket for it yet.

@nutjob4life any smart ideas for how to resolve the version? Candidates off the top of my head are:

  • hardcode a constant value within repairkit, which is best from a performance perspective but problematic due to the potential for devs to miss incrementing it when making changes to repairkit
  • resolve the value from the auto-versioned package value, which is nice but probably too slow for deployment given that we're still on 1.1.0
  • resolve the value from the environment, perhaps? I was thinking to use the docker image id, but that's not actually available within a container by default.

@alexdunnjpl
Copy link
Contributor Author

alexdunnjpl commented Aug 30, 2023

@tloubrieu-jpl @nutjob4life any recent changes to docs generation I should know about? (re the failing branch test)

@alexdunnjpl alexdunnjpl marked this pull request as ready for review August 30, 2023 01:52
@alexdunnjpl alexdunnjpl marked this pull request as draft August 30, 2023 01:52
@alexdunnjpl alexdunnjpl marked this pull request as ready for review August 30, 2023 03:13
@nutjob4life
Copy link
Member

Hi @alexdunnjpl given that adding a "column" to a document is relatively inexpensive I think it'd be certainly fine to have a repairkit/sweeper version tag that tells what release of the software touched it most recently. I'd make it a plain integer though, mostly since open search doesn't have a "version" field type, and indexing and comparisons would be cheap.

I hate to hearken back to Plone but those folks did a few things right—and one was separating the notion of software version from data version (what they called a "profile" version). You might change software in ways that doesn't affect stored data at all (fixing a typo in a help message, for example). But then you might change both. They kept the profile version in a separate file (but still part of the Python package).

So we've got VERSION.txt for software version like 1.1.0. We could add a PROFILE.txt with something like 3 in it. It's a one-time read at start up, right?

Copy link
Member

@nutjob4life nutjob4life left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is pretty good; just had one question on generator consumption.

src/pds/registrysweepers/ancestry/__init__.py Outdated Show resolved Hide resolved
@alexdunnjpl
Copy link
Contributor Author

alexdunnjpl commented Aug 30, 2023

So we've got VERSION.txt for software version like 1.1.0. We could add a PROFILE.txt with something like 3 in it. It's a one-time read at start up, right?

@nutjob4life this sounds good - kinda reminds me of incremental db migrations.

We still have the issue of "dev forgets to increment the value" but I don't see us getting away from that and the performance benefits are pretty persuasive. We could also throw a nice big obvious log message (with the sweepers profile version in use) at execution start to aid in detecting that situation.

Resolution is a once-per-execution thing, yes, so anything up to a full minute is viable.

this prevents (among other things) failure of branch tests on the runner due to incompatible sphinx version
@alexdunnjpl
Copy link
Contributor Author

Good to merge as soon as @sjoshi-jpl and @alexdunnjpl have tested in-situ

Copy link
Member

@tloubrieu-jpl tloubrieu-jpl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alexdunnjpl

@tloubrieu-jpl tloubrieu-jpl merged commit 3f8b796 into main Aug 30, 2023
1 check passed
@tloubrieu-jpl tloubrieu-jpl deleted the repairkit-repair-kit branch August 30, 2023 23:31
@alexdunnjpl
Copy link
Contributor Author

@tloubrieu-jpl did you intend to merge this before we'd tested it in-situ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants