Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement reindexer sweeper #149

Draft
wants to merge 39 commits into
base: main
Choose a base branch
from
Draft

Implement reindexer sweeper #149

wants to merge 39 commits into from

Conversation

alexdunnjpl
Copy link
Contributor

@alexdunnjpl alexdunnjpl commented Oct 10, 2024

🗒️ Summary

Implements a sweeper with the following behaviour:

  • Sweeps all documents in the registry index which haven't been swept yet by this sweeper and were not been harvested since the time the sweeper began execution (to ensure consistency of queries)
  • Generates a list of fields which are present in the product documents but not present in the index mapping, or which have mappings contradicting the info available in the registry-dd index
  • For missing mappings, resolves a type from the registry-dd index if available, else defaults to keyword
  • For conflicting mappings, logs an error stating that there is a mismatch which is not resolvable in-place and which requires a manual update/migration of the documents to a new index. (Modification/deletion of mapping entries is technically impossible in ES/OS). This functionality could be built into the sweeper if desired, but it seems like an edge case so it is omitted for now.
  • Writes updates to _mappings for all missing properties
  • Writes a consistent (per-sweeper-run) timestamp to metadata field ops:Provenance/ops:reindexed_at, which is used to identify when documents were checked/fixed, and which triggers a whole-document reindex operation, ensuring that any just-added mappings become searchable for that document.
  • Logs a summary of the update operations performed

@jordanpadams @tloubrieu-jpl could you please review these requirements to ensure that they accurately reflect the desired behaviour?

⚙️ Test Data and/or Report

Manually tested:

  • logged counts are correct
  • missing mappings are added, iff they are enumerated in the DD or static (hardcoded) type mappings
  • conflicting mappings are logged
  • types of missing mappings change according to the resolved typename
  • metadata updates are written to db
  • metadata updates are sufficient to trigger re-index, causing previously-unsearchable properties to become searchable.
  • presence of metadata attribute excludes document from document set on subsequent runs
  • harvest timestamp after execution start excludes document from document set

@jordanpadams @tloubrieu-jpl having tested locally on the I&T db, would you like me to run the sweeper against one/some of the MCP indices, after the code has been reviewed?

♻️ Related Issues

resolves #148 once run/deployed
related to NASA-PDS/registry#230

… or as "keyword" if not present there

Manual tests against docker registry:
 - logged counts are correct
 - missing mappings are added
 - types of missing mappings change according to the resolved typename
 - metadata updates are written to db
 - metadata updates are sufficient to trigger re-index, causing previously-unsearchable properties to become searchable.
 - presence of metadata attribute excludes document from document set on subsequent runs
…weep do not erroneously get flagged as processed
Copy link
Member

@nutjob4life nutjob4life left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alexdunnjpl
Copy link
Contributor Author

per @jordanpadams review/live-test postponed until next week, after the current site demos.

Copy link
Member

@jordanpadams jordanpadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdunnjpl see comments

src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
src/pds/registrysweepers/reindexer/main.py Outdated Show resolved Hide resolved
@alexdunnjpl
Copy link
Contributor Author

@jordanpadams regarding the "needs to work with LIKE" comments, for at least some of them this is beyond the scope of this issue/PR as it relies on us reworking our design to incorporate multiple mappings per field for select fields which are intended to be typed as both keyword and text

I suggest punting those. If any of your comments amount to "this should be free-text and not keyword", then I can make those updates.

@alexdunnjpl
Copy link
Contributor Author

Status: final testing in progress

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to search forcassini LDD attributes in ISS datasets
3 participants