OAK-11232 - indexing-job - Simplify download from Mongo logic by traversing only by _modified instead of (_modified, _id) #1827

Open
wants to merge 37 commits into
base: trunk


@nfsantos (Contributor) commented Oct 28, 2024

The Mongo downloader traverses the repository in order of the fields (_modified, _id). If the connection to Mongo is lost, this allows the download to resume from where it was interrupted without redownloading any document.

However, the downloader does not need to ensure that no duplicate documents are downloaded, because the merge-sort stage of the Pipelined strategy discards duplicates. Avoiding duplicates in case of reconnections is only a performance optimization for a relatively rare occurrence.
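The reason duplicates are harmless can be sketched as follows: once entries are sorted, equal entries sit next to each other, so a single pass that compares each entry with the previous one removes them. This is only an illustration of the idea, not the code of the Pipelined strategy; the class name `SortedDedupSketch`, the method `skipDuplicates`, and the use of plain strings as entries are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: entries are modeled as plain strings.
final class SortedDedupSketch {

    /** Returns the entries of an already sorted list with adjacent duplicates dropped. */
    static List<String> skipDuplicates(List<String> sortedEntries) {
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String entry : sortedEntries) {
            // In sorted input, equal entries are always adjacent,
            // so comparing with the previous entry is sufficient.
            if (!entry.equals(previous)) {
                result.add(entry);
            }
            previous = entry;
        }
        return result;
    }

    public static void main(String[] args) {
        // "/a" appears twice, as it would after a re-download on reconnection.
        List<String> batch = List.of("/a", "/a", "/a/b", "/c");
        System.out.println(skipDuplicates(batch));
    }
}
```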

This PR simplifies the downloader by traversing only by _modified. In case of failure, the download resumes from the last value of _modified seen. So if the last _modified value seen was 1000, the downloader will again download the documents with _modified=1000 that it had previously downloaded. But this would likely take just a few seconds even in the worst-case scenario. _modified has a resolution of 5 seconds, so the number of documents with the same value is limited by how much Oak can write to Mongo in a 5-second window. The downloader streams the results directly using a Mongo query, which is much faster than Oak can write. So the downloader will likely download all documents with the same _modified value in a fraction of the time it took Oak to write them, which is an acceptable overhead in the rare case of disconnection from Mongo.
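The resume behavior described above can be simulated in memory. The sketch below is a simplified model, not the downloader itself: `Doc`, `query`, `downloadedIds`, and `sample` are hypothetical names, the "collection" is a plain list, and a dropped connection is modeled by a limit on the number of results returned by the first query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory model of resuming a download by _modified only.
final class ResumeByModifiedSketch {

    /** Stand-in for a Mongo document; only the two fields discussed here. */
    static final class Doc {
        final String id;
        final long modified;

        Doc(String id, long modified) {
            this.id = id;
            this.modified = modified;
        }
    }

    /** Models a query "_modified >= from" sorted by _modified; limit models a dropped connection. */
    static List<Doc> query(List<Doc> collection, long from, int limit) {
        return collection.stream()
                .filter(d -> d.modified >= from)
                .sorted(Comparator.comparingLong((Doc d) -> d.modified))
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Downloads the collection with one simulated failure after failAfter documents. */
    static List<String> downloadedIds(List<Doc> collection, int failAfter) {
        List<Doc> downloaded = new ArrayList<>(query(collection, 0, failAfter));
        long lastSeen = downloaded.isEmpty() ? 0 : downloaded.get(downloaded.size() - 1).modified;
        // Resume inclusively: documents with _modified == lastSeen are fetched again.
        downloaded.addAll(query(collection, lastSeen, Integer.MAX_VALUE));
        return downloaded.stream().map(d -> d.id).collect(Collectors.toList());
    }

    static List<Doc> sample() {
        return List.of(new Doc("a", 995), new Doc("b", 1000), new Doc("c", 1000),
                new Doc("d", 1000), new Doc("e", 1005));
    }

    public static void main(String[] args) {
        // The first attempt stops after 3 documents, so the last _modified seen is 1000.
        System.out.println(downloadedIds(sample(), 3));
    }
}
```

In this model only the documents sharing the last seen _modified value are fetched twice; nothing is skipped, and the duplicates are left for a later stage to discard.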

This change greatly simplifies the logic of the downloader:

  • No need to track both the _modified and _id fields. The reconnection logic becomes simpler. Before, on reconnection the downloader had to first do a query to finish downloading documents with the previously seen _modified (_modified=modified_last_seen, _id>id_last_seen) and then another query with _modified>modified_last_seen. Now it is enough to query for _modified>=modified_last_seen.
  • Slight speed gain from not having to deserialize the _id field in the downloader threads.
  • Potentially faster query execution on Mongo because the traversal and sort conditions are simpler: instead of traversing and ordering by (_modified, _id), the query now uses only _modified. I have not observed a significant speed-up in Mongo traversal, but in theory the Mongo optimizer has more freedom to optimize the new, simpler query.
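The difference in reconnection filters from the first bullet can be written out as query documents. The sketch below builds them as plain Java maps that mirror MongoDB's `$gt` and `$gte` query operators; it does not use the MongoDB Java driver, and the class and method names are hypothetical, so treat it as an illustration of the filter shapes rather than the PR's code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the filter shapes, using plain maps instead of driver types.
final class ReconnectFiltersSketch {

    /** Builds a single-operator condition such as {"$gt": value}. */
    static Map<String, Object> op(String operator, Object value) {
        return Map.of(operator, value);
    }

    /** Before: two filters, run one after the other on reconnection. */
    static List<Map<String, Object>> filtersBefore(long modifiedLastSeen, String idLastSeen) {
        // 1. Finish the documents that share the last seen _modified value.
        Map<String, Object> finishCurrent = new LinkedHashMap<>();
        finishCurrent.put("_modified", modifiedLastSeen);
        finishCurrent.put("_id", op("$gt", idLastSeen));
        // 2. Continue with everything strictly newer.
        Map<String, Object> continueAfter = new LinkedHashMap<>();
        continueAfter.put("_modified", op("$gt", modifiedLastSeen));
        return List.of(finishCurrent, continueAfter);
    }

    /** After: a single inclusive filter; only _modified has to be tracked. */
    static Map<String, Object> filterAfter(long modifiedLastSeen) {
        Map<String, Object> filter = new LinkedHashMap<>();
        filter.put("_modified", op("$gte", modifiedLastSeen));
        return filter;
    }

    public static void main(String[] args) {
        System.out.println(filtersBefore(1000L, "2:/content/a"));
        System.out.println(filterAfter(1000L));
    }
}
```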

Other changes in the PR:

  • Do not use Thread interrupts to cancel download threads; instead, close only the Mongo connection. Thread interrupts are very problematic.
  • Reduce the frequency of logging of progress on the download threads.
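The cancellation approach from the first bullet can be sketched with a stand-in resource. In this simplified model, `FakeConnection` is a hypothetical class that throws from `next()` once it has been closed, and the worker thread stops when it sees that exception; no thread is interrupted. How the real Mongo client signals a closed connection to in-flight reads is not modeled here.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model: cancel a worker by closing its resource instead of interrupting it.
final class CloseToCancelSketch {

    /** Stand-in for a connection or cursor; not a real Mongo type. */
    static final class FakeConnection implements AutoCloseable {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();

        /** Waits for the next document; throws once the connection has been closed. */
        String next() throws InterruptedException {
            while (true) {
                if (closed.get()) {
                    throw new IllegalStateException("connection closed");
                }
                String doc = incoming.poll(50, TimeUnit.MILLISECONDS);
                if (doc != null) {
                    return doc;
                }
            }
        }

        @Override
        public void close() {
            closed.set(true);
        }
    }

    /** Starts a worker, cancels it by closing the connection, and reports whether it stopped. */
    static boolean runAndCancel() {
        FakeConnection connection = new FakeConnection();
        AtomicBoolean stoppedByClose = new AtomicBoolean(false);
        Thread downloader = new Thread(() -> {
            try {
                while (true) {
                    connection.next();
                }
            } catch (IllegalStateException e) {
                // Expected path: the resource was closed underneath the worker.
                stoppedByClose.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        downloader.start();
        connection.close(); // cancel without calling downloader.interrupt()
        try {
            downloader.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return stoppedByClose.get() && !downloader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped after close: " + runAndCancel());
    }
}
```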

…nsidered for download by matching only against the indexes for which the feature is enabled. Previously, it was checking against all indexes, which could lead to downloading blobs for nodes that are not indexed by an index that needs the blob.

Add tests for AOT blob downloader.
… download to the transform phase, which alleviates the load on the download threads, speeding up download.
When writing a sorted batch of node state entries to disk, skip duplicate entries.
Fix tests
…indexer/document/flatfile/pipelined/PipelinedTransformTask.java

Co-authored-by: Fabrizio Fortino <[email protected]>
3 participants