OAK-11232 - indexing-job - Simplify download from Mongo logic by traversing only by _modified instead of (_modified, _id) #1827

nfsantos · 2024-10-28T16:48:31Z

The Mongo downloader traverses the repository by order of the fields (_modified, _id). In case of disconnection from Mongo, this allows resuming the download from where it was interrupted without redownloading any document.

However, the downloader does not need to ensure that no duplicate documents are downloaded, because the merge-sort stage of the Pipelined strategy discards duplicates. Avoiding duplicates in case of reconnections is only a performance optimization for a relatively rare occurrence.
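To illustrate why duplicates coming out of the downloader are harmless, here is a minimal sketch of a k-way merge over sorted batches that drops repeated entries as it goes. This is not the actual Pipelined strategy code: `MergeDedupSketch`, `BatchCursor` and `mergeDiscardingDuplicates` are hypothetical names, and plain strings stand in for node state entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MergeDedupSketch {
    // Hypothetical cursor over one already-sorted batch.
    private static final class BatchCursor {
        private final List<String> batch;
        private int pos = 0;

        BatchCursor(List<String> batch) { this.batch = batch; }

        String head() { return batch.get(pos); }

        // Moves to the next entry; returns false when the batch is exhausted.
        boolean advance() { pos++; return pos < batch.size(); }
    }

    // Merges sorted batches into one sorted list, keeping one copy of each entry.
    static List<String> mergeDiscardingDuplicates(List<List<String>> sortedBatches) {
        Comparator<BatchCursor> byHead = Comparator.comparing(BatchCursor::head);
        PriorityQueue<BatchCursor> heap = new PriorityQueue<>(byHead);
        for (List<String> batch : sortedBatches) {
            if (!batch.isEmpty()) {
                heap.add(new BatchCursor(batch));
            }
        }
        List<String> merged = new ArrayList<>();
        String last = null;
        while (!heap.isEmpty()) {
            BatchCursor cursor = heap.poll();
            String entry = cursor.head();
            // Equal entries come out of the heap consecutively, so comparing
            // against the last emitted entry is enough to drop duplicates.
            if (!entry.equals(last)) {
                merged.add(entry);
                last = entry;
            }
            if (cursor.advance()) {
                heap.add(cursor);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeDiscardingDuplicates(Arrays.asList(
                Arrays.asList("/a", "/b", "/c"),
                Arrays.asList("/b", "/c", "/d"))));
    }
}
```

Because the merge already has to compare neighbouring entries to order them, discarding an equal neighbour costs almost nothing, which is why the downloader does not need to guarantee uniqueness itself.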

This PR simplifies the downloader by traversing only by _modified. In case of failure, the download resumes from the last value of _modified seen. So if the last _modified value seen was 1000, the downloader will again download the documents with _modified=1000 that it had previously downloaded. But this would likely take just a few seconds, even in the worst-case scenario. _modified has a resolution of 5 seconds, so the number of documents with the same value is limited by how much Oak can write to Mongo in a 5-second window. The downloader streams the results directly using a Mongo query, which is much faster than Oak can write. So the downloader will likely download all documents with the same _modified value in a fraction of the time it took Oak to write them, which is an acceptable overhead in the rare case of disconnection from Mongo.
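The resume behaviour can be sketched with an in-memory stand-in for the collection. Everything here is an assumption made for illustration: `ResumeByModifiedSketch`, `Doc`, `query` and `downloadWithOneFailure` are hypothetical names, the "query" is a filtered and sorted list rather than a real Mongo cursor, and the disconnection is simulated by cutting the first result set short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ResumeByModifiedSketch {
    // Hypothetical document: only the two fields relevant to the traversal.
    static final class Doc {
        final String id;
        final long modified;

        Doc(String id, long modified) {
            this.id = id;
            this.modified = modified;
        }
    }

    // Stand-in for "find _modified >= fromModified, sorted by _modified".
    // maxDocs truncates the result to simulate a connection that drops.
    static List<Doc> query(List<Doc> collection, long fromModified, int maxDocs) {
        return collection.stream()
                .filter(d -> d.modified >= fromModified)
                .sorted(Comparator.comparingLong((Doc d) -> d.modified))
                .limit(maxDocs)
                .collect(Collectors.toList());
    }

    // Downloads everything, with one simulated failure after failAfter documents.
    static List<Doc> downloadWithOneFailure(List<Doc> collection, int failAfter) {
        List<Doc> downloaded = new ArrayList<>();
        long lastModifiedSeen = Long.MIN_VALUE;
        // First attempt: the connection "drops" after failAfter documents.
        for (Doc d : query(collection, Long.MIN_VALUE, failAfter)) {
            downloaded.add(d);
            lastModifiedSeen = d.modified;
        }
        // Resume: only the last _modified value seen is needed. Documents with
        // exactly that value are downloaded again and left for later stages to drop.
        for (Doc d : query(collection, lastModifiedSeen, Integer.MAX_VALUE)) {
            downloaded.add(d);
        }
        return downloaded;
    }

    public static void main(String[] args) {
        List<Doc> collection = Arrays.asList(
                new Doc("a", 995), new Doc("b", 1000), new Doc("c", 1000),
                new Doc("d", 1000), new Doc("e", 1005));
        System.out.println(downloadWithOneFailure(collection, 3).size());
    }
}
```

In this example the failure happens in the middle of the documents with _modified=1000, so the two already-downloaded documents with that value are fetched a second time; nothing with a smaller _modified is repeated.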

This change greatly simplifies the logic of the downloader:

No need to track both the _modified and _id fields.
The reconnection logic becomes simpler. Before, on reconnection the downloader first had to run one query to finish downloading documents with the previously seen _modified (_modified=modified_last_seen, _id>id_last_seen) and then another query with _modified>modified_last_seen. Now it is enough to query for _modified>=modified_last_seen.
Slight speed gain from not having to deserialize the _id field in the downloader threads.
Potentially faster query execution on Mongo because the traverse and sort conditions are simpler: instead of traversing and ordering by (_modified, _id), it now does so only by _modified. I have not observed a significant speed-up in Mongo traversal speed, but in theory the Mongo optimizer has more freedom to optimize the new, simpler query.
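The difference between the old and new resume conditions can be written down as plain predicates. This is only a sketch of the conditions quoted above, not the downloader's query-building code; `ResumeConditionSketch`, `Doc` and the three factory methods are hypothetical names, and the predicates run against in-memory objects instead of Mongo.

```java
import java.util.function.Predicate;

class ResumeConditionSketch {
    // Hypothetical document: only the two fields used by the conditions.
    static final class Doc {
        final String id;
        final long modified;

        Doc(String id, long modified) {
            this.id = id;
            this.modified = modified;
        }
    }

    // Old logic, first query: finish the partially downloaded _modified value.
    static Predicate<Doc> oldFirstQuery(long modifiedLastSeen, String idLastSeen) {
        return d -> d.modified == modifiedLastSeen && d.id.compareTo(idLastSeen) > 0;
    }

    // Old logic, second query: continue with strictly newer documents.
    static Predicate<Doc> oldSecondQuery(long modifiedLastSeen) {
        return d -> d.modified > modifiedLastSeen;
    }

    // New logic: a single condition, no _id needed.
    static Predicate<Doc> newQuery(long modifiedLastSeen) {
        return d -> d.modified >= modifiedLastSeen;
    }

    public static void main(String[] args) {
        Doc alreadySeen = new Doc("2:/a/b", 1000);
        // Skipped by both old queries, matched again by the new one.
        System.out.println(oldFirstQuery(1000, "2:/a/c").or(oldSecondQuery(1000)).test(alreadySeen));
        System.out.println(newQuery(1000).test(alreadySeen));
    }
}
```

The new condition matches everything the two old queries matched, plus the documents with _modified=modified_last_seen and _id<=id_last_seen, which are exactly the ones downloaded twice after a reconnection.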

Other changes in the PR:

Do not use Thread interrupt to cancel download threads; instead, close only the Mongo connection. Thread interrupts are very problematic.
Reduce the frequency of logging of progress on the download threads.
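The idea of cancelling a download thread by closing the resource it reads from, instead of interrupting it, can be sketched as follows. This does not use the Mongo driver: `FakeCursor` is a hypothetical stand-in for a cursor whose blocking `next()` fails once it is closed, and `runAndCancel` is an invented helper for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class CloseToCancelSketch {
    // Hypothetical stand-in for a cursor backed by a connection.
    static final class FakeCursor implements AutoCloseable {
        private static final Object CLOSED = new Object();
        private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

        void offer(String doc) { queue.add(doc); }

        // Blocks until a document is available; fails once the cursor is closed.
        String next() {
            Object value;
            try {
                value = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted", e);
            }
            if (value == CLOSED) {
                queue.add(CLOSED); // keep the marker so any other reader also stops
                throw new IllegalStateException("cursor closed");
            }
            return (String) value;
        }

        @Override
        public void close() { queue.add(CLOSED); }
    }

    // Starts a download thread, cancels it by closing the cursor, and returns
    // the number of documents it read (or -1 if the thread did not stop).
    static int runAndCancel(String... docs) {
        FakeCursor cursor = new FakeCursor();
        for (String doc : docs) {
            cursor.offer(doc);
        }
        AtomicInteger downloaded = new AtomicInteger();
        Thread downloadThread = new Thread(() -> {
            try {
                while (true) {
                    cursor.next();
                    downloaded.incrementAndGet();
                }
            } catch (IllegalStateException closed) {
                // Expected exit path: the cursor was closed to cancel the download.
            }
        }, "download-sketch");
        downloadThread.start();
        cursor.close(); // cancel without calling Thread.interrupt()
        try {
            downloadThread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downloadThread.isAlive() ? -1 : downloaded.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndCancel("doc1", "doc2"));
    }
}
```

The download thread has a single, predictable exit path (the exception thrown by the closed cursor), so it does not have to check or restore the interrupt flag at every blocking call.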

…nsidered for download by matching only against the indexes for which the feature is enabled. Previously, it was checking against all indexes, which could lead to downloading blobs for nodes that are not indexed by an index that needs the blob. Add tests for AOT blob downloader.


nfsantos added 30 commits September 25, 2024 09:19

Reformat

23ceb1d

Reformat

964f1c8

Merge remote-tracking branch 'upstream/trunk' into OAK-11131

9726b0e

Revert changes that are to be included in another PR.

7403545

Merge branch 'OAK-11131' into POC-mongo-cursor-binary

5d85255

Merge remote-tracking branch 'upstream/trunk' into POC-mongo-cursor-binary

5aec1d4

Merge remote-tracking branch 'upstream/trunk' into POC-mongo-cursor-binary

be88d59

Move the work of creating NodeDocuments from Mongo responses from the download to the transform phase, which alleviates the load on the download threads, speeding up download.

a89ff3f

Merge remote-tracking branch 'upstream/trunk' into OAK-11158

d1052d4

Simplify logic.

ea1a97b

Fix previous commit.

55c115a

Remove no longer used class.

a148223

Download by modified only

cfe8078

Merge remote-tracking branch 'upstream/trunk' into POC_download_by_modified_only

8bee529

Fix tests for MongoParallelDownloadCoordinator

61a4635

Merge remote-tracking branch 'upstream/trunk' into POC_download_by_modified_only

c9a5f59

Merge remote-tracking branch 'upstream/trunk' into POC_download_by_modified_only

07f2428

Fix tracking of last seen _modified value.

816eaa2

When writing a sorted batch of node state entries to disk, skip duplicate entries. Fix tests

Merge remote-tracking branch 'upstream/trunk' into OAK-11232

5e22a27

Improve test report

928c53d

Test

7eead81

Fix

4fcc065

Fix

c75656c

Use virtual clock on Pipelined IT tests.

83f5cdf

Merge remote-tracking branch 'upstream/trunk' into OAK-11232

508f1b7

Merge remote-tracking branch 'upstream/trunk' into OAK-11232

2b6f8ea

De-duplicate entries when writing sorted intermediate batches of node state entries.

995dad8

Merge remote-tracking branch 'upstream/trunk' into OAK-11232

e93003b

Do not use Thread interrupt to cancel download threads, instead close only the Mongo connection.

9f5adcc

nfsantos added 6 commits October 30, 2024 17:31

Revert changes to NodeDocument

bde5e5a

Decrease frequency of logging progress in download task

3b2b1eb

Merge remote-tracking branch 'upstream/trunk' into OAK-11232

3b7e391

Improve documentation.

4b9e8fa

Fix

b4743fd

Add documentation

8baa553

fabriziofortino approved these changes Oct 31, 2024

.../apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedTransformTask.java (outdated, resolved)

Update oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedTransformTask.java

996e658

Co-authored-by: Fabrizio Fortino

thomasmueller approved these changes Nov 1, 2024
