-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for missing HybridQuery results when concurrent segment search is enabled #800
Fix for missing HybridQuery results when concurrent segment search is enabled #800
Conversation
401f070
to
6570bcb
Compare
src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java
Show resolved
Hide resolved
BWC for 2.15 will keep failing because main is still pointing to snapshot and as per our process we're changing that after the release. |
a3a09b1
to
1c8f50a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #800 +/- ##
============================================
+ Coverage 85.02% 85.21% +0.19%
- Complexity 790 856 +66
============================================
Files 60 68 +8
Lines 2430 2686 +256
Branches 410 432 +22
============================================
+ Hits 2066 2289 +223
- Misses 202 222 +20
- Partials 162 175 +13 ☔ View full report in Codecov by Sentry. |
1c8f50a
to
838cdfb
Compare
b492845
to
5c8bcc0
Compare
src/main/java/org/opensearch/neuralsearch/search/util/ScoreDocsMerger.java
Outdated
Show resolved
Hide resolved
* Utility class for merging TopDocs and MaxScore across multiple search queries | ||
*/ | ||
@RequiredArgsConstructor | ||
public class TopDocsMerger { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: HybridQueryTopDocsMerger. This class is exclusive for Hybrid Query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's actually generic, there isn't any special logic for hybrid query. Will leave name as it's now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The merge method is exclusive for hybrid query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, it merges two TopDocsAndMaxScore object, only specific of hybrid query is in ScoreDocs merger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @martin-gaievski
src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java
Show resolved
Hide resolved
Overall looks good to me. Just waiting for @navneet1v review. |
Signed-off-by: Martin Gaievski <[email protected]>
a853197
to
1ed247e
Compare
@martin-gaievski Opensearch defines more control on top of this, to how to slice the segments. Ref: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms. are you taking any assumptions based on lucene slicing mechanism? |
src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/search/util/HybridQueryScoreDocsMerger.java
Outdated
Show resolved
Hide resolved
thanks for sharing that link, I was not aware of the OpenSearch specific approach. Anyhow, I'm not making any assumptions based on number of segments or any other condition of how the slices are constructed. Main principle will be same - it will be either one collector or many collectors with search results. Previously we were handling results from one collector, now we can handle scenario with multiple collectors. |
ed4ee13
to
41c5de5
Compare
Ran a benchmark of https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/noaa_semantic_search on main with and without this change, here are results for medium and large queries, where we have up to 8 segments and where the impact should be the visible the most: With the change the performance is actually better. I'm taking one more round for baseline to make sure data isn't flaky, but these are preliminary results. baseline:
with this change
|
…or param Signed-off-by: Martin Gaievski <[email protected]>
41c5de5
to
d7bb73a
Compare
LGTM |
25d2e82
into
opensearch-project:main
… enabled (#800) * Adding merge logic for multiple collector result case Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 25d2e82)
… enabled (#800) * Adding merge logic for multiple collector result case Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 25d2e82)
I re-ran the benchmark, it shows no change for large sub-set and 5% delta for medium sub-set. that looks more realistic as we only add some computation. baseline:
after the change
|
… enabled (#800) (#805) * Adding merge logic for multiple collector result case Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 25d2e82) Co-authored-by: Martin Gaievski <[email protected]>
… enabled (#800) (#804) * Adding merge logic for multiple collector result case Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 25d2e82) Co-authored-by: Martin Gaievski <[email protected]>
Description
Fixed gap in implementation for case when shard has multiple (6+) segments. In such case some hits were missing in final hybrid query result.
Issue caused by wrong assumption that collector manager can have only one hybrid query result collector in reduce phase. It's true for number of segments < 6, otherwise core passes multiple collectors with a portion of result each. We have to merge those results at shard level to have correct collection of hits.
Lucene code where they define limit per slice (block that is processed by one collector). It's max of 250.000 docs or 5 segments.
Logic for merge is a bit tricky because we have to deal with TopDocs that has been formatted to special hybrid query format. Each next collector result should be merge into query result one by one. On each merge we need to find results of one sub-query and merge then separately, then wrap into hybrid query result format.
Example:
TopDocs in query result:
result from next block of segments:
merged result:
Added extensive list of unit tests and integ test that would fail with old logic (assertion of total hits, actual number would be lower).
Issues Resolved
#799
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.