Fix for missing HybridQuery results when concurrent segment search is enabled #800

martin-gaievski · 2024-06-21T00:58:05Z

Description

Fixed a gap in the implementation for the case when a shard has multiple (6+) segments. In that case some hits were missing from the final hybrid query result.
The issue was caused by the wrong assumption that the collector manager can have only one hybrid query result collector in the reduce phase. That holds when the number of segments is < 6; otherwise core passes multiple collectors, each holding a portion of the results. We have to merge those results at the shard level to get the correct collection of hits.

Lucene code defines the limit per slice (the block that is processed by one collector): at most 250,000 docs or 5 segments per slice.
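To illustrate why 6 segments is the tipping point, here is a simplified sketch of slice grouping under the two limits named above. It is not the Lucene source: segments are reduced to plain doc counts, and the class and method names (`SliceSketch`, `slices`) are hypothetical. OpenSearch can also apply its own slicing on top of this, so treat it only as an illustration of the default limits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified sketch: groups segments (given as doc counts) into slices. */
final class SliceSketch {
    // Limits as stated in the description above
    static final int MAX_DOCS_PER_SLICE = 250_000;
    static final int MAX_SEGMENTS_PER_SLICE = 5;

    static List<List<Integer>> slices(List<Integer> segmentDocCounts) {
        // Largest segments first, so oversized ones are handled before any grouping starts
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder());

        List<List<Integer>> result = new ArrayList<>();
        List<Integer> group = null;
        long docSum = 0;
        for (int docs : sorted) {
            if (docs > MAX_DOCS_PER_SLICE) {
                // A segment above the doc limit gets a slice of its own
                result.add(List.of(docs));
                continue;
            }
            if (group == null) {
                group = new ArrayList<>();
                result.add(group);
            }
            group.add(docs);
            docSum += docs;
            // Close the slice once either limit is reached
            if (group.size() >= MAX_SEGMENTS_PER_SLICE || docSum > MAX_DOCS_PER_SLICE) {
                group = null;
                docSum = 0;
            }
        }
        return result;
    }
}
```

With five small segments this yields a single slice (one collector); a sixth small segment opens a second slice, which is the multi-collector case this PR handles.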

The merge logic is a bit tricky because we have to deal with TopDocs that have been formatted into the special hybrid query format. Each subsequent collector result is merged into the query result one by one. On each merge we need to find the results of each sub-query and merge them separately, then wrap them back into the hybrid query result format.

Example:
TopDocs in query result:

query1: {doc1: 10, doc3: 5, doc5: 3}
query2: {doc2: 3, doc3: 1}

result from next block of segments:

query1: {doc2: 11, doc6: 2}
query2: {doc5: 2, doc6: 1}

merged result:

query1: {doc2: 11, doc1: 10, doc3: 5, doc5: 3, doc6: 2}
query2: {doc2: 3, doc5: 2, doc3: 1, doc6: 1}
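The example above can be reproduced with a small sketch of the per-sub-query merge. This is only an illustration under assumptions: `Hit` is a hypothetical stand-in for a Lucene score doc, each sub-query's hits are assumed to be already sorted by descending score, and the wrapping into the hybrid query format is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SubQueryMergeSketch {
    /** Hypothetical stand-in for a score doc: a document id and its score. */
    static final class Hit {
        final String doc;
        final float score;

        Hit(String doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges two lists that are already sorted by descending score. */
    static List<Hit> mergeSorted(List<Hit> current, List<Hit> next) {
        List<Hit> merged = new ArrayList<>(current.size() + next.size());
        int i = 0;
        int j = 0;
        while (i < current.size() && j < next.size()) {
            // On equal scores the hit from the existing result goes first
            merged.add(current.get(i).score >= next.get(j).score ? current.get(i++) : next.get(j++));
        }
        while (i < current.size()) merged.add(current.get(i++));
        while (j < next.size()) merged.add(next.get(j++));
        return merged;
    }

    /** Merges sub-query by sub-query; both inputs must cover the same sub-queries in the same order. */
    static List<List<Hit>> mergePerSubQuery(List<List<Hit>> current, List<List<Hit>> next) {
        if (current.size() != next.size()) {
            throw new IllegalArgumentException("sub-query count mismatch");
        }
        List<List<Hit>> merged = new ArrayList<>();
        for (int q = 0; q < current.size(); q++) {
            merged.add(mergeSorted(current.get(q), next.get(q)));
        }
        return merged;
    }

    /** Helper for inspecting results: the document ids in merged order. */
    static List<String> ids(List<Hit> hits) {
        List<String> ids = new ArrayList<>();
        for (Hit hit : hits) ids.add(hit.doc);
        return ids;
    }
}
```

Feeding the two blocks from the example into `mergePerSubQuery` gives the order shown in the merged result for both query1 and query2.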

Added an extensive list of unit tests and an integ test that would fail with the old logic (the assertion on total hits would fail because the actual number would be lower).

Issues Resolved

#799

Check List

- New functionality includes testing.
- [ ] All tests pass
- New functionality has javadoc added
- Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java

martin-gaievski · 2024-06-21T01:27:13Z

BWC for 2.15 will keep failing because main is still pointing to the snapshot; as per our process, we change that after the release.
The security check CI action will also fail for JDK 11 and 17 because core has changed the minimum requirement for the security plugin to JDK 21. CI for JDK 21 is passing. Created a PR to address it for the plugin: #801

codecov · 2024-06-21T16:42:33Z

Codecov Report

Attention: Patch coverage is 82.27848% with 14 lines in your changes missing coverage. Please review.

Project coverage is 85.21%. Comparing base (7c54c86) to head (ed4ee13).
Report is 14 commits behind head on main.

❗ Current head ed4ee13 differs from pull request most recent head d7bb73a

Please upload reports for the commit d7bb73a to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| ...earch/search/query/HybridQueryScoreDocsMerger.java | 82.14% | 0 Missing and 5 partials ⚠️ |
| ...ralsearch/search/query/HybridCollectorManager.java | 87.87% | 2 Missing and 2 partials ⚠️ |
| ...arch/search/util/HybridSearchResultFormatUtil.java | 33.33% | 2 Missing and 2 partials ⚠️ |
| ...earch/neuralsearch/search/query/TopDocsMerger.java | 91.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #800      +/-   ##
============================================
+ Coverage     85.02%   85.21%   +0.19%     
- Complexity      790      856      +66     
============================================
  Files            60       68       +8     
  Lines          2430     2686     +256     
  Branches        410      432      +22     
============================================
+ Hits           2066     2289     +223     
- Misses          202      222      +20     
- Partials        162      175      +13

src/main/java/org/opensearch/neuralsearch/search/util/ScoreDocsMerger.java

vibrantvarun · 2024-06-24T22:30:01Z

src/main/java/org/opensearch/neuralsearch/search/util/TopDocsMerger.java

+ * Utility class for merging TopDocs and MaxScore across multiple search queries
+ */
+@RequiredArgsConstructor
+public class TopDocsMerger {


nit: HybridQueryTopDocsMerger. This class is exclusive to Hybrid Query.

It's actually generic; there isn't any special logic for hybrid query. I'll leave the name as it is now.

The merge method is exclusive to hybrid query.

Not really, it merges two TopDocsAndMaxScore objects; the only part specific to hybrid query is in the ScoreDocs merger.

+1 to @martin-gaievski
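The split discussed in this thread can be pictured with a minimal sketch: a generic merger that combines two results and delegates the score docs to a pluggable strategy. Everything here is an assumption for illustration, not the code under review: `Result` is a hypothetical stand-in for TopDocsAndMaxScore, total hits are assumed to be summed, and max score is assumed to be the larger of the two.

```java
import java.util.function.BinaryOperator;

final class TopDocsMergeSketch {
    /** Hypothetical stand-in for TopDocsAndMaxScore; S is whatever type holds the score docs. */
    static final class Result<S> {
        final long totalHits;
        final float maxScore;
        final S scoreDocs;

        Result(long totalHits, float maxScore, S scoreDocs) {
            this.totalHits = totalHits;
            this.maxScore = maxScore;
            this.scoreDocs = scoreDocs;
        }
    }

    /**
     * Generic part: combines hit counts and max score (assumed: sum and max).
     * Format-specific part: merging the score docs is delegated to scoreDocsMerger.
     */
    static <S> Result<S> merge(Result<S> first, Result<S> second, BinaryOperator<S> scoreDocsMerger) {
        return new Result<>(
            first.totalHits + second.totalHits,
            Math.max(first.maxScore, second.maxScore),
            scoreDocsMerger.apply(first.scoreDocs, second.scoreDocs)
        );
    }
}
```

In this shape only the `scoreDocsMerger` argument would need to know about the hybrid query format, which is the point made above about the class name.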

CHANGELOG.md

src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java

vibrantvarun · 2024-06-25T01:30:43Z

Overall looks good to me. Just waiting for @navneet1v review.

navneet1v · 2024-06-25T06:35:43Z

> Lucene code where they define limit per slice (block that is processed by one collector). It's max of 250.000 docs or 5 segments.

@martin-gaievski OpenSearch defines more control on top of this over how the segments are sliced. Ref: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms. Are you making any assumptions based on the Lucene slicing mechanism?

src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java

src/main/java/org/opensearch/neuralsearch/search/util/HybridQueryScoreDocsMerger.java

martin-gaievski · 2024-06-25T15:24:41Z

> Lucene code where they define limit per slice (block that is processed by one collector). It's max of 250.000 docs or 5 segments.
>
> @martin-gaievski Opensearch defines more control on top of this, to how to slice the segments. Ref: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms. are you taking any assumptions based on lucene slicing mechanism?

Thanks for sharing that link, I was not aware of the OpenSearch-specific approach. In any case, I'm not making any assumptions based on the number of segments or any other condition of how the slices are constructed. The main principle is the same: there will be either one collector or many collectors with search results. Previously we handled results from only one collector; now we can handle the scenario with multiple collectors.
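A minimal sketch of that principle, independent of how slices are built: fold whatever number of collector results arrives into one shard-level result. The names (`ReduceSketch`, `firstOnly`, `reduce`) are hypothetical, and `firstOnly` only illustrates the earlier single-collector assumption rather than reproducing the old code.

```java
import java.util.Collection;
import java.util.function.BinaryOperator;

final class ReduceSketch {
    /** Illustrates the earlier assumption: only one collector result is ever looked at. */
    static <R> R firstOnly(Collection<R> collectorResults) {
        return collectorResults.iterator().next();
    }

    /** Folds one or many collector results into a single result using the given merger. */
    static <R> R reduce(Collection<R> collectorResults, BinaryOperator<R> merger) {
        R merged = null;
        for (R result : collectorResults) {
            // The first result is taken as is; every later one is merged into it
            merged = (merged == null) ? result : merger.apply(merged, result);
        }
        if (merged == null) {
            throw new IllegalStateException("no collector results to reduce");
        }
        return merged;
    }
}
```

If two collectors report 4 and 2 hits, `firstOnly` sees 4 while `reduce` with a summing merger sees 6, which mirrors the lower total-hits count described in the PR description.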

martin-gaievski · 2024-06-25T17:13:04Z

Ran the https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/noaa_semantic_search benchmark on main with and without this change. Here are the results for medium and large queries, where we have up to 8 segments and where the impact should be most visible:

With the change the performance is actually better. I'm running one more round for the baseline to make sure the data isn't flaky, but these are preliminary results.

baseline:

   {
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.011052370071411,
     "mean": 2.021946349143982,
     "median": 2.018242835998535,
     "max": 2.0519936084747314,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 91.03496170043945,
     "90_0": 96.44083023071289,
     "99_0": 98.4694938659668,
     "100_0": 98.7366943359375,
     "mean": 91.10263374328613,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
-----
  {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0065014362335205,
     "mean": 2.01287588596344,
     "median": 2.010721802711487,
     "max": 2.0304625034332275,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 188.2995376586914,
     "90_0": 204.8991470336914,
     "99_0": 218.81844329833984,
     "100_0": 222.81707763671875,
     "mean": 186.44740097045897,
     "unit": "ms"
    },
    "error_rate": 0.0
   },

with this change

   {
    "task": "hybrid-query-only-range",
    "operation": "hybrid-query-only-range",
    "throughput": {
     "min": 2.0113322734832764,
     "mean": 2.0225303030014037,
     "median": 2.0187013149261475,
     "max": 2.053476572036743,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 72.53012466430664,
     "90_0": 73.0995979309082,
     "99_0": 89.77497482299805,
     "100_0": 105.39061737060547,
     "mean": 72.93198219299316,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
---
  {
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0104377269744873,
     "mean": 2.0207511043548583,
     "median": 2.0172306299209595,
     "max": 2.0493412017822266,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 100.66146469116211,
     "90_0": 104.38032913208008,
     "99_0": 106.95694351196289,
     "100_0": 107.02909088134766,
     "mean": 100.12281967163086,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

vibrantvarun · 2024-06-25T17:55:05Z

LGTM

martin-gaievski · 2024-06-25T18:37:26Z

I re-ran the benchmark; it shows no change for the large subset and a 5% delta for the medium subset. That looks more realistic, as we only add some computation.

baseline:

{
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0109918117523193,
     "mean": 2.0218198919296264,
     "median": 2.0181103944778442,
     "max": 2.051851272583008,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 96.55980682373047,
     "90_0": 99.6461067199707,
     "99_0": 102.91871643066406,
     "100_0": 103.3681869506836,
     "mean": 95.76778579711915,
     "unit": "ms"
    },
    "duration": 62003.01305705216
   }
---
   {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0072591304779053,
     "mean": 2.014420256614685,
     "median": 2.0119868516921997,
     "max": 2.0340819358825684,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 196.891845703125,
     "90_0": 209.67161560058594,
     "99_0": 224.50284576416016,
     "100_0": 227.4385986328125,
     "mean": 195.78350692749024,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

after the change

{
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0046942234039307,
     "mean": 2.0093229246139526,
     "median": 2.0077611207962036,
     "max": 2.0221428871154785,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 88.03913879394531,
     "90_0": 93.1028938293457,
     "99_0": 104.01826095581055,
     "100_0": 106.14491271972656,
     "mean": 88.63995048522949,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
---
   {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0080811977386475,
     "mean": 2.0161024808883665,
     "median": 2.0133992433547974,
     "max": 2.0380523204803467,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 200.56720733642578,
     "90_0": 215.31163024902344,
     "99_0": 229.7359161376953,
     "100_0": 233.6544952392578,
     "mean": 196.9455467224121,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

martin-gaievski added the bug, backport 2.x, and hybrid search labels Jun 21, 2024

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch 2 times, most recently from 401f070 to 6570bcb Compare June 21, 2024 01:08

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch 3 times, most recently from a3a09b1 to 1c8f50a Compare June 21, 2024 16:23

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from 1c8f50a to 838cdfb Compare June 21, 2024 18:03

martin-gaievski added the v2.16.0 label Jun 21, 2024

martin-gaievski marked this pull request as ready for review June 21, 2024 18:24

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners June 21, 2024 18:24

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch 2 times, most recently from b492845 to 5c8bcc0 Compare June 21, 2024 18:48

vibrantvarun reviewed Jun 24, 2024

src/main/java/org/opensearch/neuralsearch/search/util/ScoreDocsMerger.java (outdated)

vibrantvarun reviewed Jun 24, 2024

CHANGELOG.md (outdated)

vibrantvarun reviewed Jun 24, 2024

src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java

martin-gaievski changed the title ~~Fixed merge logic for multiple collector result case~~ Fix for missing HybridQuery results when concurrent segment search is enabled Jun 24, 2024

martin-gaievski requested a review from vibrantvarun June 24, 2024 23:47

Some refactoring after code review

1ed247e

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from a853197 to 1ed247e Compare June 25, 2024 04:16

navneet1v reviewed Jun 25, 2024

src/main/java/org/opensearch/neuralsearch/search/query/HybridCollectorManager.java (outdated)

navneet1v reviewed Jun 25, 2024

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from ed4ee13 to 41c5de5 Compare June 25, 2024 16:52

Made merger classes package private, make topdocs merger as constructor param

d7bb73a

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from 41c5de5 to d7bb73a Compare June 25, 2024 17:30

vibrantvarun approved these changes Jun 25, 2024

navneet1v approved these changes Jun 25, 2024

martin-gaievski merged commit 25d2e82 into opensearch-project:main Jun 25, 2024
62 of 69 checks passed

opensearch-trigger-bot bot mentioned this pull request Jun 25, 2024

[Backport 2.x] Fix for missing HybridQuery results when concurrent segment search is enabled #804

Merged

martin-gaievski added the backport 2.15 label Jun 25, 2024

opensearch-trigger-bot bot mentioned this pull request Jun 25, 2024

[Backport 2.15] Fix for missing HybridQuery results when concurrent segment search is enabled #805

Merged

vibrantvarun mentioned this pull request Jun 28, 2024

[Part 3] Concurrent segment search bug in Sorting #808

Merged
