[Star Tree][Search][RFC] Parse aggregation request to resolve via star tree data structure #14871

sandeshkr419 · 2024-07-22T19:35:15Z

With support for star-tree composite indices, we would like to resolve certain aggregation & search paths via the star tree itself. I am thinking of 2 possibilities:

1. Introduce a new search request which explicitly specifies the star tree in the request construct. This means the user has to learn a new capability or query type. The biggest pitfall of this approach is that the query also has to be formed in a certain way for it to execute via star-tree.
2. Internally identify query shapes which can be resolved using the star tree, and then internally parse the request to a StarTreeQuery & StarTreeAggregator. This way the user does not have to worry about which queries can or cannot be resolved using the star tree, and the default search path is followed when the query cannot be resolved via the star tree. This does not involve user intervention once they have created the star tree during index creation. Discussion on disabling search via star-tree continues in [Search] [Star Tree] Option/Param to Disable search via star-tree #14872

I'm in support of approach 2, as this way we do not introduce new overhead for search users to reframe their queries for star-tree. Also, as this is a feature in development, the full search capabilities of the star tree will be developed incrementally.

For 2/, we would want to keep the search request & search response intact. One such request/response to start building up the framework for star tree request execution will be an aggregation request with groupby/nested aggregation.

In a default search path execution, the query and aggregation path are independently executed. In the star tree code path, the query and aggregation will be tightly coupled and this requires decision making on setting up correct star-tree query & aggregation pair during request parsing itself.
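To make the coupling concrete, here is a minimal sketch of a parse-time decision that picks the query and the aggregator together and falls back to the default path otherwise. Everything in it is an assumption for illustration: the star-tree configuration, the supported shapes (a term filter on a dimension, terms on a dimension, sum on a configured metric) and the function names are hypothetical and are not OpenSearch APIs.

```python
# Hypothetical star-tree configuration; names are illustrative only.
STAR_TREE_DIMENSIONS = {"status", "clientip"}
STAR_TREE_METRICS = {("size", "sum")}  # (field, metric type) pairs


def query_supported(query):
    """An absent query, or a single term filter on a dimension, is supported."""
    if query is None:
        return True
    if set(query) != {"term"}:
        return False
    return set(query["term"]) <= STAR_TREE_DIMENSIONS


def aggs_supported(aggs):
    """Supported: terms on a dimension (optionally nested) and known metrics."""
    if not aggs:
        return False
    for agg in aggs.values():
        kinds = [k for k in agg if k != "aggs"]
        if len(kinds) != 1:
            return False
        kind = kinds[0]
        field = agg[kind].get("field")
        if kind == "terms":
            if field not in STAR_TREE_DIMENSIONS:
                return False
            if agg.get("aggs") and not aggs_supported(agg["aggs"]):
                return False
        elif (field, kind) not in STAR_TREE_METRICS or "aggs" in agg:
            return False
    return True


def plan(request):
    """Pick the query and aggregator together, never one without the other."""
    if query_supported(request.get("query")) and aggs_supported(request.get("aggs")):
        return ("StarTreeQuery", "StarTreeAggregator")
    return ("OriginalQuery", "OriginalAggregator")
```

The point of returning a pair is that a star-tree query must never be combined with a default aggregator (or vice versa), since the two paths iterate over different documents.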

I have been thinking of something similar to a POC I did here: creating the query/aggregation pair during request parsing itself.

Sample Search Aggregation Request:

{
    "size": 0,
    "aggs": {
        "group_by_clientip": {
            "terms": {
                "field": "status"
            },
            "aggs": {
                "max_status": {
                    "sum": {
                        "field": "size"
                    }
                }
            }
        }
    }
}

Sample Response Expected:

(this is non-star tree response):

{
    "took": 1971,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1018,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "group_by_clientip": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 484,
            "buckets": [
                {
                    "key": 209,
                    "doc_count": 63,
                    "max_status": {
                        "value": 39844.0
                    }
                },
                {
                    "key": 217,
                    "doc_count": 58,
                    "max_status": {
                        "value": 35754.0
                    }
                },
                {
                    "key": 201,
                    "doc_count": 53,
                    "max_status": {
                        "value": 32154.0
                    }
                },
                {
                    "key": 220,
                    "doc_count": 53,
                    "max_status": {
                        "value": 28525.0
                    }
                },
                {
                    "key": 208,
                    "doc_count": 52,
                    "max_status": {
                        "value": 33543.0
                    }
                },
                {
                    "key": 210,
                    "doc_count": 52,
                    "max_status": {
                        "value": 29480.0
                    }
                },
                {
                    "key": 213,
                    "doc_count": 52,
                    "max_status": {
                        "value": 28841.0
                    }
                },
                {
                    "key": 202,
                    "doc_count": 51,
                    "max_status": {
                        "value": 31026.0
                    }
                },
                {
                    "key": 216,
                    "doc_count": 51,
                    "max_status": {
                        "value": 31242.0
                    }
                },
                {
                    "key": 204,
                    "doc_count": 49,
                    "max_status": {
                        "value": 30802.0
                    }
                }
            ]
        }
    }
}


bharath-techie · 2024-07-23T11:58:20Z

Hi @sandeshkr419,
Thanks for raising the issue.

We need an extensive design for the following features :

1. Auto detection of queries that can be solved via star tree at request / index / shard level

- From the query and aggregation parts of the user query: parse the dimensions and the filters on each dimension
- From the aggregations part of the user query: parse the required metrics

We need more details on how we can solve this for one query / aggregation and how this can be extended to other queries / aggregations easily. [ Extensibility of the approach to newer queries / aggregations ]

Secondly, we need validations and decisions on whether the given query can be solved via a star tree index and, if so, which star tree index should be used.

Since we have built indexing flows from ground up with support for multiple star tree indices per index, we need a high level plan/design on how search flows will decide on which star tree to use based on the configuration.
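As a rough sketch of the two steps above (extracting the required dimensions and metrics, then choosing among multiple star trees), something like the following could work. The request handling is deliberately simplified to term filters, terms aggregations and metric aggregations, and the configuration layout, the function names and the "fewest dimensions wins" tie-break are assumptions, not the actual design.

```python
def required_fields(request):
    """Collect the dimensions and (field, metric) pairs a request needs."""
    dims, metrics = set(), set()
    # Filters: only term filters are considered in this sketch.
    for field in request.get("query", {}).get("term", {}):
        dims.add(field)

    def walk(aggs):
        for agg in aggs.values():
            for kind, body in agg.items():
                if kind == "aggs":
                    walk(body)  # nested aggregations
                elif kind == "terms":
                    dims.add(body["field"])  # bucket key is a dimension
                else:
                    metrics.add((body["field"], kind))  # e.g. ("size", "sum")

    walk(request.get("aggs", {}))
    return dims, metrics


def choose_star_tree(request, star_trees):
    """Return the name of a star tree that covers the request, or None."""
    dims, metrics = required_fields(request)
    candidates = [
        tree for tree in star_trees
        if dims <= tree["dimensions"] and metrics <= tree["metrics"]
    ]
    if not candidates:
        return None  # fall back to the default search path
    # Assumed heuristic: fewer dimensions usually means fewer star-tree docs.
    return min(candidates, key=lambda tree: len(tree["dimensions"]))["name"]
```

Extending this to new query or aggregation types would mean adding cases to `required_fields` only, which is one way to keep the approach extensible.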

2. Segment level decisions

Similar to 'IndexOrDocValuesQuery', we might need an 'OriginalOrStarTreeQuery', as there are certain scenarios where the original query must be used.
For example,
- It could be based on 'cost'
- Based on factors like 'cardinality' [ say the star tree index has too many documents, or star tree docs > total segment docs ]
- Deleted documents are present in the segment, etc.

In this case, we can't take the decision at the request parsing stage as proposed in this doc, because the aggregators to be used depend on the query used (star tree documents and segment documents differ, and so do the aggregators).
We might need to take the decision for each leaf reader context based on the query, or use some other method that decides which aggregators to use based on which query is used for the segment.
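The per-segment decision could be sketched roughly as below. The `SegmentStats` fields and the specific rules are assumptions that mirror the examples listed above (deleted documents, star tree docs exceeding segment docs); a real implementation would read these from the leaf reader and may use a proper cost model.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Hypothetical per-segment statistics used to pick an execution path."""
    max_doc: int              # total documents in the segment
    num_deleted_docs: int     # deleted documents in the segment
    star_tree_doc_count: int  # 0 means the segment has no star tree


def use_star_tree(segment: SegmentStats) -> bool:
    if segment.star_tree_doc_count == 0:
        return False  # nothing to use
    if segment.num_deleted_docs > 0:
        return False  # deleted documents present, use the original query
    if segment.star_tree_doc_count > segment.max_doc:
        return False  # star tree is larger than the segment itself
    return True


def plan_segments(segments):
    """Decide independently for each segment (leaf reader context)."""
    return ["star_tree" if use_star_tree(s) else "original" for s in segments]
```

Because the choice is made per segment, the aggregator for that segment has to be chosen at the same point, so that it matches the documents the selected query will produce.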

3. Response correctness and scenarios

The star tree index can be used to solve nested aggregations and multiple aggregations efficiently, as a single document contains the keys of the buckets [dimensions] and the resultant metrics for those keys.

We currently return a nested response when users query with nested aggregations, and we have to pass through multiple aggregators.
We can cover how we're handling the same with the star tree query/aggregator. The response structure must be the same.

Size
Some aggregations, like the terms and multi-terms aggregations, return the top n results as buckets, based on the size parameter in the request and ordered by doc count.
Ordering
Most aggregations seem to be ordered by doc count etc. We need to check the ordering and see how it works with single, multiple and nested aggregations.

We need a plan on how we are solving the same in star tree query flow using star tree index structures.
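One way to picture the size and ordering requirements is the sketch below, which builds terms buckets from pre-aggregated star-tree rows. The row layout (`doc_count`, `sum_size`) and the function are hypothetical. The ordering used here (doc count descending, key ascending on ties) is an assumption, although it is consistent with the sample response earlier in this thread.

```python
from collections import defaultdict


def terms_buckets(star_tree_docs, dimension, metric, sub_agg_name, size=10):
    """Merge star-tree rows per key, then keep the top `size` buckets."""
    doc_counts = defaultdict(int)
    sums = defaultdict(float)
    for doc in star_tree_docs:
        key = doc[dimension]
        doc_counts[key] += doc["doc_count"]  # rows already hold counts
        sums[key] += doc[metric]             # and pre-aggregated sums

    # Doc count descending, then key ascending to break ties.
    ordered = sorted(doc_counts, key=lambda k: (-doc_counts[k], k))
    buckets = [
        {"key": k, "doc_count": doc_counts[k], sub_agg_name: {"value": sums[k]}}
        for k in ordered[:size]
    ]
    other = sum(doc_counts[k] for k in ordered[size:])
    return {"sum_other_doc_count": other, "buckets": buckets}
```

Note that `doc_count` has to come from the stored count in each star-tree row rather than from the number of rows, otherwise both the ordering and `sum_other_doc_count` would differ from the default path.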

4. Star tree query algorithm

This can cover the actual algorithm that is used to traverse the star tree to solve queries and the limitations of the algorithms etc.

If possible, also cover the scorer and weight structure of the query that is used along with the algorithm.

Some of the things to cover :

We need to check how conversions work for the star tree fields and how we can query the same.

During indexing, dimensions are converted to long regardless of the original index field's data type (similar to how current fields are stored in SortedNDV).
Metrics are saved based on the metric aggregator type ('sum', 'min', 'max' save the values as double whereas count is stored as 'long').
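For reference, the sketch below shows one well-known way to map a double to an order-preserving long, which is the technique behind Lucene's `NumericUtils.doubleToSortableLong`. Whether star tree dimensions and metrics use exactly this encoding is an assumption here; the point is only that query-side values must go through the same conversion as the indexed values before they can be compared.

```python
import struct

_MASK_63 = 0x7FFFFFFFFFFFFFFF


def double_to_sortable_long(value: float) -> int:
    """Encode a double as a signed 64-bit int that sorts in the same order."""
    bits = struct.unpack(">q", struct.pack(">d", value))[0]
    # For negative values, flip the lower 63 bits so ordering is preserved.
    return bits ^ ((bits >> 63) & _MASK_63)


def sortable_long_to_double(encoded: int) -> float:
    """Inverse of double_to_sortable_long (the transform is its own inverse)."""
    bits = encoded ^ ((encoded >> 63) & _MASK_63)
    return struct.unpack(">d", struct.pack(">q", bits))[0]
```

A filter such as "size >= 2.5" on a double field would then be evaluated against stored longs by comparing with `double_to_sortable_long(2.5)`.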

Details on how we query and combine the results of dimensions present as part of star tree and for dimensions not present as part of star tree.

Please raise issues for the above and any other broader issues that need to be solved, and let's close on the approaches for the same.

sandeshkr419 · 2024-07-30T03:25:22Z

In my POC approach, I was trying to parse the request at the coordinator/request level itself into the appropriate star tree query/aggregator pair. Since cluster state is available on all nodes, and keeping in mind the introduction of more than 1 star tree for an index, it makes sense to do the parsing at the shard level directly to avoid unnecessary transport traffic.

As @bharath-techie suggested, I think we will have to introduce a new composite OriginalOrStarTreeQuery[StarTreeQuery(s), OriginalQuery] and OriginalOrStarTreeAggregator[StarTreeAggregator(s), OriginalAggregator]. That way we can decide at the segment level [1/ deleted docs, 2/ doc_count field, 3/ any other cost estimation] whether to execute the star tree query/aggregator or not.

Next, I need to figure out the entry point at the shard level for the query/aggregation rewrite in this case.

sandeshkr419 · 2024-08-13T03:22:58Z

Revisiting my POC changes, I started with resolving a single-level (no sub-aggregation) metric aggregation with/without a numeric term query.

Example query:

{
    "query": {
        "term": {
            "status": 200
        }
    },
    "size": 0,
    "aggs": {
        "sum_status": {
            "sum": {
                "field": "size"
            }
        }
    }
}

I ditched the separate aggregator class setup altogether and made use of the existing metric aggregators so that I can reuse them entirely. However, I still kept OriginalOrStarTreeQuery to preserve all the information required to resolve a star tree query in a single place.

This made the flow of code much simpler. Here is a draft PR with the approach (raised against a private fork because the dependent changes in #14809 are still in review): sandeshkr419#227
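A toy illustration of why an existing metric aggregator can be reused: for the example query above, a sum over the matching raw documents equals a sum over the matching pre-aggregated star-tree rows, so only the source of the values changes. The classes and the row layout below are stand-ins invented for this sketch and do not correspond to the code in the draft PR.

```python
from collections import defaultdict


class SumAggregator:
    """Stand-in for an existing metric aggregator: it only accumulates."""

    def __init__(self):
        self.total = 0.0

    def collect(self, value):
        self.total += value


def build_star_tree_rows(docs):
    """Pre-aggregate raw docs by the `status` dimension (index-time step)."""
    sums = defaultdict(float)
    for doc in docs:
        sums[doc["status"]] += doc["size"]
    return [{"status": s, "sum_size": total} for s, total in sums.items()]


def sum_via_default_path(docs, status):
    agg = SumAggregator()
    for doc in docs:
        if doc["status"] == status:  # term query on raw documents
            agg.collect(doc["size"])
    return agg.total


def sum_via_star_tree(rows, status):
    agg = SumAggregator()  # same aggregator, different input
    for row in rows:
        if row["status"] == status:  # term filter on the dimension
            agg.collect(row["sum_size"])
    return agg.total
```

The same reasoning does not carry over directly to every metric; for example an average would need both a stored sum and a stored count.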

sandeshkr419 · 2024-11-11T18:37:22Z

Merged #15289

Other query shapes have separate issues opened for further discussion.

sandeshkr419 added this to Performance Roadmap Jul 18, 2024

sandeshkr419 self-assigned this Jul 22, 2024

sandeshkr419 converted this from a draft issue Jul 22, 2024

github-actions bot added the untriaged label Jul 22, 2024

sandeshkr419 mentioned this issue Jul 22, 2024

[Search] [Star Tree] Option/Param to Disable search via star-tree #14872

Closed

sandeshkr419 added the Search Search query, autocomplete ...etc label Jul 23, 2024

github-project-automation bot added this to Search Project Board Jul 23, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Jul 23, 2024

mch2 removed the untriaged label Jul 24, 2024

sandeshkr419 mentioned this issue Aug 15, 2024

[META] [Star Tree] [Search] Supported Queries to resolve via star tree #15257

Open

sandeshkr419 added the Roadmap:Search Project-wide roadmap label label Sep 11, 2024

opensearch-infra bot added this to OpenSearch Roadmap Sep 11, 2024

github-project-automation bot moved this to New in OpenSearch Roadmap Sep 11, 2024

sandeshkr419 added the v2.18.0 Issues and PRs related to version 2.18.0 label Sep 11, 2024

sandeshkr419 changed the title ~~[Star Tree] Parse aggregation request to star tree query & star tree aggregation~~ [Star Tree] Parse aggregation request to resolve via star tree data structure Sep 12, 2024

sandeshkr419 changed the title ~~[Star Tree] Parse aggregation request to resolve via star tree data structure~~ [Star Tree][Search][RFC] Parse aggregation request to resolve via star tree data structure Sep 12, 2024

sandeshkr419 closed this as completed Nov 11, 2024

github-project-automation bot moved this from In Progress to Done in Performance Roadmap Nov 11, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Nov 11, 2024
