Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Star Tree][Search][RFC] Parse aggregation request to resolve via star tree data structure #14871

Closed
sandeshkr419 opened this issue Jul 22, 2024 · 4 comments
Assignees
Labels
Roadmap:Search Project-wide roadmap label Search Search query, autocomplete ...etc v2.18.0 Issues and PRs related to version 2.18.0

Comments

@sandeshkr419
Copy link
Contributor

sandeshkr419 commented Jul 22, 2024

With support of star-tree composite of indices, we would to resolve certain aggregation & search paths via star-tree itself. Thinking of 2 possibilities:

  1. Introduce a new search request which explicitly specifies star tree in request construct - this means user learning a new capability or query type. The biggest pitfall in this approach is that the query has to be formed in a certain way as well that it executes via star-tree.
  2. Internally identify query shapes, which can be resolved using star tree and then internally parse the request to a StarTreeQuery & StarTreeAggregator - in this way the user does not has to worry about the queries which can be or cannot be resolved using star tree. In this way, the default search path can follow in case the query cannot be resolved via star tree. This does not involves user intervention once they have created the star tree during index creation. Discussion to disable search via star-tree is continued on [Search] [Star Tree] Option/Param to Disable search via star-tree #14872

I'm in support if approach 2/ as in this way we do not introduce a new overhead for search users to reframe their queries to star-tree. Also, as a feature in development, the full search capabilities of using star tree will be developed incrementally.

For 2/, we would want to keep the search request & search response intact. One such request/response to start building up the framework for star tree request execution will be an aggregation request with groupby/nested aggregation.

In a default search path execution, the query and aggregation path are independently executed. In the star tree code path, the query and aggregation will be tightly coupled and this requires decision making on setting up correct star-tree query & aggregation pair during request parsing itself.

Been thinking something similar to a poc I did here, to create a query/aggregation pair with request parsing itself.

Sample Search Aggregation Request:

{
    "size": 0,
    "aggs": {
                "group_by_clientip": {
                    "terms": {
                        "field": "status"
                    },
                    "aggs": {
                        "max_status": {
                            "sum": {
                                "field": "size"
                            }
                        }
                    }
        }
    }
}

Sample Response Expected:

(this is non-star tree response):

{
    "took": 1971,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1018,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "group_by_clientip": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 484,
            "buckets": [
                {
                    "key": 209,
                    "doc_count": 63,
                    "max_status": {
                        "value": 39844.0
                    }
                },
                {
                    "key": 217,
                    "doc_count": 58,
                    "max_status": {
                        "value": 35754.0
                    }
                },
                {
                    "key": 201,
                    "doc_count": 53,
                    "max_status": {
                        "value": 32154.0
                    }
                },
                {
                    "key": 220,
                    "doc_count": 53,
                    "max_status": {
                        "value": 28525.0
                    }
                },
                {
                    "key": 208,
                    "doc_count": 52,
                    "max_status": {
                        "value": 33543.0
                    }
                },
                {
                    "key": 210,
                    "doc_count": 52,
                    "max_status": {
                        "value": 29480.0
                    }
                },
                {
                    "key": 213,
                    "doc_count": 52,
                    "max_status": {
                        "value": 28841.0
                    }
                },
                {
                    "key": 202,
                    "doc_count": 51,
                    "max_status": {
                        "value": 31026.0
                    }
                },
                {
                    "key": 216,
                    "doc_count": 51,
                    "max_status": {
                        "value": 31242.0
                    }
                },
                {
                    "key": 204,
                    "doc_count": 49,
                    "max_status": {
                        "value": 30802.0
                    }
                }
            ]
        }
    }
}
@bharath-techie
Copy link
Contributor

Hi @sandeshkr419,
Thanks for raising the issue.

We need an extensive design for the following features :

1. Auto detection of queries that can be solved via star tree at request / index / shard level

  • From query and aggregation parts of the user query - Parse the dimensions and filters on each dimensions
  • From aggregations part of the user query - Parse the required metrics

We need more details on how we can solve this for one query / aggregation and how this can be extended to other queries / aggregations easily. [ Extensibility of the approach to newer queries / aggregations ]

And secondly validations + decisions on whether star tree index can be solved for the given query and if so which star tree index to be used.

  • Since we have built indexing flows from ground up with support for multiple star tree indices per index, we need a high level plan/design on how search flows will decide on which star tree to use based on the configuration.

2. Segment level decisions

Similar to 'IndexOrDocValuesQuery', we might need 'OriginalOrStarTreeQuery' as there are certain scenarios where original query must be used.
For example,
- It could be based on 'cost'
- Based on factors like 'cardinality' [ say star tree index has too many documents, or star tree docs > total segment docs ]
- Deleted documents are present in segment etc

In this case, we can't take the decision right at request parsing stage as proposed in this doc as the aggregators to be used depends on the query used ( as star tree documents and segment documents differ , and also aggregators differ )
We might need to take decision for each leaf reader context based on the query or any other method where we need to take decisions of which aggregators used based on which query is used for the segment

3. Response correctness and scenarios

  • Star tree index can be used to solve nested aggregations, multiple aggregations in an efficient way as a single document contains keys of the buckets [dimensions] and resultant metrics of the keys.

We currently return nested response to users when users query with nested aggregations and we have to pass via multiple aggregators.
We can cover how we're handling the same with star tree query/aggregator. The response structure must be same.

  • Size
    Some of the aggregations like terms aggregation / multi terms aggregations returns the top n results as buckets based on size parameter in request ordered based on the doc count.

  • Ordering
    Most of the aggregations seems to be ordered based on doc count etc. We need to check the ordering and see how it works with single aggregations, multiple and nested aggregations.

We need a plan on how we are solving the same in star tree query flow using star tree index structures.

4. Star tree query algorithm

This can cover the actual algorithm that is used to traverse the star tree to solve queries and the limitations of the algorithms etc.

If possible, scorer and weight structure of the query which is used along with the algorithm.

Some of the things to cover :

  • We need to check how conversions work for the star tree fields and how we can query the same.

During indexing , dimensions are converted to long regardless of the original index field's data type ( similar to current fields as well in SortedNDV ).
Metrics are saved based on metric aggregator type ( 'sum', 'min', 'max' saves the values as double whereas count is stored as 'long' )

  • Details on how we query and combine the results of dimensions present as part of star tree and for dimensions not present as part of star tree.

Please raise issues for the above and more such broader issues that needs to be solved and lets close on the approaches for the same.

@mch2 mch2 removed the untriaged label Jul 24, 2024
@sandeshkr419
Copy link
Contributor Author

sandeshkr419 commented Jul 30, 2024

In my poc approach, I was trying to parse the request at coordinator/request level itself to appropriate star tree query/aggregator pair. Since cluster state is available on all nodes, and keep in mind introduction of more than 1 star tree for an index, it makes sense to do the parsing at shard level directly to avoid unnecessary transport traffic.

As @bharath-techie suggested I think we will have to introduce a new composite OriginalOrStarTreeQuery[StarTreeQuery(s), OriginalQuery] and OriginalOrStarTreeAggregator[StarTreeAggregator(s), OriginalAggregator], in that way we can take decision at a segment level [1/ deleted docs, 2/ doc_count field, 3/ any other cost estimation] to whether execute star tree query/aggregator or not.

Need to figure out next the entry point at shard level for query/aggregation rewrite in this case.

@sandeshkr419
Copy link
Contributor Author

Revisiting my POC changes, I started with resolving a single level (no sub) metric aggregation with/without a numeric terms query.

Example query:

{
    "query": {
        "term": {
            "status": 200
        }
    },
    "size": 0,
    "aggs": {
                        "sum_status": {
                            "sum": {
                                "field": "size"
                            }
                        }
                    }
}

I ditched using a separate aggregator class setup altogether and made use of existing metric aggregators so that I can utilize them entirely. However, I still kept the usage of OriginalOrStarTreeQuery to preserve al the information required to resolve a star tree query t a single place.

This made the flow of code much simpler. Here is a draft PR with the approach (raised against a private fork because of depending changes #14809 are still in review): sandeshkr419#227

@sandeshkr419 sandeshkr419 added the Roadmap:Search Project-wide roadmap label label Sep 11, 2024
@sandeshkr419 sandeshkr419 added the v2.18.0 Issues and PRs related to version 2.18.0 label Sep 11, 2024
@sandeshkr419 sandeshkr419 changed the title [Star Tree] Parse aggregation request to star tree query & star tree aggregation [Star Tree] Parse aggregation request to resolve via star tree data structure Sep 12, 2024
@sandeshkr419 sandeshkr419 changed the title [Star Tree] Parse aggregation request to resolve via star tree data structure [Star Tree][Search][RFC] Parse aggregation request to resolve via star tree data structure Sep 12, 2024
@sandeshkr419
Copy link
Contributor Author

Merged #15289

Other query shapes have separate issues opened for further discussion.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Nov 11, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Roadmap:Search Project-wide roadmap label Search Search query, autocomplete ...etc v2.18.0 Issues and PRs related to version 2.18.0
Projects
Status: New
Status: Done
Status: Done
Development

No branches or pull requests

3 participants