Additional Search Related Metrics #5443

esatterwhite · 2024-09-24T17:08:51Z

Is your feature request related to a problem? Please describe.
We have been experiencing periodic slowness and spikes in resource usage on our search nodes, and we can't seem to figure out why. Several ideas and theories have been put forward, but we do not have a good way to see what the search nodes are doing in order to confirm or refute them. We are also unable to correlate specific operations with slowness or resource spikes well enough to rule any of the theories out.

Describe the solution you'd like

Histogram of the number of splits used during the execution of a search query. We are trying to gain better insight into the number of splits accessed per query. If we can show that search queries generally target a high number of splits, which can slow down search, we know we need to adjust what we are doing to get more docs per split.
Counter tracking the number of keys evicted from the various caches, labeled by cache type. In our case, the split cache is (I think) the more important one. We routinely see fairly long periods of both high memory and high CPU usage, and the graph curves for the two on a node are almost identical. Our split cache appears to be full, and we have a decent hit ratio, but we can't see how frequently items are evicted from the cache or how many. If we could correlate a high rate of cache evictions, rather than just a spike in cache misses, with slowness or high resource utilization, that would suggest we need to expand how much disk is allocated for caching.
Histogram observing search execution time, with buckets ranging upward of 30 seconds. We have resorted to using the robust monitoring infrastructure in our stack to figure out roughly how long a search query takes. However, not all users of Quickwit can be expected to have that readily available or set up; Quickwit should be able to report how long a search request takes.
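
To make the three requested metrics concrete, here is a minimal, dependency-free sketch of their shape. It is not Quickwit code: the metric names (`splits_per_query`, `search_duration_seconds`, `cache_evictions_total`), the bucket boundaries, and the `record_search` / `record_eviction` helpers are all hypothetical, chosen only to illustrate cumulative histogram buckets and a counter labeled by cache type.

```python
import bisect
from collections import Counter


class Histogram:
    """Tiny histogram with cumulative ("less than or equal") buckets."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per finite bucket plus a final +Inf overflow slot.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        running, out = 0, []
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


# Hypothetical metric names and bucket layouts (assumptions, not Quickwit's).
splits_per_query = Histogram([1, 5, 10, 50, 100, 500, 1000])
search_duration_seconds = Histogram([0.05, 0.25, 1, 5, 10, 30, 60])
cache_evictions_total = Counter()  # keyed by a cache-type label


def record_search(num_splits, duration_seconds):
    """Record one finished search: splits touched and wall-clock time."""
    splits_per_query.observe(num_splits)
    search_duration_seconds.observe(duration_seconds)


def record_eviction(cache_type, num_keys=1):
    """Count keys evicted from the cache identified by cache_type."""
    cache_evictions_total[cache_type] += num_keys


record_search(num_splits=42, duration_seconds=3.2)
record_search(num_splits=700, duration_seconds=31.0)
record_eviction("split_cache", 3)
```

With data in this shape, a dashboard could plot the eviction rate per cache type next to CPU and memory, and compare the distribution of splits per query against search duration.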

Describe alternatives you've considered
We rely on external monitoring tools to capture network timings, and we manually record the took value returned by the Elasticsearch API as an outward-facing indicator.
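
As a rough illustration of that workaround, the sketch below pulls the `took` field (reported in milliseconds by Elasticsearch-style search responses) out of a response body and stores it. The sample body, its values, and the `record_took` helper are made up for the example; in practice the body would come from whatever client issues the search.

```python
import json

# Illustrative response body; the values are invented for this example.
sample_body = (
    '{"took": 1834, "timed_out": false, '
    '"hits": {"total": {"value": 12}, "hits": []}}'
)

took_samples_ms = []


def record_took(response_body):
    """Parse a search response and keep its `took` value (milliseconds)."""
    took_ms = json.loads(response_body)["took"]
    took_samples_ms.append(took_ms)
    return took_ms


record_took(sample_body)
```

This only captures time as reported in the response, so it says nothing about split counts or cache behavior, which is why the metrics above are requested on the server side.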

esatterwhite added the enhancement label (New feature or request) on Sep 24, 2024

esatterwhite mentioned this issue Sep 24, 2024

Enhance metrics on the REST API #4076

Open

trinity-1686a mentioned this issue Sep 25, 2024

Add some additional search metrics #5447

Merged
