Additional Search Related Metrics #5443

Open
esatterwhite opened this issue Sep 24, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@esatterwhite
Collaborator

esatterwhite commented Sep 24, 2024

Is your feature request related to a problem? Please describe.
We have been experiencing periodic slowness and spikes in resource usage on our search nodes, and we can't figure out why. Several theories have been put forward, but we don't have a good way to see what the search nodes are doing in order to confirm or refute them. We are also unable to correlate specific operations with slowness or resource spikes well enough to rule any of the theories out.

Describe the solution you'd like

  • Histogram of the number of splits used during the execution of a search query. We are trying to gain better insight into the number of splits accessed per query. If we can establish that search queries generally target a high number of splits, which can slow down search, we know we need to adjust what we are doing to get more docs per split.

  • Counter tracking the number of keys evicted from the various caches, labeled by cache type. In our case, the split cache is (I think) the most important one. We routinely see fairly long periods of both high memory and high CPU usage, and the two curves for a given node are almost identical. Our split cache appears to be full, and we have a decent hit ratio, but we can't tell how frequently items are evicted from the cache or how many. If we could correlate a high rate of cache evictions (as opposed to just a spike in cache misses) with slowness or high resource utilization, that would tell us we may need to expand how much disk is allocated for caching.

  • Histogram observing search execution time, with buckets ranging upward of 30 seconds. We have resorted to using the robust monitoring infrastructure in our stack to figure out roughly how long a search query takes. However, not all Quickwit users can be expected to have that readily available or set up. Quickwit should be able to report how long a search request takes.
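
To make the request more concrete, here is a rough sketch of the shape of the three metrics, written in plain Python with no metrics library. Everything in it is an assumption for illustration: the metric names, the bucket boundaries, the cache type labels, and the `record_search` / `record_eviction` helpers are hypothetical and are not Quickwit's actual metrics or API.

```python
import bisect
from collections import Counter


class Histogram:
    """Fixed-bucket histogram: each bucket counts observations <= its upper bound."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot acts as the +Inf overflow bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        running, out = 0, []
        for bound, count in zip(self.upper_bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


# Hypothetical metric names and bucket boundaries, chosen only for illustration.
splits_per_query = Histogram([1, 5, 10, 50, 100, 500, 1000])
search_duration_seconds = Histogram([0.05, 0.1, 0.5, 1, 2.5, 5, 10, 20, 30, 60])
cache_evictions_total = Counter()  # keyed by a hypothetical cache type label


def record_search(num_splits, elapsed_seconds):
    """Record one finished search: splits touched and wall-clock duration."""
    splits_per_query.observe(num_splits)
    search_duration_seconds.observe(elapsed_seconds)


def record_eviction(cache_type, num_keys=1):
    """Count keys evicted from the cache identified by cache_type."""
    cache_evictions_total[cache_type] += num_keys


record_search(num_splits=120, elapsed_seconds=12.4)
record_search(num_splits=3, elapsed_seconds=0.08)
record_eviction("split_cache", 4)
record_eviction("fast_field_cache")
```

With something along these lines exported per node, a dashboard could plot the eviction rate per cache type next to CPU and memory, and compare the upper buckets of the two histograms against the periods of slowness.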

Describe alternatives you've considered
We rely on external monitoring tools to capture network timings, and we manually record the took value returned by the Elasticsearch API as an outward-facing indicator.
