"counter cannot decrease in value" panic when bloom filtering is applied #13300
I fired up tracing and looked into a few queries with this issue. The traces are pretty huge, but eyeballing them, this error consistently occurs in `FilterChunkRefs` calls that have at least one `resultsCache` hit. This led me to run an experiment of setting:

```yaml
bloom_gateway:
  client:
    cache_results: false
```

and I am no longer able to reproduce this error. So now my next question is: is this a bug, or do I simply have the bloom gateway results cache misconfigured? The Loki chart does not/did not have an obvious way to configure the bloom gateway results cache.
I noticed we had configured the results cache for the bloom gateway to point at the chunks cache, so I updated that, from:

```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
```

to:

```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-results-cache.observability.svc
```

The error came back, so I started up an entirely different results cache (basically just cloned the results-cache statefulset with a new name/set of matching labels):

```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-bloom-results-cache.observability.svc
```

and am still getting the above error. So it looks like this is somehow related to having the results cache enabled, but it's not clear whether it's a config problem or a bug. I've turned off bloom gateway results caching for now.
I'm getting the same issue on 3.1.0. Disabling `results_cache` also fixes the issue for me.
I'm experiencing the same issue on 3.2.0. And the truth is that configuring the results cache for the bloom gateway is not obvious in the Helm chart.
Same issue on 3.3.0.
I disabled `results_cache` in the bloom-gateway, but I still get a panic in the index-gateway unless I avoid using the bloom filter entirely.
Describe the bug
In the index-gateway, `bloomquerier.FilterChunkRefs` appears to panic because more "postFilter" chunks are returned than "preFiltered" chunks. The actual panic is in the prometheus `counter.Add` call, which panics if the value passed to it is less than 0. With debug logging enabled, I am able to see that `preFilterChunks` is sometimes smaller than `postFilterChunks`. Glancing at the code, the panic occurs when `filteredChunks` is computed, comes out negative, and is added to the prometheus counter. Here are some examples of `FilterChunkRefs` calls that appear to return negative `filteredChunks` values. This causes the query to fail, but it doesn't occur consistently.
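For illustration, here is a minimal Go sketch of the failure mode (not Loki's actual code; the metric name and chunk counts are made up): prometheus/client_golang's `Counter.Add` panics with "counter cannot decrease in value" whenever it is handed a negative delta, which is what happens if `postFilterChunks` exceeds `preFilterChunks`.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Illustrative counter only; the real metric lives inside Loki's bloom querier.
	filteredChunks := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_bloom_filtered_chunks_total",
		Help: "Chunks removed by bloom filtering (illustrative only).",
	})

	// Hypothetical counts: if a response ever reports more chunks after
	// filtering than before, the computed delta goes negative.
	preFilterChunks, postFilterChunks := 10, 12
	delta := preFilterChunks - postFilterChunks // -2

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // recovered: counter cannot decrease in value
		}
	}()
	filteredChunks.Add(float64(delta)) // panics for any negative value
}
```

A single negative delta like this is enough to fail the whole query, which would line up with the reports above that the panic only shows up when `cache_results` is enabled.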
To Reproduce
We're running the latest pre-release build for 3.1.0, `k208-ede6941`, and I was also able to reproduce this issue in the last release, `k207`.

Here's a query we're running that triggers this. It only occurs when we're searching time periods that are covered by bloom filters, so the most recent data doesn't seem to trigger the issue, but if I run a query from `now-48h` to `now-47h` I can repro this.

Expected behavior
I would expect this query to run reliably, leveraging the bloom filters to filter chunks that aren't needed in the search.
Environment:
Screenshots, Promtail config, or terminal output
Here is our Loki config for reference: