[k8s] Fix logical race conditions in kubernetes_secrets provider #6623
base: main
Conversation
Force-pushed from e001b10 to 9093b52
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Force-pushed from 9093b52 to 3e3788e
Quality Gate passed
secretKey := tokens[3]

// Wait for the provider to be initialized
<-p.running
p.running will be closed, but we have no guarantee that the refreshCache goroutine was even started. We should probably close it only once the first iteration of the refresh has finished, right?
<-p.running is only here to make any Fetch wait for p.client to be initialised. We don't need to wait for the first iteration to finish because during the first iteration our cache would be empty; keys are inserted into the cache by Fetch. Does that make sense? 🙂
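As a tiny sketch of the pattern this reply describes, assuming running is a chan struct{} closed once the client is ready (the type and field names below are illustrative, not the provider's real fields):

```go
package main

import "fmt"

// secretFetcher is an illustrative stand-in: running gates Fetch on client
// initialisation and nothing else.
type secretFetcher struct {
	running chan struct{} // closed once the client has been created
	client  any
}

// Fetch blocks until running is closed, i.e. until client is usable; it does
// not wait for any cache refresh, since the cache starts empty anyway.
func (f *secretFetcher) Fetch(key string) (string, bool) {
	<-f.running
	// ... resolve the key using f.client and/or the cache ...
	return "", false
}

func main() {
	f := &secretFetcher{running: make(chan struct{})}
	go func() {
		f.client = struct{}{} // initialise the client ...
		close(f.running)      // ... then let Fetch proceed
	}()
	fmt.Println(f.Fetch("kubernetes_secrets.default.my-secret.password"))
}
```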
if !p.config.DisableCache {
	go p.updateSecrets(ctx, comm)
	go p.refreshCache(ctx, comm)
Will add the comment here as well: p.running will be closed, but we have no guarantee that the refreshCache goroutine was even started. We should probably close it only once the first iteration of the refresh has finished, right? In case the cache is disabled we can close it right away.
Again, <-p.running is only there to make any Fetch wait for p.client to be initialised. We do now close p.running after we invoke go p.refreshCache(ctx, comm), and during the first iteration the cache will be empty, so there is no extra safety added by closing it after that iteration, right?
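Complementing the sketch above, here is the Run-side ordering this reply describes, again with illustrative names and a hypothetical newK8sClient helper rather than the provider's actual code:

```go
package main

import (
	"context"
	"time"
)

// provider is illustrative only; field names, the Run signature and
// newK8sClient are assumptions made for this sketch.
type provider struct {
	running chan struct{}
	client  any
	config  struct{ DisableCache bool }
}

func newK8sClient() (any, error) { return struct{}{}, nil } // hypothetical helper

// refreshCache is the background cache actor; its first iteration operates on
// an empty cache, which is why Run does not wait for it.
func (p *provider) refreshCache(ctx context.Context) { <-ctx.Done() }

// Run initialises the client, starts the cache actor when the cache is
// enabled, and only then closes running to unblock pending Fetch calls.
func (p *provider) Run(ctx context.Context) error {
	client, err := newK8sClient()
	if err != nil {
		return err
	}
	p.client = client

	if !p.config.DisableCache {
		go p.refreshCache(ctx)
	}
	close(p.running) // the client is ready; that is all running guards

	<-ctx.Done()
	return nil
}

func main() {
	p := &provider{running: make(chan struct{})}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_ = p.Run(ctx)
}
```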
// no existing secret in the cache thus add it
return true
}
if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) {
Could we compare (just for readability) sd.apiFetchTime.After(existing.apiFetchTime)? That reads better: update if the current value was updated after the existing one was.
If I go with sd.apiFetchTime.After(existing.apiFetchTime) and the fetch times of sd and existing happen to be the same (I know it is very unlikely, but it can happen), then I would lose a secret even if the value has changed. Makes sense? 🙂
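To illustrate the trade-off discussed here, a minimal sketch of such a merge predicate (shouldUpdate, secretsData and the field set are assumptions for the sketch, not the PR's exact code): value equality is checked in addition to the fetch time precisely so that an update carrying an identical timestamp is not dropped.

```go
package main

import (
	"fmt"
	"time"
)

// secretsData mirrors the shape of a cache entry as far as this discussion
// needs; the real struct has more fields.
type secretsData struct {
	value        string
	apiFetchTime time.Time
}

// shouldUpdate replaces the cached entry only when the freshly fetched value
// differs AND the cached entry was not fetched after "now". Comparing only
// apiFetchTime would drop an update whose timestamp equals the existing one
// even though the value changed.
func shouldUpdate(existing *secretsData, apiSecretValue string, now time.Time) bool {
	if existing == nil {
		// no existing secret in the cache thus add it
		return true
	}
	return existing.value != apiSecretValue && !existing.apiFetchTime.After(now)
}

func main() {
	now := time.Now()
	cached := &secretsData{value: "old", apiFetchTime: now} // same timestamp as the new fetch
	fmt.Println(shouldUpdate(cached, "new", now))           // true: the value changed, so update
	fmt.Println(shouldUpdate(cached, "old", now))           // false: nothing to do
}
```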
// when deleted from the map. More importantly working with a pointer makes the entry in the map bucket, that doesn't
// get deallocated, to utilise only 8 bytes on a 64-bit system.
type expirationCache struct {
	sync.RWMutex
Are we sure we need RWMutex? It does not add any performance gain here, and if writes are frequent it can even result in worse performance. I'm leaning more towards a normal Mutex, or we should benchmark to see the benefits of having RWMutex.
The choice of RWMutex was made solely on the expectation that this cache would be read-heavy: each secret value is inserted once, either during the refresh of the cache or during a Fetch, and then every lookup within the TTL is a cache hit. Having a cache should make us read-heavy; otherwise, on each write we would also be fetching from the API server, which kinda defeats the purpose? I can still switch to a plain Mutex, just let me know 🙂
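A minimal sketch of the read-mostly usage this reply assumes (the type layout is simplified; the real cache entries also carry fetch/access timestamps, which this sketch omits): readers share the read lock, while the comparatively rare writes take the exclusive lock.

```go
package main

import (
	"fmt"
	"sync"
)

// expirationCache here only illustrates the locking pattern under discussion.
type expirationCache struct {
	sync.RWMutex
	items map[string]string
}

// Get takes the read lock, so concurrent Fetch calls that hit the cache do not
// block one another.
func (c *expirationCache) Get(key string) (string, bool) {
	c.RLock()
	defer c.RUnlock()
	v, ok := c.items[key]
	return v, ok
}

// Set takes the exclusive lock; it runs only on a cache miss or when the
// periodic refresh sees a changed value, so writes should stay rare.
func (c *expirationCache) Set(key, value string) {
	c.Lock()
	defer c.Unlock()
	c.items[key] = value
}

func main() {
	c := &expirationCache{items: map[string]string{}}
	c.Set("default/my-secret/password", "s3cr3t")
	fmt.Println(c.Get("default/my-secret/password"))
}
```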
What does this PR do?
This PR refactors the implementation of the kubernetes_secrets provider to eliminate its logical race conditions, and it adds brand new unit tests.

Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of the different actors managing the secret lifecycle.
To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:
With this in mind, the provider behaviour is now as follows:
Cache Disabled Mode:
Cache Enabled Mode:
Cache Actor Responsibilities:
- When the secret value fetched from the API differs from the cached one, update the entry and its lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
- When the value is unchanged, leave lastAccess untouched and let the entry "age" as it should.

Fetch Actor Responsibilities:
- Update lastAccess when an entry is accessed to prevent unintended expiration (a sketch of this interplay follows the Considerations below).

Considerations:
- ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.

PS: as the main changes of this PR are captured by commit a549728, I consider this PR to be aligned with the Pull Requests policy.
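A minimal sketch of the cache-actor/fetch-actor split described above, under stated assumptions (the names, the plain Mutex and the TTL check are illustrative, not the provider's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry and store are illustrative; field names follow the description above
// but are not taken from the provider's actual code.
type entry struct {
	value      string
	lastAccess time.Time
}

type store struct {
	mu    sync.Mutex
	items map[string]*entry
	ttl   time.Duration
}

// refresh is the cache-actor side of the conditional insertion: lastAccess is
// bumped only when the API returned a different value, so an unchanged entry
// that nobody reads keeps ageing and can eventually expire.
func (s *store) refresh(key, apiValue string, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.items[key]; ok && e.value != apiValue {
		e.value = apiValue
		e.lastAccess = now // give the fetch actor time to pick up the new value
	}
}

// get is the fetch-actor side: every successful access bumps lastAccess so an
// entry that is still in use does not expire underneath its consumer.
func (s *store) get(key string, now time.Time) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.items[key]
	if !ok || now.Sub(e.lastAccess) > s.ttl {
		return "", false
	}
	e.lastAccess = now
	return e.value, true
}

func main() {
	s := &store{items: map[string]*entry{}, ttl: time.Minute}
	now := time.Now()
	s.items["default/my-secret/password"] = &entry{value: "old", lastAccess: now}
	s.refresh("default/my-secret/password", "new", now.Add(time.Second))
	fmt.Println(s.get("default/my-secret/password", now.Add(2*time.Second)))
}
```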
Why is it important?
This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

Checklist
- Added a changelog entry under ./changelog/fragments using the change-log tool

Disruptive User Impact
This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally
go test ./internal/pkg/composable/providers/kubernetessecrets/...
Related issues