[k8s] Fix logical race conditions in kubernetes_secrets provider #6623

Open · pkoutsovasilis wants to merge 2 commits into main from k8s/secret_provider_cache_tmp
Conversation

@pkoutsovasilis pkoutsovasilis commented Jan 29, 2025

What does this PR do?

This PR refactors the kubernetes_secrets provider to eliminate its logical race conditions, and adds a brand-new set of unit tests.

Initially, the issue seemed to stem from misuse or absence of synchronisation primitives, but after deeper analysis it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies, because the different actors managing the secret lifecycle had overlapping responsibilities.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:

  • Using a Kubernetes reflector (watch-based mechanism) is not an option because it would require listing and watching all secrets, which is often a non-starter for users.
  • Instead, we must maintain our own caching mechanism that periodically refreshes only the referenced Kubernetes secrets.

With this in mind, the provider behaviour is now as follows:

Cache Disabled Mode:

  • When caching is disabled, the provider simply reads secrets directly from the Kubernetes API server.

Cache Enabled Mode:

  • When caching is enabled, the provider stores secrets in a cache whose entries expire based on the configured TTL (time-to-live) and a lastAccess field on each cache entry (a sketch of such an entry follows this list).
  • The provider has two primary actors, the cache actor and the fetch actor, each with well-defined responsibilities.
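
For illustration only, here is a minimal sketch of what such a cache entry could look like; the field and type names are placeholders, not the PR's actual identifiers:

    package kubernetessecrets

    import "time"

    // secretCacheEntry is a hypothetical illustration of a cached secret: the
    // resolved value, when it was last read from the Kubernetes API server,
    // and when a Fetch last asked for it.
    type secretCacheEntry struct {
        value        string
        apiFetchTime time.Time // when the value was fetched from the API server
        lastAccess   time.Time // when the entry was last returned by Fetch
    }

    // expired reports whether the entry has outlived the configured TTL,
    // measured against lastAccess rather than apiFetchTime.
    func (e secretCacheEntry) expired(ttl time.Duration, now time.Time) bool {
        return now.Sub(e.lastAccess) > ttl
    }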

Cache Actor Responsibilities:

  1. Signal expiration of items: When a secret expires, the cache actor signals that a fetch should occur to reinsert the key into the cache, ensuring continued refreshing.
  2. Detect secret updates and signal changes: When the cache actor detects a secret value change, it signals the ContextProviderComm.
  3. Conditionally update lastAccess:
    • If the secret has changed, update lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
    • In any other case, do not update lastAccess and let the entry "age" as it should (one refresh pass of the cache actor is sketched right after this list).
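
To make this division of labour concrete, here is a rough sketch of a single refresh pass of the cache actor, reusing the illustrative secretCacheEntry above; the store interface, the fetch callback, and the signal callbacks are assumptions for illustration, not the PR's API:

    package kubernetessecrets

    import (
        "context"
        "time"
    )

    // secretStore is only the surface this sketch needs; the PR ships its own
    // custom expiration cache with conditional store operations.
    type secretStore interface {
        List() map[string]secretCacheEntry
        Get(key string) (secretCacheEntry, bool)
        Delete(key string)
        // ConditionalSet stores entry unless a more recent entry is already present.
        ConditionalSet(key string, entry secretCacheEntry)
    }

    // refreshOnce walks the cache, re-reads each referenced secret from the API
    // server, signals expirations and value changes, and bumps lastAccess only
    // when the value actually changed.
    func refreshOnce(
        ctx context.Context,
        store secretStore,
        ttl time.Duration,
        fetchFromAPI func(ctx context.Context, key string) (string, error),
        signalExpired func(key string),
        signalChanged func(key string),
    ) {
        now := time.Now()
        for key, entry := range store.List() {
            if entry.expired(ttl, now) {
                // 1. expired: remove it and signal so that a later Fetch
                //    re-inserts the key and refreshing continues.
                store.Delete(key)
                signalExpired(key)
                continue
            }

            newValue, err := fetchFromAPI(ctx, key)
            if err != nil {
                // keep the old value; the entry keeps ageing via lastAccess
                continue
            }

            if newValue != entry.value {
                // 2. + 3. value changed: signal the change and refresh lastAccess
                //    so the new value is not evicted before consumers pick it up.
                store.ConditionalSet(key, secretCacheEntry{
                    value:        newValue,
                    apiFetchTime: now,
                    lastAccess:   now,
                })
                signalChanged(key)
            }
            // unchanged: leave lastAccess alone and let the entry age normally.
        }
    }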

Fetch Actor Responsibilities:

  1. Retrieve secrets from the cache:
    • If present, return the value.
    • If missing, fetch from the Kubernetes API.
  2. Insert fetched secrets into the cache if there isn't a more recent version of the secret already in it (which can happen via the cache actor or a parallel fetch actor).
  3. Always update lastAccess when an entry is accessed, to prevent unintended expiration (the fetch path is sketched right after this list).
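
And the corresponding sketch of the fetch path, under the same assumptions (secretStore and secretCacheEntry are the illustrative types from the sketches above, not the PR's code):

    package kubernetessecrets

    import (
        "context"
        "time"
    )

    // fetchSecret resolves a key via the cache first and the API server second,
    // conditionally inserts what it fetched, and always refreshes lastAccess.
    func fetchSecret(
        ctx context.Context,
        store secretStore,
        fetchFromAPI func(ctx context.Context, key string) (string, error),
        key string,
    ) (string, error) {
        now := time.Now()

        if entry, ok := store.Get(key); ok {
            // cache hit: bump lastAccess so the entry does not expire while in use.
            entry.lastAccess = now
            store.ConditionalSet(key, entry)
            return entry.value, nil
        }

        // cache miss: go to the Kubernetes API server.
        value, err := fetchFromAPI(ctx, key)
        if err != nil {
            return "", err
        }

        // insert unless the cache actor or a parallel Fetch already stored a
        // more recent entry for this key.
        store.ConditionalSet(key, secretCacheEntry{
            value:        value,
            apiFetchTime: now,
            lastAccess:   now,
        })
        return value, nil
    }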

Considerations:

  • No global locks: Store operations are the only critical path, so the cache and fetch actors do not block each other.
  • Conditional updates: Since the cache state can change between the moment an actor reads and the moment it writes, all updates use conditional store operations whose checks run inside that critical path (a minimal conditional store is sketched right after this list).
  • Custom store implementation: The existing ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.
  • Optimised memory management: The prior implementation copied the entire cache map on every update to avoid retaining memory in Go map buckets. I believe this was based on a misunderstanding of Go internals and was a premature optimisation; if needed in the future, it can be revisited in a more controlled manner.
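
A minimal sketch of what such a conditional store operation could look like, reusing the illustrative secretCacheEntry above; the exact replacement predicate in the PR differs per call site (see the review discussion below), so this only shows the shape, not the actual code:

    package kubernetessecrets

    import "sync"

    // conditionalCache illustrates the "decide inside the critical section" idea:
    // the mutex is held only for the compare-and-store, never across API calls.
    type conditionalCache struct {
        mu    sync.Mutex
        items map[string]secretCacheEntry
    }

    func newConditionalCache() *conditionalCache {
        return &conditionalCache{items: make(map[string]secretCacheEntry)}
    }

    // ConditionalSet stores candidate unless the cache already holds an entry
    // that was fetched from the API server more recently.
    func (c *conditionalCache) ConditionalSet(key string, candidate secretCacheEntry) bool {
        c.mu.Lock()
        defer c.mu.Unlock()

        existing, ok := c.items[key]
        if !ok {
            // no existing secret in the cache, thus add it
            c.items[key] = candidate
            return true
        }
        if existing.apiFetchTime.After(candidate.apiFetchTime) {
            // a concurrent actor already stored a fresher entry; keep it
            return false
        }
        c.items[key] = candidate
        return true
    }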

PS: as the main changes of this PR are captured by the commit a549728, I consider this PR to be aligned with the Pull Requests policy

Why is it important?

This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

  • Secrets do not expire prematurely due to logical race conditions.
  • Updates are properly signaled to consuming components.
  • Performance is improved through minimal locking and by avoiding unnecessary memory allocations.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the change-log tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally

  1. Run unit tests to validate the new caching behaviour:
    go test ./internal/pkg/composable/providers/kubernetessecrets/...

Related issues

@pkoutsovasilis pkoutsovasilis added the bug, Team:Elastic-Agent-Control-Plane, and backport-8.x labels on Jan 29, 2025
@pkoutsovasilis pkoutsovasilis self-assigned this Jan 29, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch 3 times, most recently from e001b10 to 9093b52 on January 29, 2025 09:08
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review January 30, 2025 07:05
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner January 30, 2025 07:05
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch from 9093b52 to 3e3788e on January 30, 2025 17:48
secretKey := tokens[3]

// Wait for the provider to be initialized
<-p.running
Contributor

p.running will be closed, but we have no guarantee the refreshCache goroutine was even started;
we should probably close it once we have finished the first iteration of the refresh, right?

Contributor Author

<-p.running is only here to make any Fetch wait for p.client to be initialised. We don't need to wait for the first iteration to finish, because during the first iteration our cache would be empty; keys are inserted into the cache by Fetch. Does that make sense? 🙂


if !p.config.DisableCache {
    go p.updateSecrets(ctx, comm)
    go p.refreshCache(ctx, comm)
Contributor

I will add the comment here as well:
p.running will be closed, but we have no guarantee the refreshCache goroutine was even started;
we should probably close it once we have finished the first iteration of the refresh, right?
In case the cache is disabled, we can close it right away.

Contributor Author

Again, <-p.running is only there to make any Fetch wait for p.client to be initialised. We do close p.running after we invoke go p.refreshCache(ctx, comm), and during the first iteration the cache will be empty, so there is no extra safety added by closing it after that, right?
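
For readers following the thread: the pattern being discussed is the standard "close a channel to publish readiness" idiom. A standalone sketch with placeholder names (not the provider's real code):

    package main

    import (
        "fmt"
        "time"
    )

    // provider mimics the shape under discussion: running is closed exactly
    // once, after the client is initialised, so every later Fetch can proceed.
    type provider struct {
        running chan struct{}
        client  string // stands in for the real Kubernetes client
    }

    // Run initialises the client (and would start the cache goroutines here),
    // then closes running to unblock any Fetch that is waiting.
    func (p *provider) Run() {
        p.client = "initialised"
        close(p.running) // a closed channel never blocks receivers again
    }

    // Fetch blocks only until Run has signalled that the client exists.
    func (p *provider) Fetch(key string) string {
        <-p.running
        return fmt.Sprintf("would fetch %q using the %s client", key, p.client)
    }

    func main() {
        p := &provider{running: make(chan struct{})}
        go func() {
            time.Sleep(10 * time.Millisecond) // simulate startup latency
            p.Run()
        }()
        fmt.Println(p.Fetch("kubernetes_secrets.default.my-secret.token"))
    }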

// no existing secret in the cache thus add it
return true
}
if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) {
Contributor

Could we compare (just for readability)
sd.apiFetchTime.After(existing.apiFetchTime)
instead? This reads better: update if the current value was fetched after the existing one was.

Contributor Author

If I go with sd.apiFetchTime.After(existing.apiFetchTime) and the times of sd and existing happen to be exactly equal (I know it is very unlikely, but it can happen), then I would lose a secret update even though the value has changed. Makes sense? 🙂

// when deleted from the map. More importantly working with a pointer makes the entry in the map bucket, that doesn't
// get deallocated, to utilise only 8 bytes on a 64-bit system.
type expirationCache struct {
sync.RWMutex
Contributor

Are we sure we need an RWMutex? It does not necessarily add any performance gain, and if writes are frequent it can even result in worse performance.
I'm leaning more towards a normal Mutex, or we should benchmark to see the benefits of having an RWMutex.

Contributor Author

The choice of RWMutex was based solely on the expectation that access would be mostly read-based: each secret value is inserted once, either during the cache refresh or during a Fetch, and from then on every lookup hits the cache for the duration of the TTL. Having a cache should make us read-heavy; otherwise, on each write we would also be fetching from the API server, which kind of defeats the purpose? I can still switch to a plain Mutex, just let me know 🙂
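
If it helps settle the question, a quick benchmark along these lines, dropped into a _test.go file, would show whether RWMutex pays off for read-mostly access (illustrative only, not part of the PR):

    package kubernetessecrets

    import (
        "sync"
        "testing"
    )

    // Run with: go test -bench=BenchmarkRead -cpu=1,4,8
    // Both benchmarks do the same map lookup; only the lock type differs.
    var benchItems = map[string]string{"default/my-secret/token": "value"}

    func BenchmarkReadWithMutex(b *testing.B) {
        var mu sync.Mutex
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.Lock()
                _ = benchItems["default/my-secret/token"]
                mu.Unlock()
            }
        })
    }

    func BenchmarkReadWithRWMutex(b *testing.B) {
        var mu sync.RWMutex
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.RLock()
                _ = benchItems["default/my-secret/token"]
                mu.RUnlock()
            }
        })
    }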

Successfully merging this pull request may close these issues.

Deletion race condition when fetching secret from the Kubernetes secrets provider cache