[k8s] Fix logical race conditions in kubernetes_secrets provider #6623

Open · pkoutsovasilis wants to merge 2 commits into main from k8s/secret_provider_cache_tmp
Conversation

@pkoutsovasilis pkoutsovasilis commented Jan 29, 2025

What does this PR do?

This PR refactors the kubernetes_secrets provider to eliminate its logical race conditions, and adds a brand-new set of unit tests.

Initially, the issue seemed to stem from misuse or absence of synchronisation primitives, but after deeper analysis it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies, because the different actors managing the secret lifecycle had overlapping responsibilities.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:

  • Using a Kubernetes reflector (watch-based mechanism) is not an option because it would require listing and watching all secrets, which is often a non-starter for users.
  • Instead, we must maintain our own caching mechanism that periodically refreshes only the referenced Kubernetes secrets.

With this in mind, the provider behaviour is now as follows:

Cache Disabled Mode:

  • When caching is disabled, the provider simply reads secrets directly from the Kubernetes API server.

Cache Enabled Mode:

  • When caching is enabled, the provider stores secrets in a cache whose entries expire based on the configured TTL (time-to-live) and a lastAccess field on each cache entry (a sketch of such an entry follows this list).
  • The provider has two primary actors, the cache actor and the fetch actor, each with well-defined responsibilities.
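
For illustration only, here is a minimal sketch of what such a cache entry could look like; the field and type names are placeholders, not the PR's actual identifiers:

    package kubernetessecrets

    import "time"

    // secretCacheEntry is a hypothetical illustration of a cached secret: the
    // resolved value, when it was last read from the Kubernetes API server,
    // and when a Fetch last asked for it.
    type secretCacheEntry struct {
        value        string
        apiFetchTime time.Time // when the value was fetched from the API server
        lastAccess   time.Time // when the entry was last returned by Fetch
    }

    // expired reports whether the entry has outlived the configured TTL,
    // measured against lastAccess rather than apiFetchTime.
    func (e secretCacheEntry) expired(ttl time.Duration, now time.Time) bool {
        return now.Sub(e.lastAccess) > ttl
    }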

Cache Actor Responsibilities:

  1. Signal expiration of items: When a secret expires, the cache actor signals that a fetch should occur to reinsert the key into the cache, ensuring continued refreshing.
  2. Detect secret updates and signal changes: When the cache actor detects a secret value change, it signals the ContextProviderComm.
  3. Conditionally update lastAccess:
    • If the secret has changed, update lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
    • In any other case, do not update lastAccess and let the entry "age" as it should (one refresh pass of the cache actor is sketched right after this list).
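
To make this division of labour concrete, here is a rough sketch of a single refresh pass of the cache actor, reusing the illustrative secretCacheEntry above; the store interface, the fetch callback, and the signal callbacks are assumptions for illustration, not the PR's API:

    package kubernetessecrets

    import (
        "context"
        "time"
    )

    // secretStore is only the surface this sketch needs; the PR ships its own
    // custom expiration cache with conditional store operations.
    type secretStore interface {
        List() map[string]secretCacheEntry
        Get(key string) (secretCacheEntry, bool)
        Delete(key string)
        // ConditionalSet stores entry unless a more recent entry is already present.
        ConditionalSet(key string, entry secretCacheEntry)
    }

    // refreshOnce walks the cache, re-reads each referenced secret from the API
    // server, signals expirations and value changes, and bumps lastAccess only
    // when the value actually changed.
    func refreshOnce(
        ctx context.Context,
        store secretStore,
        ttl time.Duration,
        fetchFromAPI func(ctx context.Context, key string) (string, error),
        signalExpired func(key string),
        signalChanged func(key string),
    ) {
        now := time.Now()
        for key, entry := range store.List() {
            if entry.expired(ttl, now) {
                // 1. expired: remove it and signal so that a later Fetch
                //    re-inserts the key and refreshing continues.
                store.Delete(key)
                signalExpired(key)
                continue
            }

            newValue, err := fetchFromAPI(ctx, key)
            if err != nil {
                // keep the old value; the entry keeps ageing via lastAccess
                continue
            }

            if newValue != entry.value {
                // 2. + 3. value changed: signal the change and refresh lastAccess
                //    so the new value is not evicted before consumers pick it up.
                store.ConditionalSet(key, secretCacheEntry{
                    value:        newValue,
                    apiFetchTime: now,
                    lastAccess:   now,
                })
                signalChanged(key)
            }
            // unchanged: leave lastAccess alone and let the entry age normally.
        }
    }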

Fetch Actor Responsibilities:

  1. Retrieve secrets from the cache:
    • If present, return the value.
    • If missing, fetch from the Kubernetes API.
  2. Insert fetched secrets into the cache if there isn't a more recent version of the secret already in it (which can happen via the cache actor or a parallel fetch actor).
  3. Always update lastAccess when an entry is accessed, to prevent unintended expiration (the fetch path is sketched right after this list).
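
And the corresponding sketch of the fetch path, under the same assumptions (secretStore and secretCacheEntry are the illustrative types from the sketches above, not the PR's code):

    package kubernetessecrets

    import (
        "context"
        "time"
    )

    // fetchSecret resolves a key via the cache first and the API server second,
    // conditionally inserts what it fetched, and always refreshes lastAccess.
    func fetchSecret(
        ctx context.Context,
        store secretStore,
        fetchFromAPI func(ctx context.Context, key string) (string, error),
        key string,
    ) (string, error) {
        now := time.Now()

        if entry, ok := store.Get(key); ok {
            // cache hit: bump lastAccess so the entry does not expire while in use.
            entry.lastAccess = now
            store.ConditionalSet(key, entry)
            return entry.value, nil
        }

        // cache miss: go to the Kubernetes API server.
        value, err := fetchFromAPI(ctx, key)
        if err != nil {
            return "", err
        }

        // insert unless the cache actor or a parallel Fetch already stored a
        // more recent entry for this key.
        store.ConditionalSet(key, secretCacheEntry{
            value:        value,
            apiFetchTime: now,
            lastAccess:   now,
        })
        return value, nil
    }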

Considerations:

  • No global locks: Store operations are the only critical path, so the cache and fetch actors do not block each other.
  • Conditional updates: Since the cache state can change between the moment an actor reads and the moment it writes, all updates use conditional store operations whose checks run inside that critical path (a minimal conditional store is sketched right after this list).
  • Custom store implementation: The existing ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.
  • Optimised memory management: The prior implementation copied the entire cache map on every update to avoid retaining memory in Go map buckets. I believe this was based on a misunderstanding of Go internals and was a premature optimisation; if needed in the future, it can be revisited in a more controlled manner.
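
A minimal sketch of what such a conditional store operation could look like, reusing the illustrative secretCacheEntry above; the exact replacement predicate in the PR differs per call site (see the review discussion below), so this only shows the shape, not the actual code:

    package kubernetessecrets

    import "sync"

    // conditionalCache illustrates the "decide inside the critical section" idea:
    // the mutex is held only for the compare-and-store, never across API calls.
    type conditionalCache struct {
        mu    sync.Mutex
        items map[string]secretCacheEntry
    }

    func newConditionalCache() *conditionalCache {
        return &conditionalCache{items: make(map[string]secretCacheEntry)}
    }

    // ConditionalSet stores candidate unless the cache already holds an entry
    // that was fetched from the API server more recently.
    func (c *conditionalCache) ConditionalSet(key string, candidate secretCacheEntry) bool {
        c.mu.Lock()
        defer c.mu.Unlock()

        existing, ok := c.items[key]
        if !ok {
            // no existing secret in the cache, thus add it
            c.items[key] = candidate
            return true
        }
        if existing.apiFetchTime.After(candidate.apiFetchTime) {
            // a concurrent actor already stored a fresher entry; keep it
            return false
        }
        c.items[key] = candidate
        return true
    }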

PS: as the main changes of this PR are captured by the commit a549728, I consider this PR to be aligned with the Pull Requests policy

Why is it important?

This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

  • Secrets do not expire prematurely due to logical race conditions.
  • Updates are properly signaled to consuming components.
  • Performance is improved through minimal locking and by avoiding unnecessary memory allocations.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the change-log tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally

  1. Run unit tests to validate the new caching behaviour:
    go test ./internal/pkg/composable/providers/kubernetessecrets/...

Related issues

@pkoutsovasilis pkoutsovasilis added the bug, Team:Elastic-Agent-Control-Plane, and backport-8.x labels on Jan 29, 2025
@pkoutsovasilis pkoutsovasilis self-assigned this Jan 29, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch 3 times, most recently from e001b10 to 9093b52 on January 29, 2025 09:08
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review January 30, 2025 07:05
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner January 30, 2025 07:05
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch from 9093b52 to 3e3788e on January 30, 2025 17:48
secretKey := tokens[3]

// Wait for the provider to be initialized
<-p.running
Contributor

p.running will be closed, but we have no guarantee the refreshCache goroutine was even started;
we should probably close it once we have finished the first iteration of the refresh, right?

Contributor Author

<-p.running is only here to make any Fetch wait for p.client to be initialised. We don't need to wait for the first iteration to finish, because during the first iteration our cache would be empty; keys are inserted into the cache by Fetch. Does that make sense? 🙂


if !p.config.DisableCache {
    go p.updateSecrets(ctx, comm)
    go p.refreshCache(ctx, comm)
Contributor

I will add the comment here as well:
p.running will be closed, but we have no guarantee the refreshCache goroutine was even started;
we should probably close it once we have finished the first iteration of the refresh, right?
In case the cache is disabled, we can close it right away.

Contributor Author

Again, <-p.running is only there to make any Fetch wait for p.client to be initialised. We do close p.running after we invoke go p.refreshCache(ctx, comm), and during the first iteration the cache will be empty, so there is no extra safety added by closing it after that, right?
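
For readers following the thread: the pattern being discussed is the standard "close a channel to publish readiness" idiom. A standalone sketch with placeholder names (not the provider's real code):

    package main

    import (
        "fmt"
        "time"
    )

    // provider mimics the shape under discussion: running is closed exactly
    // once, after the client is initialised, so every later Fetch can proceed.
    type provider struct {
        running chan struct{}
        client  string // stands in for the real Kubernetes client
    }

    // Run initialises the client (and would start the cache goroutines here),
    // then closes running to unblock any Fetch that is waiting.
    func (p *provider) Run() {
        p.client = "initialised"
        close(p.running) // a closed channel never blocks receivers again
    }

    // Fetch blocks only until Run has signalled that the client exists.
    func (p *provider) Fetch(key string) string {
        <-p.running
        return fmt.Sprintf("would fetch %q using the %s client", key, p.client)
    }

    func main() {
        p := &provider{running: make(chan struct{})}
        go func() {
            time.Sleep(10 * time.Millisecond) // simulate startup latency
            p.Run()
        }()
        fmt.Println(p.Fetch("kubernetes_secrets.default.my-secret.token"))
    }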

// no existing secret in the cache thus add it
return true
}
if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) {
Contributor

Could we compare (just for readability)
sd.apiFetchTime.After(existing.apiFetchTime)
instead? This reads better: update if the current value was fetched after the existing one was.

Contributor Author

If I go with sd.apiFetchTime.After(existing.apiFetchTime) and the times of sd and existing happen to be exactly equal (I know it is very unlikely, but it can happen), then I would lose a secret update even though the value has changed. Makes sense? 🙂

// when deleted from the map. More importantly working with a pointer makes the entry in the map bucket, that doesn't
// get deallocated, to utilise only 8 bytes on a 64-bit system.
type expirationCache struct {
sync.RWMutex
Contributor

Are we sure we need an RWMutex? It does not necessarily add any performance gain, and if writes are frequent it can even result in worse performance.
I'm leaning more towards a normal Mutex, or we should benchmark to see the benefits of having an RWMutex.

Contributor Author

The choice of RWMutex was based solely on the expectation that access would be mostly read-based: each secret value is inserted once, either during the cache refresh or during a Fetch, and from then on every lookup hits the cache for the duration of the TTL. Having a cache should make us read-heavy; otherwise, on each write we would also be fetching from the API server, which kind of defeats the purpose? I can still switch to a plain Mutex, just let me know 🙂
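
If it helps settle the question, a quick benchmark along these lines, dropped into a _test.go file, would show whether RWMutex pays off for read-mostly access (illustrative only, not part of the PR):

    package kubernetessecrets

    import (
        "sync"
        "testing"
    )

    // Run with: go test -bench=BenchmarkRead -cpu=1,4,8
    // Both benchmarks do the same map lookup; only the lock type differs.
    var benchItems = map[string]string{"default/my-secret/token": "value"}

    func BenchmarkReadWithMutex(b *testing.B) {
        var mu sync.Mutex
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.Lock()
                _ = benchItems["default/my-secret/token"]
                mu.Unlock()
            }
        })
    }

    func BenchmarkReadWithRWMutex(b *testing.B) {
        var mu sync.RWMutex
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.RLock()
                _ = benchItems["default/my-secret/token"]
                mu.RUnlock()
            }
        })
    }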

Successfully merging this pull request may close these issues.

Deletion race condition when fetching secret from the Kubernetes secrets provider cache