Drivers can create CertificateRequests for pods that don't exist in very rare edge cases #8

munnerz opened this issue May 27, 2021 · 1 comment

munnerz commented May 27, 2021

In the NodePublishVolume call, we have a defer that calls UnmanageVolume (and deletes metadata from the storage backend) if initial issuance fails:

```go
// clean up after ourselves if provisioning fails.
// this is required because if publishing never succeeds, unpublish is not
// called which leaves files around (and we may continue to renew if so).
success := false
defer func() {
	if !success {
		ns.manager.UnmanageVolume(req.GetVolumeId())
		_ = ns.mounter.Unmount(req.GetTargetPath())
		_ = ns.store.RemoveVolume(req.GetVolumeId())
	}
}()
```

If the driver is stopped during this initial 30s period, and the pod is also deleted whilst the driver is stopped, the deferred cleanup never runs. Because the publish step never succeeded, the UnpublishVolume step will never be called in future either.

Upon the driver starting up again, it will read the metadata.json file and then attempt to request the certificate for that pod again.

Because the pod no longer exists, the UnpublishVolume step will never be called, so the certificate data will never be cleaned up on disk. The driver will continue to process renewals for the volume indefinitely, until an administrator manually removes the metadata file on disk and triggers a restart of the driver.

We should do whatever we can to avoid this state occurring, as it causes excessive churn on nodes, in the apiserver, and for signers.

One option would be: on startup of the driver, if a metadata.json file is found that does not have a nextIssuanceTime set (which implies issuance has never succeeded), delete/clean up that data on disk and await NodePublishVolume being called again (to cover the case where the pod does still exist and is waiting to start up). There may be edge cases we have not thought of, however, whereby the pod does still exist and is provisioned but for some reason this field is not set. Though I don't think that is a possible state to get into...
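Below is a minimal sketch (in Go) of what that startup cleanup could look like. The on-disk layout (one directory per volume containing a metadata.json), the cleanupIncompleteVolumes name, and the exact JSON key are assumptions for illustration; the real csi-lib storage backend may well differ.

```go
package cleanup

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
	"time"
)

// metadata mirrors just the field we care about from metadata.json.
type metadata struct {
	// NextIssuanceTime is only set once an issuance has succeeded.
	NextIssuanceTime *time.Time `json:"nextIssuanceTime,omitempty"`
}

// cleanupIncompleteVolumes walks dataDir (assumed: one sub-directory per
// volume) and removes any volume whose metadata.json has no
// nextIssuanceTime set, i.e. whose initial issuance never completed.
func cleanupIncompleteVolumes(dataDir string) error {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		volDir := filepath.Join(dataDir, e.Name())
		raw, err := os.ReadFile(filepath.Join(volDir, "metadata.json"))
		if err != nil {
			log.Printf("skipping %s: %v", volDir, err)
			continue
		}
		var md metadata
		if err := json.Unmarshal(raw, &md); err != nil {
			log.Printf("skipping %s: malformed metadata: %v", volDir, err)
			continue
		}
		if md.NextIssuanceTime != nil {
			// Issuance succeeded at least once; leave this volume
			// to the normal manage/renew flow.
			continue
		}
		log.Printf("removing never-issued volume data at %s", volDir)
		if err := os.RemoveAll(volDir); err != nil {
			return err
		}
	}
	return nil
}
```

If the pod does still exist, the kubelet will call NodePublishVolume again and the data will be re-created, so removing never-issued data on startup should be safe.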

@JoshVanL commented

We may also want to consider using a separate field in the metadata to represent this. Using nextIssuanceTime is a bit indirect and surprising, whereas a dedicated mounted: true (not the best name) is far clearer and has little downside AFAIK.
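For illustration, a rough sketch of what that dedicated field could look like on the metadata struct (the struct and field names here are hypothetical, not the actual csi-lib types):

```go
package storage

import "time"

// Metadata is a hypothetical sketch of the per-volume metadata.json
// contents; the real csi-lib struct may differ.
type Metadata struct {
	VolumeID   string `json:"volumeID"`
	TargetPath string `json:"targetPath"`

	// Mounted is set to true only once NodePublishVolume has fully
	// succeeded. On startup, metadata with Mounted == false can be
	// cleaned up without inferring state from NextIssuanceTime.
	Mounted bool `json:"mounted"`

	// NextIssuanceTime is set once an issuance succeeds and drives
	// renewals; under this proposal it is no longer overloaded as a
	// "has ever issued" signal.
	NextIssuanceTime *time.Time `json:"nextIssuanceTime,omitempty"`
}
```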
