Drivers can create CertificateRequests for pods that don't exist in very rare edge cases #8

munnerz opened this issue May 27, 2021 · 1 comment

munnerz commented May 27, 2021

In the NodePublishVolume call, we have a defer that calls UnmanageVolume (and deletes metadata from the storage backend) if initial issuance fails:

```go
// clean up after ourselves if provisioning fails.
// this is required because if publishing never succeeds, unpublish is not
// called which leaves files around (and we may continue to renew if so).
success := false
defer func() {
	if !success {
		ns.manager.UnmanageVolume(req.GetVolumeId())
		_ = ns.mounter.Unmount(req.GetTargetPath())
		_ = ns.store.RemoveVolume(req.GetVolumeId())
	}
}()
```

If the driver is stopped during this initial 30s period, and the pod is also deleted whilst the driver is stopped, the deferred cleanup never runs. Because the publish step never succeeded, the UnpublishVolume step will never be called in future either.

Upon the driver starting up again, it will read the metadata.json file and then attempt to request the certificate for that pod again.

Because the pod no longer exists, the UnpublishVolume step will never be called, so the certificate data will never be cleaned up on disk. The driver will continue to process renewals for the volume indefinitely, until an administrator manually removes the metadata file on disk and triggers a restart of the driver.

We should do whatever we can to avoid this state occurring, as it causes excessive churn on nodes, in the apiserver, and for signers.

One option would be: on startup of the driver, if a metadata.json file is found that does not have a nextIssuanceTime set (which implies issuance has never succeeded), delete/clean up that data on disk and await NodePublishVolume being called again (to cover the case where the pod does still exist and is waiting to start up). There may be edge cases we have not thought of, however, whereby the pod does still exist and is provisioned but for some reason this field is not set. Though I don't think that is a possible state to get into...
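Below is a minimal sketch (in Go) of what that startup cleanup could look like. The on-disk layout (one directory per volume containing a metadata.json), the cleanupIncompleteVolumes name, and the exact JSON key are assumptions for illustration; the real csi-lib storage backend may well differ.

```go
package cleanup

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
	"time"
)

// metadata mirrors just the field we care about from metadata.json.
type metadata struct {
	// NextIssuanceTime is only set once an issuance has succeeded.
	NextIssuanceTime *time.Time `json:"nextIssuanceTime,omitempty"`
}

// cleanupIncompleteVolumes walks dataDir (assumed: one sub-directory per
// volume) and removes any volume whose metadata.json has no
// nextIssuanceTime set, i.e. whose initial issuance never completed.
func cleanupIncompleteVolumes(dataDir string) error {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		volDir := filepath.Join(dataDir, e.Name())
		raw, err := os.ReadFile(filepath.Join(volDir, "metadata.json"))
		if err != nil {
			log.Printf("skipping %s: %v", volDir, err)
			continue
		}
		var md metadata
		if err := json.Unmarshal(raw, &md); err != nil {
			log.Printf("skipping %s: malformed metadata: %v", volDir, err)
			continue
		}
		if md.NextIssuanceTime != nil {
			// Issuance succeeded at least once; leave this volume
			// to the normal manage/renew flow.
			continue
		}
		log.Printf("removing never-issued volume data at %s", volDir)
		if err := os.RemoveAll(volDir); err != nil {
			return err
		}
	}
	return nil
}
```

If the pod does still exist, the kubelet will call NodePublishVolume again and the data will be re-created, so removing never-issued data on startup should be safe.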

@JoshVanL commented

We may also want to consider using a separate field in the metadata to represent this. Using nextIssuanceTime is a bit indirect and surprising, whereas a dedicated mounted: true (not the best name) is far clearer and has little downside AFAIK.
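For illustration, a rough sketch of what that dedicated field could look like on the metadata struct (the struct and field names here are hypothetical, not the actual csi-lib types):

```go
package storage

import "time"

// Metadata is a hypothetical sketch of the per-volume metadata.json
// contents; the real csi-lib struct may differ.
type Metadata struct {
	VolumeID   string `json:"volumeID"`
	TargetPath string `json:"targetPath"`

	// Mounted is set to true only once NodePublishVolume has fully
	// succeeded. On startup, metadata with Mounted == false can be
	// cleaned up without inferring state from NextIssuanceTime.
	Mounted bool `json:"mounted"`

	// NextIssuanceTime is set once an issuance succeeds and drives
	// renewals; under this proposal it is no longer overloaded as a
	// "has ever issued" signal.
	NextIssuanceTime *time.Time `json:"nextIssuanceTime,omitempty"`
}
```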
