Task/NVPE-10: Create NIM Account controller #289
Conversation
@TomerFi: This pull request references NVPE-10 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.18.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hi @TomerFi. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
if err := r.Client.Get(ctx, req.NamespacedName, account); err != nil {
	if k8serrors.IsNotFound(err) {
		logger.V(1).Info("account deleted")
		r.cleanupResources(ctx, req.Namespace)
It seems that the resources affected by the cleanup have the Account as owner; I guess they get cleaned up automatically by the k8s GC, so this may not be needed?
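A minimal sketch of that idea (the helper, ConfigMap name, and import path are illustrative, not the PR's actual code): when a dependent object is created with the Account set as its controller owner, Kubernetes garbage collection cascade-deletes it once the Account is gone, so no explicit cleanup is needed.

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	v1 "github.com/opendatahub-io/odh-model-controller/api/nim/v1" // illustrative import path
)

// newOwnedConfigMap builds a ConfigMap owned by the given Account. Because the
// Account is set as the controller owner, deleting the Account lets the k8s
// garbage collector remove the ConfigMap automatically.
func newOwnedConfigMap(account *v1.Account, scheme *runtime.Scheme) (*corev1.ConfigMap, error) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      account.Name + "-data", // illustrative name
			Namespace: account.Namespace,
		},
	}
	// The controller reference is what triggers the cascade deletion.
	if err := controllerutil.SetControllerReference(account, cm, scheme); err != nil {
		return nil, err
	}
	return cm, nil
}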
Resolved in 7929b53.
You know what, @lburgazzoli, I'm just realizing something. Yes, you were right: we didn't need to delete the resources when an account was deleted; k8s handles that.
However, the function I removed in the above commit was also used when the API key Secret was deleted, and that case still needs to be handled.
So I need to bring back the cleanup utility function, but only use it when the Secret is deleted.
(re)Resolved in 15dccb2.
meta.SetStatusCondition(&targetStatus.Conditions, makePullSecretUnknownCondition(account.Generation, "not reconciled yet"))

defer func() {
	r.updateStatus(ctx, req.NamespacedName, *targetStatus)
Given that the status is updated every time the function returns, it seems there is a case where the conditions report wrong information: e.g., if the function returns because utils.GetAvailableNimRuntimes() reported an error, the condition would still say something like "api key secret not available", which may not be true.
Not sure if that is expected.
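A sketch of one possible fix (a reconcile-body fragment, not the PR's actual code; helper names mirror the ones used elsewhere in this PR and the message is illustrative): set a condition describing the actual failure in the error branch before returning, so the deferred status update publishes an accurate reason instead of the initial placeholders.

// Inside Reconcile, after the deferred status update has been registered:
runtimes, err := utils.GetAvailableNimRuntimes() // signature assumed from this PR's usage
if err != nil {
	msg := "failed to fetch the available NIM runtimes"
	logger.Error(err, msg)
	// Record the failure that actually happened; the deferred updateStatus
	// call publishes this condition when the function returns.
	meta.SetStatusCondition(&targetStatus.Conditions, makeAccountFailureCondition(account.Generation, msg))
	return ctrl.Result{}, err
}
_ = runtimes // validation continues with the fetched runtimes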
Resolved in 8f87716.
	},
}

if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cmap, func() error {
Not sure if it is applicable, but you could eventually use Server Side Apply instead of CreateOrUpdate, which may reduce API calls and conflicts.
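For illustration, a minimal sketch of what that could look like with a controller-runtime client (the function, ConfigMap name, and field-owner string are illustrative, not the PR's actual values):

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyDataConfigMap creates or updates the ConfigMap with a single
// Server Side Apply patch instead of the get-mutate-update round trip done by
// controllerutil.CreateOrUpdate, reducing API calls and avoiding
// optimistic-concurrency conflicts.
func applyDataConfigMap(ctx context.Context, c client.Client, namespace string, data map[string]string) error {
	cmap := &corev1.ConfigMap{
		// TypeMeta must be populated for an apply patch.
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{Name: "nim-data", Namespace: namespace},
		Data:       data,
	}
	return c.Patch(ctx, cmap, client.Apply,
		client.ForceOwnership, client.FieldOwner("odh-nim-account-controller"))
}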
Resolved in de00250.
@@ -84,6 +84,8 @@ func init() { //nolint:gochecknoinits //reason this way we ensure schemes are al
// +kubebuilder:rbac:groups=authorino.kuadrant.io,resources=authconfigs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=datasciencecluster.opendatahub.io,resources=datascienceclusters,verbs=get;list;watch
// +kubebuilder:rbac:groups=dscinitialization.opendatahub.io,resources=dscinitializations,verbs=get;list;watch
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update
Is this required?
We're not creating, patching, or deleting accounts.
I missed the status perms, fixed here: 9743066.
We now have:
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/status,verbs=get;list;watch;update
I think this should suffice; please resolve if you agree or comment if you do not. 😄
I am resolving this. But please double check the logs of the operator after it runs for a few hours, and make sure there are no issues related to permissions in the logs.
And don't forget to test updating and deleting the Account CR.
These events could require additional permissions.
@xieshenzh You were right, finalizer permissions are also required:
cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on
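For context, the usual kubebuilder way to grant that is an extra marker on the finalizers subresource (a sketch; the exact marker added in this PR may differ):

// Allows the controller to update metadata.finalizers on Accounts, which the
// API server requires before owner references with blockOwnerDeletion can be set.
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/finalizers,verbs=update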
@xieshenzh I documented the tests I performed in this doc (updated the PR body).
// createOwnerReferenceCfg is used to create an owner reference config to use with server side apply
func (r *NimAccountReconciler) createOwnerReferenceCfg(account *v1.Account) *ssametav1.OwnerReferenceApplyConfiguration {
	gvks, _, _ := r.Scheme().ObjectKinds(account)
nit: given the usage path, the GVK of the account should already be set at this point, so resolving it via the scheme is probably not needed.
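A sketch of what the suggestion would look like (illustrative only; as noted later in this thread, envtest strips the object's TypeMeta, which is what prevented this approach):

package controllers

import (
	ssametav1 "k8s.io/client-go/applyconfigurations/meta/v1"

	v1 "github.com/opendatahub-io/odh-model-controller/api/nim/v1" // illustrative import path
)

// ownerReferenceFromObject reads the GVK straight from the Account instead of
// resolving it via the scheme; this only works when the object's TypeMeta is
// actually populated.
func ownerReferenceFromObject(account *v1.Account) *ssametav1.OwnerReferenceApplyConfiguration {
	gvk := account.GetObjectKind().GroupVersionKind()
	return ssametav1.OwnerReference().
		WithAPIVersion(gvk.GroupVersion().String()).
		WithKind(gvk.Kind).
		WithName(account.Name).
		WithUID(account.GetUID())
}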
note: this is not a blocker, just a suggestion, but it seems there are some issues that prevent implementing it as described.
Added a comment referring to this in 53cf8fa.
Basically, what happens is that while this works as expected in a live (kind) cluster, when running with envtest the object's TypeMeta is stripped. See the attached screenshots:
This was tested with the latest envtest (0.19.1) using kube 1.26, 1.29, and 1.30.
/ok-to-test
/retest
logger.Error(err, msg)
meta.SetStatusCondition(&targetStatus.Conditions, makeAccountFailureCondition(account.Generation, msg))
meta.SetStatusCondition(&targetStatus.Conditions, makeApiKeyFailureCondition(account.Generation, msg))
r.cleanupResources(ctx, account)
What is the motivation for the cleanup in case of errors?
My concern is that this can momentarily break a working environment during a controller re-sync.
We don't clean up for all errors, only for errors related to the API key validation. The motivation is to remove the model info from the cluster when validation fails. We can remove this if you prefer; I don't think it was a requirement. It may be something we did in the TP that is no longer needed here, since we now use status conditions.
@mpaulgreen, @xieshenzh, any thoughts?
@xieshenzh wrote the code in the cron job. So, this is the code that deleted the associated resources: https://github.com/opendatahub-io/odh-dashboard/pull/2959/files#diff-654ddbcfb120f3fa423302828fb6b0e72b2a680a025a7a09bc24ccfc4ea2abd9R103-R110. I am not sure why we would want to keep the associated ConfigMap and template if the API key is invalid. There could be downstream impact too... @yzhao583 @olavtar any issues if we don't remove the Secret, ConfigMap, and template? But @israel-hdez, this is a very common pattern.
I am not sure why we would want to keep the associated ConfigMap and template if the API key is invalid.
If the API key is invalid, I'm OK with the clean-up, although my question is about the clean-up in general rather than this specific if block; i.e. there are other cleanups before this one. I'm also thinking mainly about re-syncs.
AFAIK, the controller will do a re-sync (i.e. run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. Since this reconcile uses external services, a re-sync may encounter temporary issues (e.g. API key validation or fetching runtimes returning a 500 status code, or the request timing out). Such temporary errors would lead to a clean-up, even though everything is still valid.
Since the controller returns an error, the controller framework would trigger another reconcile quickly, which may re-create all the resources if it succeeds. The small window in which the resources are deleted is my concern; I haven't seen how these resources are used later.
So, if you don't see an issue with the resources being missing for a (short?) window of time, then I'm good with the clean-up.
If the API key is invalid, I'm OK with the clean-up, although my question is about the clean-up in general rather than this specific if block; i.e. there are other cleanups before this one. I'm also thinking mainly about re-syncs.
We clean up in four scenarios, all of which qualify as a failure to validate the API key:
- We failed to fetch the Secret encapsulating the API key; see here.
- We got the Secret, but it doesn't have the required key; see here.
- We failed to fetch NIM's available runtimes; in this case, we don't have runtime info to validate the key with; see here.
- The actual validation fails; see here.
AFAIK, the controller will do a re-sync (i.e. run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. Since this reconcile uses external services, a re-sync may encounter temporary issues (e.g. API key validation or fetching runtimes returning a 500 status code, or the request timing out). Such temporary errors would lead to a clean-up, even though everything is still valid.
Legitimate concern. Perhaps we should reconsider the cleanups.
Since the controller returns an error, the controller framework would trigger another reconcile quickly, which may re-create all the resources if it succeeds. The small window in which the resources are deleted is my concern; I haven't seen how these resources are used later.
The controller only returns an error for scenarios 2-4 above. If the Secret is not found, no error is returned; see here.
So, if you don't see an issue with the resources being missing for a (short?) window of time, then I'm good with the clean-up.
I think we can live with that, but I would be happy if the dashboard people could comment on this, too.
@olavtar @andrewballantyne
Force-pushed from c777acb to 4d86c0f
/retest
/lgtm
/hold
/retest
Signed-off-by: Tomer Figenblat <[email protected]>
Force-pushed from ecae6b7 to 2377baf
/lgtm
Description
Jira: https://issues.redhat.com/browse/NVPE-10
How Has This Been Tested?
Tested manually on my workstation using the script and helper code.
EDIT (Nov. 12):
Also tested against a live cluster; the tests are described in the following Google doc.
Merge criteria: