Task/NVPE-10: Create NIM Account controller #289

Merged

@TomerFi TomerFi commented Nov 7, 2024

Description

  • Added a controller that reconciles NIM accounts.
  • Added test cases for the controller, including mocks for the NVIDIA NIM API.
  • Added a "hack" script for troubleshooting NVIDIA API access.

Jira: https://issues.redhat.com/browse/NVPE-10

How Has This Been Tested?

Tested manually on my station using the script and helper code.
EDIT (Nov. 12):
Also tested against a live cluster, tests described in the following google doc.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

openshift-ci-robot commented Nov 7, 2024

@TomerFi: This pull request references NVPE-10 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.18.0" version, but no target version was set.

openshift-ci bot commented Nov 7, 2024

Hi @TomerFi. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

if err := r.Client.Get(ctx, req.NamespacedName, account); err != nil {
	if k8serrors.IsNotFound(err) {
		logger.V(1).Info("account deleted")
		r.cleanupResources(ctx, req.Namespace)

Contributor

It seems that the resources affected by the cleanup have the Account as owner. I guess they get cleaned up automatically by the k8s GC, so this may not be needed?

Contributor Author

Resolved in 7929b53.

Contributor Author

@TomerFi TomerFi Nov 9, 2024

You know what, @lburgazzoli. I'm just realizing something. Yes, you were right. We didn't need to delete the resources when an account was deleted; k8s will handle this.
However, the function I removed in the above commit was also used when the API key Secret was deleted, and this still needs to be handled.

So, I need to bring back the cleanup utility function, but use it only when the Secret is deleted.

Contributor Author

(re)Resolved in 15dccb2.

meta.SetStatusCondition(&targetStatus.Conditions, makePullSecretUnknownCondition(account.Generation, "not reconciled yet"))

defer func() {
	r.updateStatus(ctx, req.NamespacedName, *targetStatus)

Contributor

Given that the status is updated every time the function returns, it seems there are cases where the conditions report wrong information. For example, if the function returns because utils.GetAvailableNimRuntimes() reported an error, the condition would report something like "api key secret not available", which may not be true.

Not sure if that is expected.
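
One way to avoid the stale-condition problem described above is to start every condition as Unknown and let each step overwrite only the condition it owns, so the deferred status update never blames a step that was not reached. The following is a minimal, library-free sketch of that idea. The `Condition` and `Status` types, the condition names, and the `reconcile` signature are hypothetical stand-ins, not the code from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpstream simulates a failure of the external runtimes lookup.
var errUpstream = errors.New("upstream returned HTTP 500")

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Message string
}

// Status is a hypothetical stand-in for the Account status, keyed by condition type.
type Status struct {
	Conditions map[string]Condition
}

func (s *Status) set(condType, status, msg string) {
	s.Conditions[condType] = Condition{Type: condType, Status: status, Message: msg}
}

// reconcile starts every condition as Unknown. Each step overwrites only the
// conditions it is responsible for, and the deferred publish reports whatever
// state was actually reached when the function returned.
func reconcile(fetchRuntimes, validateKey func() error, publish func(Status)) error {
	st := Status{Conditions: map[string]Condition{}}
	st.set("AccountStatus", "Unknown", "not reconciled yet")
	st.set("APIKeyValidation", "Unknown", "not reconciled yet")
	defer func() { publish(st) }()

	if err := fetchRuntimes(); err != nil {
		// The key was never checked, so its condition stays Unknown.
		st.set("AccountStatus", "False", "failed to fetch runtimes: "+err.Error())
		return err
	}
	if err := validateKey(); err != nil {
		st.set("APIKeyValidation", "False", "api key failed validation")
		st.set("AccountStatus", "False", "api key failed validation")
		return err
	}
	st.set("APIKeyValidation", "True", "api key validated")
	st.set("AccountStatus", "True", "reconciled")
	return nil
}

func main() {
	var got Status
	err := reconcile(
		func() error { return errUpstream }, // runtimes lookup fails
		func() error { return nil },         // never reached
		func(s Status) { got = s },
	)
	fmt.Println("error:", err)
	fmt.Println("AccountStatus:", got.Conditions["AccountStatus"].Status)
	fmt.Println("APIKeyValidation:", got.Conditions["APIKeyValidation"].Status)
}
```

With this shape, a failed runtimes lookup marks only the account condition as failed and leaves the key condition as Unknown, which is closer to what is actually known at that point.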

Contributor Author

Resolved in 8f87716.

	},
}

if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cmap, func() error {

Contributor

Not sure if it is applicable, but eventually you could use Server-Side Apply instead of CreateOrUpdate, which may reduce calls and conflicts.

Contributor Author

Resolved in de00250.

@@ -84,6 +84,8 @@ func init() { //nolint:gochecknoinits //reason this way we ensure schemes are al
// +kubebuilder:rbac:groups=authorino.kuadrant.io,resources=authconfigs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=datasciencecluster.opendatahub.io,resources=datascienceclusters,verbs=get;list;watch
// +kubebuilder:rbac:groups=dscinitialization.opendatahub.io,resources=dscinitializations,verbs=get;list;watch
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update

This comment was marked as resolved.

Contributor Author

Is this required?
We're not creating, patching, or deleting accounts.

Contributor Author

I missed the status perms, fixed here: 9743066.

We now have:

// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/status,verbs=get;list;watch;update

I think this should suffice; please resolve if you agree or comment if you do not. 😄

Contributor

I am resolving this. But please double-check the operator logs after it runs for a few hours, and make sure there are no permission-related issues in them.

Contributor

And don't forget to test updating and deleting the Account CR.
These events could require additional permissions.

Contributor Author

@xieshenzh You were right, finalizer permissions are also required:

cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on
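
For context, this error typically appears when a controller sets an owner reference with blockOwnerDeletion on a child object but lacks update permission on the owner's finalizers subresource. A marker set that would cover this might look like the fragment below. The first two lines are the markers quoted earlier in this thread; the third line is an assumption about the shape of the fix, since the actual change is not shown in this conversation.

```go
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/status,verbs=get;list;watch;update
// Assumed addition: lets the controller set blockOwnerDeletion on resources owned by an Account.
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/finalizers,verbs=update
```

After changing markers like these, the generated RBAC manifests need to be regenerated for the new rule to take effect.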

Contributor Author

@xieshenzh I documented the tests I performed in this doc (updated the PR body).

@TomerFi TomerFi marked this pull request as ready for review November 8, 2024 01:50

// createOwnerReferenceCfg is used to create an owner reference config to use with server side apply
func (r *NimAccountReconciler) createOwnerReferenceCfg(account *v1.Account) *ssametav1.OwnerReferenceApplyConfiguration {
	gvks, _, _ := r.Scheme().ObjectKinds(account)

Contributor

nit: given the usage path, the GVK of the account should probably already be set at this point, so resolving it via the scheme is probably not needed.

Contributor

Note: this is not a blocker, just a suggestion, but it seems there are some issues that prevent implementing it as described.

Contributor Author

@TomerFi TomerFi Nov 8, 2024

Added a comment referring to this in 53cf8fa.

Basically, what happens is that while working as expected in a live (kind) cluster, when running using envtest, the object's TypeMeta is being stripped. See the attached screenshots:

Live: [screenshot]

EnvTest: [screenshot]

Contributor Author

This was tested with the latest envtest (0.19.1) using kube 1.26, 1.29, and 1.30.

Contributor

Jooho commented Nov 8, 2024

/ok-to-test

Contributor Author

TomerFi commented Nov 9, 2024

/retest

@TomerFi TomerFi mentioned this pull request Nov 10, 2024

logger.Error(err, msg)
meta.SetStatusCondition(&targetStatus.Conditions, makeAccountFailureCondition(account.Generation, msg))
meta.SetStatusCondition(&targetStatus.Conditions, makeApiKeyFailureCondition(account.Generation, msg))
r.cleanupResources(ctx, account)

Contributor

@israel-hdez israel-hdez Nov 12, 2024

What is the motivation for the cleanup in case of errors?

I have the thought that this can momentarily break a working environment during a controller resync.

Contributor Author

We don't clean up for all errors, only for errors relating to the API key validation. The motivation is to remove the model info from the cluster in case of a failed validation. We can remove this if you prefer. I don't think it was a requirement. It may be something that we did in TP and may no longer be required here since we use status conditions.

@mpaulgreen, @xieshenzh, Any thoughts?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xieshenzh wrote the code in the cronjob. So, this is the code that deleted the associated resources - https://github.com/opendatahub-io/odh-dashboard/pull/2959/files#diff-654ddbcfb120f3fa423302828fb6b0e72b2a680a025a7a09bc24ccfc4ea2abd9R103-R110. I am not sure why we want to keep the associated configmap and template if api key is invalid? There could be downstream impact too... @yzhao583 @olavtar any issues if we don't remove the secret, cm, and template? But @israel-hdez, this is a very common pattern.

Contributor

> I am not sure why we want to keep the associated configmap and template if api key is invalid?

If the API Key is invalid, I'm OK with the clean-up. Although, my question is general to the clean-up rather than specific to this if block; i.e. there are other cleanups before this one. And, also, I'm thinking mainly about re-syncs.

AFAIK, the controller will do a re-sync (i.e. it will run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. As this reconcile uses external services, a re-sync may encounter temporary issues (e.g. API Key validation, or fetching runtimes returning a 500 status code, or the request timing out). Such temporary errors will lead to a clean-up even though everything is still valid.

Since the controller is returning an error, the controller framework would trigger another reconcile quickly, which may re-create all resources if it succeeds. The short window in which the resources are deleted is my concern. I haven't seen how these resources are used later.

So, if you don't see an issue with the missing resources for a ¿short? window of time, then I'm good with the clean-up.
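
To make the re-sync concern concrete, one option is to clean up only when the external service gives a definitive verdict on the key, and to treat everything else as transient. The sketch below is library-free and hypothetical: `shouldCleanup` is an illustrative helper, and the assumption that a rejected key surfaces as HTTP 401 or 403 is not confirmed anywhere in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errTimeout simulates a transport-level failure such as a request timeout.
var errTimeout = errors.New("request timed out")

// shouldCleanup reports whether a validation attempt is definitive enough to
// justify deleting the derived resources. Hypothetical helper for illustration.
func shouldCleanup(statusCode int, err error) bool {
	if err != nil {
		// Timeouts, DNS failures, and resets say nothing about the key itself.
		return false
	}
	switch statusCode {
	case http.StatusUnauthorized, http.StatusForbidden:
		// Assumption: the service explicitly rejected the key.
		return true
	default:
		// Success needs no cleanup; other codes (e.g. 500) are treated as transient.
		return false
	}
}

func main() {
	fmt.Println("401:", shouldCleanup(http.StatusUnauthorized, nil))
	fmt.Println("500:", shouldCleanup(http.StatusInternalServerError, nil))
	fmt.Println("timeout:", shouldCleanup(0, errTimeout))
}
```

Under this policy a 500 or a timeout during a periodic re-sync would still return an error and trigger a retry, but the existing resources would stay in place during that window.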

Contributor Author

> If the API Key is invalid, I'm OK with the clean-up. Although, my question is general to the clean-up rather than specific to this if block; i.e. there are other cleanups before this one. And, also, I'm thinking mainly about re-syncs.

We clean up in four scenarios, all of which qualify as a failure to validate the API key:

  1. We failed to fetch the Secret encapsulating the API key; see here.
  2. We got the Secret, but it doesn't have the required key; see here.
  3. We failed to fetch NIM's available runtimes; in this case, we don't have runtime info to validate the key with; see here.
  4. The actual validation fails; see here.

> AFAIK, the controller will do a re-sync (i.e. it will run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. As this reconcile uses external services, a re-sync may encounter temporary issues (e.g. API Key validation, or fetching runtimes returning a 500 status code, or the request timing out). Such temporary errors will lead to a clean-up even though everything is still valid.

Legitimate concern. Perhaps we should reconsider the cleanups.

> Since the controller is returning an error, the controller framework would trigger another reconcile quickly, which may re-create all resources if it succeeds. The short window in which the resources are deleted is my concern. I haven't seen how these resources are used later.

The controller only returns an error for scenarios 2-4 from the ones described above. If the Secret is not found, no error is returned; see here.

> So, if you don't see an issue with the missing resources for a ¿short? window of time, then I'm good with the clean-up.

I think we can live with that, but I would be happy if the dashboard people could comment on this, too.
@olavtar @andrewballantyne
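
The four scenarios and their outcomes, as described in this reply, can be summarized in a small table-like sketch. The type and constant names below are hypothetical and exist only to illustrate the behavior stated above: every scenario cleans up, but only scenarios 2-4 return an error and therefore trigger a quick retry.

```go
package main

import "fmt"

// scenario enumerates the four cleanup cases listed above (hypothetical names).
type scenario int

const (
	secretFetchFailed   scenario = iota + 1 // 1. could not fetch the Secret
	secretMissingKey                        // 2. Secret lacks the required key
	runtimesFetchFailed                     // 3. could not fetch available runtimes
	keyValidationFailed                     // 4. the validation itself failed
)

// outcome describes what the reconcile does for a scenario.
type outcome struct {
	cleanup     bool // derived resources are removed
	returnError bool // reconcile returns an error, so the framework retries
}

// outcomeFor encodes the behavior described in the reply: cleanup always,
// error only for scenarios 2-4.
func outcomeFor(s scenario) outcome {
	switch s {
	case secretFetchFailed:
		return outcome{cleanup: true, returnError: false}
	case secretMissingKey, runtimesFetchFailed, keyValidationFailed:
		return outcome{cleanup: true, returnError: true}
	default:
		return outcome{}
	}
}

func main() {
	for s := secretFetchFailed; s <= keyValidationFailed; s++ {
		o := outcomeFor(s)
		fmt.Printf("scenario %d: cleanup=%t returnError=%t\n", s, o.cleanup, o.returnError)
	}
}
```

Laid out this way, scenario 3 stands out as the one most exposed to transient failures of the external service, which is the case the re-sync concern is about.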

Contributor Author

TomerFi commented Nov 13, 2024

/retest

openshift-ci-robot commented Nov 13, 2024

@TomerFi: This pull request references NVPE-10 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.18.0" version, but no target version was set.


Contributor

@israel-hdez israel-hdez left a comment

/lgtm
/hold

Contributor Author

TomerFi commented Nov 14, 2024

/retest

Signed-off-by: Tomer Figenblat <[email protected]>
@israel-hdez
Contributor

/lgtm
/unhold

@openshift-merge-bot openshift-merge-bot bot merged commit a0c3ec4 into opendatahub-io:main Nov 14, 2024
5 checks passed
@TomerFi TomerFi deleted the nim-account-controller branch November 14, 2024 18:28