Task/NVPE-10: Create NIM Account controller #289

TomerFi · 2024-11-07T08:02:29Z

Description

Added a controller for reconciling NIM accounts.
Added test cases for the controller, including mocks for the NVIDIA NIM API.
Added a "hack" script for troubleshooting NVIDIA API access.

Jira: https://issues.redhat.com/browse/NVPE-10

How Has This Been Tested?

Tested manually on my station using the script and helper code.
EDIT (Nov. 12):
Also tested against a live cluster; the tests are described in the following Google doc.

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

openshift-ci-robot · 2024-11-07T08:02:32Z

@TomerFi: This pull request references NVPE-10 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.18.0" version, but no target version was set.


openshift-ci · 2024-11-07T08:02:40Z

Hi @TomerFi. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


lburgazzoli · 2024-11-07T09:04:17Z

controllers/nim_account_controller.go

+	if err := r.Client.Get(ctx, req.NamespacedName, account); err != nil {
+		if k8serrors.IsNotFound(err) {
+			logger.V(1).Info("account deleted")
+			r.cleanupResources(ctx, req.Namespace)


It seems that the resources affected by the cleanup have the Account as their owner. I guess they get cleaned up automatically by the k8s GC, so this call may not be needed?

Resolved in 7929b53.

You know what, @lburgazzoli, I just realized something. Yes, you were right: we don't need to delete the resources when an account is deleted; k8s will handle this.
However, the function I removed in the above commit was also used when the API key Secret was deleted, and this still needs to be handled.

So I brought back the cleanup utility function, but it is now used only when the Secret is deleted.

(re)Resolved in 15dccb2.
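The outcome of this thread can be summarized as a small decision function. The following is a minimal sketch in plain Go, not the controller's code: the `Event` type and `needsExplicitCleanup` are hypothetical names introduced for illustration. It only encodes what was agreed above: resources owned by the Account are left to Kubernetes garbage collection, while deletion of the API key Secret still calls for an explicit cleanup.

```go
package main

import "fmt"

// Event is a hypothetical label for what triggered a reconcile.
type Event int

const (
	AccountDeleted Event = iota // the Account CR is gone
	SecretDeleted               // the API key Secret is gone, the Account remains
	AccountChanged              // a regular create or update
)

// needsExplicitCleanup reports whether the controller itself has to delete
// the derived resources. When the Account is deleted, its owned resources
// are collected by the Kubernetes garbage collector through their owner
// references, so only the Secret-deleted case needs an explicit cleanup.
func needsExplicitCleanup(e Event) bool {
	return e == SecretDeleted
}

func main() {
	for _, e := range []Event{AccountDeleted, SecretDeleted, AccountChanged} {
		fmt.Printf("event=%d explicitCleanup=%v\n", e, needsExplicitCleanup(e))
	}
}
```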

lburgazzoli · 2024-11-07T09:09:36Z

controllers/nim_account_controller.go

+	meta.SetStatusCondition(&targetStatus.Conditions, makePullSecretUnknownCondition(account.Generation, "not reconciled yet"))
+
+	defer func() {
+		r.updateStatus(ctx, req.NamespacedName, *targetStatus)


Given that the status is updated every time the function returns, it seems there are cases where the conditions report wrong information. For example, if the function returns because utils.GetAvailableNimRuntimes() reported an error, the condition would say something like "api key secret not available", which may not be true.

Not sure if that is expected.

Resolved in 8f87716.

lburgazzoli · 2024-11-07T09:14:46Z

controllers/nim_account_controller.go

+		},
+	}
+
+	if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cmap, func() error {


Not sure if it is applicable, but you could use Server-Side Apply instead of CreateOrUpdate, which may reduce calls and conflicts.

Resolved in de00250.

TomerFi · 2024-11-07T23:33:01Z

main.go

@@ -84,6 +84,8 @@ func init() { //nolint:gochecknoinits //reason this way we ensure schemes are al
 // +kubebuilder:rbac:groups=authorino.kuadrant.io,resources=authconfigs,verbs=get;list;watch;create;update;patch;delete
 // +kubebuilder:rbac:groups=datasciencecluster.opendatahub.io,resources=datascienceclusters,verbs=get;list;watch
 // +kubebuilder:rbac:groups=dscinitialization.opendatahub.io,resources=dscinitializations,verbs=get;list;watch
+// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update


Is this required?
We're not creating, patching, or deleting accounts.

I missed the status perms, fixed here: 9743066.

We now have:

// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts,verbs=get;list;watch;update
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/status,verbs=get;list;watch;update

I think this should suffice; please resolve if you agree or comment if you do not. 😄

I am resolving this. But please double-check the logs of the operator after it runs for a few hours, and make sure there are no permission-related issues in the logs.

And don't forget to test updating and deleting the Account CR.
These events could require additional permissions.

@xieshenzh You were right, finalizer permissions are also required:

cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on
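This error is typically returned by the API server when a controller sets blockOwnerDeletion on an owner reference but is not allowed to update the owner's finalizers subresource. The marker actually added in the PR is not shown in this thread; the fragment below is an assumption that follows the usual kubebuilder pattern and reuses the group and resource names from the markers quoted earlier.

```go
// Assumed marker, following the usual kubebuilder pattern; not copied from the PR.
// +kubebuilder:rbac:groups=nim.opendatahub.io,resources=accounts/finalizers,verbs=update
```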

@xieshenzh I documented the tests I performed in this doc (updated the PR body).

lburgazzoli · 2024-11-08T07:10:49Z

controllers/nim_account_controller.go

+
+// createOwnerReferenceCfg is used to create an owner reference config to use with server side apply
+func (r *NimAccountReconciler) createOwnerReferenceCfg(account *v1.Account) *ssametav1.OwnerReferenceApplyConfiguration {
+	gvks, _, _ := r.Scheme().ObjectKinds(account)


nit: given the usage path, the GVK of the account is probably already set at this point, so resolving it via the scheme is probably not needed.

note: this is not a blocker, just a suggestion, but it seems there are some issues that prevent implementing it as described

Added a comment referring to this in 53cf8fa.

Basically, what happens is that while working as expected in a live (kind) cluster, when running using envtest, the object's TypeMeta is being stripped. See the attached screenshots:

Live

EnvTest

This was tested with the latest envtest (0.19.1) using kube 1.26, 1.29, and 1.30.

Jooho · 2024-11-08T23:44:54Z

/ok-to-test

TomerFi · 2024-11-09T03:20:39Z

/retest


israel-hdez · 2024-11-12T01:01:48Z

controllers/nim_account_controller.go

+		logger.Error(err, msg)
+		meta.SetStatusCondition(&targetStatus.Conditions, makeAccountFailureCondition(account.Generation, msg))
+		meta.SetStatusCondition(&targetStatus.Conditions, makeApiKeyFailureCondition(account.Generation, msg))
+		r.cleanupResources(ctx, account)


What is the motivation for the cleanup in case of errors?

I have the thought that this can momentarily break a working environment during a controller resync.

We don't clean up for all errors, only for errors relating to the API key validation. The motivation is to remove the model info from the cluster in case of a failed validation. We can remove this if you prefer. I don't think it was a requirement. It may be something that we did in TP and may no longer be required here since we use status conditions.

@mpaulgreen, @xieshenzh, Any thoughts?

@xieshenzh wrote the code in the cronjob. So, this is the code that deleted the associated resources - https://github.com/opendatahub-io/odh-dashboard/pull/2959/files#diff-654ddbcfb120f3fa423302828fb6b0e72b2a680a025a7a09bc24ccfc4ea2abd9R103-R110. I am not sure why we want to keep the associated configmap and template if api key is invalid? There could be downstream impact too... @yzhao583 @olavtar any issues if we don't remove the secret, cm and template? But @israel-hdez this is a very common pattern.

I am not sure why we want to keep the associated configmap and template if api key is invalid?

If the API Key is invalid, I'm OK with the clean-up. Although, my question is general to the clean-up rather than specific to this if block; i.e. there are other cleanups before this one. And, also, I'm thinking mainly about re-syncs.

AFAIK, the controller will do a re-sync (i.e. it will run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. As this reconcile is using external services, a re-sync may encounter temporary issues (e.g. API Key validation, or fetching runtimes returning 500 status code, or the request timing out). Such temporary errors will lead to a clean-up, even though everything is still valid.

Since the controller is returning an error, the controller framework would quickly trigger another reconcile, which may re-create all the resources if it succeeds. The small window of the deleted resources is my concern. I haven't seen how these resources are used later.

So, if you don't see an issue with the missing resources for a ¿short? window of time, then I'm good with the clean-up.

If the API Key is invalid, I'm OK with the clean-up. Although, my question is general to the clean-up rather than specific to this if block; i.e. there are other cleanups before this one. And, also, I'm thinking mainly about re-syncs.

We clean up in four scenarios, all of which qualify as a failure to validate the API key:

1. We failed to fetch the Secret encapsulating the API key; see here.

2. We got the Secret, but it doesn't have the required key; see here.

3. We failed to fetch NIM's available runtimes; in this case, we don't have runtime info to validate the key with; see here.

4. The actual validation fails; see here.

AFAIK, the controller will do a re-sync (i.e. it will run a reconcile cycle) automatically after a 10-hour period even if nothing has changed in the cluster. As this reconcile is using external services, a re-sync may encounter temporary issues (e.g. API Key validation, or fetching runtimes returning 500 status code, or the request timing out). Such temporary errors will lead to a clean-up, even though everything is still valid.

Legitimate concern. Perhaps we should reconsider the cleanups.

Since the controller is returning an error, the controller framework would quickly trigger another reconcile, which may re-create all the resources if it succeeds. The small window of the deleted resources is my concern. I haven't seen how these resources are used later.

The controller only returns an error for scenarios 2-4 from the ones described above. If the Secret is not found, no error is returned; see here.
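The four scenarios and their outcomes, as described in this reply, can be written down as a small table in code. This is a sketch in plain Go with hypothetical names (`Scenario`, `Outcome`, `outcomeFor`); it restates the comment rather than reproducing the controller.

```go
package main

import "fmt"

// Scenario enumerates the four cleanup cases listed in the reply above.
type Scenario int

const (
	SecretFetchFailed   Scenario = iota + 1 // 1: the Secret could not be fetched (not found)
	SecretKeyMissing                        // 2: the Secret lacks the required key
	RuntimesFetchFailed                     // 3: NIM's available runtimes could not be fetched
	KeyValidationFailed                     // 4: the validation itself failed
)

// Outcome describes what the reconcile does for a given scenario.
type Outcome struct {
	Cleanup     bool // derived resources are removed
	ReturnError bool // the reconcile returns an error to the framework
}

// outcomeFor encodes the behavior described in the thread: all four
// scenarios clean up, but only scenarios 2-4 return an error.
func outcomeFor(s Scenario) Outcome {
	switch s {
	case SecretFetchFailed:
		return Outcome{Cleanup: true, ReturnError: false}
	case SecretKeyMissing, RuntimesFetchFailed, KeyValidationFailed:
		return Outcome{Cleanup: true, ReturnError: true}
	default:
		return Outcome{}
	}
}

func main() {
	for s := SecretFetchFailed; s <= KeyValidationFailed; s++ {
		fmt.Printf("scenario %d: %+v\n", s, outcomeFor(s))
	}
}
```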

So, if you don't see an issue with the missing resources for a ¿short? window of time, then I'm good with the clean-up.

I think we can live with that, but I would be happy if the dashboard people could comment on this, too.
@olavtar @andrewballantyne


TomerFi · 2024-11-13T02:06:56Z

/retest

openshift-ci-robot · 2024-11-13T02:34:13Z

@TomerFi: This pull request references NVPE-10 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.18.0" version, but no target version was set.


israel-hdez

/lgtm
/hold

TomerFi · 2024-11-14T14:46:04Z

/retest

Signed-off-by: Tomer Figenblat <[email protected]>

israel-hdez · 2024-11-14T18:22:42Z

/lgtm
/unhold

openshift-ci-robot added the jira/valid-reference label Nov 7, 2024

openshift-ci bot added the do-not-merge/work-in-progress label Nov 7, 2024

openshift-ci bot added the needs-ok-to-test label Nov 7, 2024

lburgazzoli reviewed Nov 7, 2024

xieshenzh reviewed Nov 7, 2024

TomerFi marked this pull request as ready for review November 8, 2024 01:50

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 8, 2024

openshift-ci bot requested review from Jooho and terrytangyuan November 8, 2024 01:50

lburgazzoli reviewed Nov 8, 2024

openshift-ci bot added ok-to-test and removed needs-ok-to-test labels Nov 8, 2024

TomerFi mentioned this pull request Nov 10, 2024

Create webhook for NIM enablement #288

Merged

3 tasks

TomerFi commented Nov 10, 2024

terrytangyuan reviewed Nov 11, 2024

israel-hdez reviewed Nov 12, 2024

andrewballantyne mentioned this pull request Nov 12, 2024

feat: Modify NIM enablement process opendatahub-io/odh-dashboard#3455

Merged

6 tasks

TomerFi force-pushed the nim-account-controller branch from c777acb to 4d86c0f Compare November 13, 2024 00:38

israel-hdez approved these changes Nov 13, 2024

openshift-ci bot added the do-not-merge/hold label Nov 13, 2024

openshift-ci bot assigned israel-hdez Nov 13, 2024

openshift-ci bot added the lgtm label Nov 13, 2024

TomerFi added 23 commits November 14, 2024 12:41

All 23 commits carry the trailer: Signed-off-by: Tomer Figenblat <[email protected]>

feat: added nim account controller (2c18ef8)
test: added tests for the new nim account controller (d7b2d07)
chore: fixed change request, owned resources are deleted by k8s gc (0158e82)
chore: fixed change request, rearrange status conditions reporting (8d7b462)
chore: fixed change request, use server side apply where applicable (8eab67f)
docs: clarified code docs (0ed4967)
chore: switched to utility func for gvk fetch (7cf4a5b)
chore: added count to utility script (ff4f982)
fix: cleanup resources if failed to validate the api key (d250bee)
test: fixed failing ci test (worked locally) (85c9c8b)
fix: added pagesize to catalog query (d93ee30)
fix: added missing rbac for account status (58c0d41)
chore: fixed change request, clarify variable name (ad4804e)
fix: intial account stauts should be unknown (84855cd)
chore: fixed various small change request (33a822e)
fix: several fixes and requests (b1f2aab)
test: fixed flaky test (c78583b)
fix: missing finalizer permissions (eb06f7d)
fix: missing create perms for templates (e11013c)
test: made test more reliable (c6d0853)
test: fixed flaky test by increasing timeout (7100903)
fix: wrong type of pull secret (82eefad)
test: nim account should be created in seperated namespace per testcases (2377baf)

TomerFi force-pushed the nim-account-controller branch from ecae6b7 to 2377baf Compare November 14, 2024 17:45

openshift-ci bot added lgtm and removed do-not-merge/hold labels Nov 14, 2024

openshift-merge-bot bot merged commit a0c3ec4 into opendatahub-io:main Nov 14, 2024
5 checks passed

TomerFi deleted the nim-account-controller branch November 14, 2024 18:28
