
✨ Add reference to HostUpdatePolicy in Servicing. #1969

Open · wants to merge 5 commits into main from servicing-hostupdatepolicy

Conversation

@rhjanders (Member) commented Sep 20, 2024

What this PR does / why we need it:

This PR enables BMO to run Ironic servicing operations on already provisioned nodes, such as applying firmware settings changes (and, in the future, firmware updates). Servicing is an opt-in feature: it is enabled by creating a HostUpdatePolicy for a node with attributes indicating that firmware configuration changes should be applied onReboot.

This is a partial implementation of https://github.com/metal3-io/metal3-docs/blob/main/design/baremetal-operator/host-live-updates.md (note that only firmware settings changes are currently supported; firmware update support will be added next).
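For context, the opt-in described above could be expressed roughly as follows. This is a simplified Go sketch based on the host-live-updates design proposal; the type, field, and constant names are illustrative stand-ins, not necessarily the final API:

```go
package main

import "fmt"

// HostUpdatePolicyType selects when an update may be applied. The value
// "onReboot" follows the host-live-updates design proposal.
type HostUpdatePolicyType string

const (
	// HostUpdatePolicyOnReboot applies changes on the next reboot of a
	// provisioned host (the only mode this PR implements).
	HostUpdatePolicyOnReboot HostUpdatePolicyType = "onReboot"
)

// HostUpdatePolicySpec is a simplified sketch of the opt-in policy object;
// only firmware settings changes are supported by this PR.
type HostUpdatePolicySpec struct {
	FirmwareSettings HostUpdatePolicyType
	FirmwareUpdates  HostUpdatePolicyType
}

// servicingAllowed reports whether firmware settings servicing is opted in:
// absent policy means no servicing, making the feature strictly opt-in.
func servicingAllowed(policy *HostUpdatePolicySpec) bool {
	return policy != nil && policy.FirmwareSettings == HostUpdatePolicyOnReboot
}

func main() {
	policy := &HostUpdatePolicySpec{FirmwareSettings: HostUpdatePolicyOnReboot}
	fmt.Println("servicing allowed:", servicingAllowed(policy))
	fmt.Println("servicing allowed:", servicingAllowed(nil))
}
```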

@metal3-io-bot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@metal3-io-bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 20, 2024
@metal3-io-bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Sep 20, 2024
)
}

// TODO: Add service steps for firmware updates
Member:

TODO needs to be resolved, although you can do it in the next change.

Member:

Yeah, I will push a PR for this using this one as base.

@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 0d8b518 to 509027a (September 24, 2024)
@rhjanders changed the title from "Add reference to HostUpdatePolicy in Servicing." to "✨ Add reference to HostUpdatePolicy in Servicing." (October 15, 2024)
@rhjanders marked this pull request as ready for review (October 15, 2024)
@metal3-io-bot removed the do-not-merge/work-in-progress label (October 15, 2024)
@rhjanders force-pushed the servicing-hostupdatepolicy branch 3 times, most recently from 0f11a97 to 7f4b773 (October 17, 2024)
@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 87cb67e to 95aa70b (October 17, 2024)
@iurygregory (Member):

LGTM, thanks for working on it @rhjanders

if clearErrorWithStatus(info.host, metal3api.OperationalStatusOK) {
// We need to give the HostFirmwareSettings controller some time to
// catch up with the changes, otherwise we risk starting the same
// operation again.
Member:

I don't think this is adequate. We should avoid race conditions by actually preventing them, not just putting in sleeps until they don't seem to happen any more. In a distributed system, delays can be arbitrarily long.

Specifically, once we are successfully done, I don't think we should try again until the next reboot.

Member:

Then we need to rethink the whole architecture of having separate controllers for HFS, HFC, DataImage and so on (in fact, a similar race is reported for DataImages now).

Member:

The alternative, I guess, is to pass the HFS object all the way here and update its status directly. That is not great either (a responsibility violation), but I guess it is better than just rescheduling and hoping for the best.

@rhjanders (Author) commented Oct 23, 2024:

Discussion consensus: use the generation number to decide whether to start servicing. Error handling can follow the pattern of the manual cleaning / preparing states. Update the in-code comment so that this can be addressed in a follow-up.

Author:

Adding a FIXME note for a follow-up change.
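The generation-based check agreed in this thread could look roughly like the sketch below. The type and field names (`Generation`, `ObservedGeneration`) are illustrative stand-ins for the HFS status, not the PR's actual code:

```go
package main

import "fmt"

// hostFirmwareSettingsStatus is a stand-in for the relevant parts of the
// HostFirmwareSettings (HFS) status; field names here are illustrative.
type hostFirmwareSettingsStatus struct {
	Generation         int64 // metadata.generation of the HFS object
	ObservedGeneration int64 // generation the HFS controller last processed
}

// shouldStartServicing returns true only when the HFS controller has caught
// up with the latest spec. This avoids the race described above: re-running
// an operation that already succeeded but whose status update has not landed
// yet, without relying on sleeps of arbitrary length.
func shouldStartServicing(hfs hostFirmwareSettingsStatus, changesPending bool) bool {
	if hfs.ObservedGeneration < hfs.Generation {
		// The controller has not processed the latest change yet: wait.
		return false
	}
	return changesPending
}

func main() {
	stale := hostFirmwareSettingsStatus{Generation: 3, ObservedGeneration: 2}
	fresh := hostFirmwareSettingsStatus{Generation: 3, ObservedGeneration: 3}
	fmt.Println(shouldStartServicing(stale, true))
	fmt.Println(shouldStartServicing(fresh, true))
}
```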

return actionUpdate{}
}

desiredPowerOnState := info.host.Spec.Online

provState := info.host.Status.Provisioning.State
// Normal reboots only work in provisioned states, changing online is also possible for available hosts.
isProvisioned := provState == metal3api.StateProvisioned || provState == metal3api.StateExternallyProvisioned
Member:

I don't like that we're doing state-dependent stuff outside of the state machine here. Maybe we could pass this in as an argument?

Member:

$ grep metal3api.State controllers/metal3.io/baremetalhost_controller.go | wc -l
9

Also, this is a copy of an existing line (see below).

Author:

Consensus: potentially follow-up material.

Author:

Adding a FIXME note for this as well.
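The suggestion to pass the state-dependent information in as an argument could be sketched as below. The plain string states and the `manageHostPower` signature are hypothetical stand-ins for the metal3api constants and the real power-management code:

```go
package main

import "fmt"

// Illustrative stand-ins for the metal3api provisioning state constants.
const (
	stateProvisioned           = "provisioned"
	stateExternallyProvisioned = "externally provisioned"
	stateAvailable             = "available"
)

// isProvisioned mirrors the duplicated check from the snippet above. The
// idea from the review is that the state machine computes this once and
// passes it down, instead of the power code inspecting the state itself.
func isProvisioned(provState string) bool {
	return provState == stateProvisioned || provState == stateExternallyProvisioned
}

// manageHostPower shows the suggested shape: normal reboots are only
// honoured for provisioned hosts, while available hosts can still have
// their online state changed.
func manageHostPower(desiredPowerOn, provisioned bool) string {
	switch {
	case provisioned:
		return "handle reboots and power state"
	case desiredPowerOn:
		return "power on only"
	default:
		return "power off only"
	}
}

func main() {
	fmt.Println(manageHostPower(true, isProvisioned(stateAvailable)))
	fmt.Println(manageHostPower(true, isProvisioned(stateProvisioned)))
}
```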

pkg/provisioner/ironic/servicing.go (outdated thread, resolved)
if info.host.Status.OperationalStatus == metal3api.OperationalStatusServicing {
targetOperationalStatus = metal3api.OperationalStatusServicing
}
clearErrorWithStatus(info.host, targetOperationalStatus)
Member:

Clearing the error here is only the right thing to do if it is a power state error.
As written, this will have the effect of clearing ServicingErrors before the servicing code has a chance to see them.

Author:

Hopefully addressed together with #1969 (comment).

Author:

TODO(janders): reapply the patch.

result = recordActionFailure(info, metal3api.ServicingError, provResult.ErrorMessage)
return result, true
}
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) {
Member:

This only puts the host into the servicing status after servicing is confirmed to have started. I think it would be better to put it into this status as soon as we know we want it, so that if any errors occur, the status at the time reflects them. This is the approach we take with the provisioning states.

If writing this fails then I think we end up in a race with the power coming back on after a reboot.

Member:

If we do this, we cannot check for info.host.Status.ErrorType on line 1402.

Member:

Okay, we can probably cache restartOnFailure in a local variable before we clear the error. That will do, I guess?
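The caching idea from this exchange could be sketched as follows; the types and names are simplified stand-ins, not the controller's real ones:

```go
package main

import "fmt"

// hostStatus is a simplified stand-in for the BareMetalHost status fields
// involved in this discussion.
type hostStatus struct {
	ErrorType         string
	OperationalStatus string
}

const servicingError = "servicing error"

// startServicing sketches the suggestion above: cache whether we are
// restarting after a failure *before* clearing the error, so the
// information survives the status reset that enters the servicing status
// up front.
func startServicing(status *hostStatus) (restartOnFailure bool) {
	restartOnFailure = status.ErrorType == servicingError
	// Clear the error and enter the servicing status immediately, so a
	// failure at any later point is reflected against the right status.
	status.ErrorType = ""
	status.OperationalStatus = "servicing"
	return restartOnFailure
}

func main() {
	st := &hostStatus{ErrorType: servicingError}
	fmt.Println("restart on failure:", startServicing(st))
	fmt.Println("status now:", st.OperationalStatus)
}
```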

controllers/metal3.io/baremetalhost_controller.go (outdated thread, resolved)
controllers/metal3.io/baremetalhost_controller.go (outdated thread, resolved)
provResult, started, err := prov.Service(servicingData, dirty,
info.host.Status.ErrorType == metal3api.ServicingError)
if err != nil {
return actionError{fmt.Errorf("error servicing host: %w", err)}, false
Member:

Currently this is ignored as if you had returned nil.

Member:

If I understood correctly, we just need the return here

return actionError{fmt.Errorf("error servicing host: %w", err)}

Is this a correct assumption @zaneb ?

Member:

Yep, due to #1969 (comment) we no longer need to return multiple values from this function at all.

Author:

TODO(janders): should be resolved in the latest commit; make sure it stays this way after I reapply the other fix.
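With the function reduced to a single return value as agreed above, the error can no longer be dropped as if nil had been returned. A self-contained sketch with stub types (not the controller's real ones):

```go
package main

import (
	"errors"
	"fmt"
)

// result is a minimal stand-in for the controller's action result type.
type result interface{ isResult() }

type actionError struct{ err error }

func (actionError) isResult() {}

type actionUpdate struct{}

func (actionUpdate) isResult() {}

// serviceHost is a stub standing in for prov.Service.
func serviceHost(fail bool) error {
	if fail {
		return errors.New("ironic unavailable")
	}
	return nil
}

// doServicing sketches the corrected error path: the wrapped error is
// returned as an actionError instead of being ignored.
func doServicing(fail bool) result {
	if err := serviceHost(fail); err != nil {
		return actionError{fmt.Errorf("error servicing host: %w", err)}
	}
	return actionUpdate{}
}

func main() {
	fmt.Printf("%T\n", doServicing(true))
	fmt.Printf("%T\n", doServicing(false))
}
```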

}
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) {
if fwDirty {
info.host.Status.Provisioning.Firmware = info.host.Spec.Firmware.DeepCopy()
Member:

Updating as soon as we've started rather than when we've finished means that if we never actually succeed before leaving this state (e.g. by deprovisioning) we lose the signal that the update didn't actually happen.

Member:

This is how it works for the non-servicing case too. Unfortunately, with this older firmware approach, we have no way to learn from Ironic whether it was applied or not.

Author:

At minimum, document this in the code; it can be follow-up material.

dtantsur and others added 3 commits October 18, 2024 22:31
Signed-off-by: Dmitry Tantsur <[email protected]>
Servicing only runs when a host is powered off (either completely or
by rebooting it).

Signed-off-by: Dmitry Tantsur <[email protected]>
Signed-off-by: Jacob Anders <[email protected]>

Removed unused ServicingData fields.
@metal3-io-bot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 72270cf to cbe1eb9 (October 23, 2024)
@iurygregory (Member):

LGTM

Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants