
✨ Add reference to HostUpdatePolicy in Servicing. #1969

Open · wants to merge 5 commits into main from servicing-hostupdatepolicy

Conversation

@rhjanders (Member) commented Sep 20, 2024

What this PR does / why we need it:

This PR enables BMO to run Ironic servicing operations on already provisioned nodes, such as applying firmware settings changes (and, in the future, firmware updates). Servicing is an opt-in feature: it is enabled by creating a HostUpdatePolicy for a node with attributes indicating that firmware configuration changes should be applied onReboot.

This is a partial implementation of https://github.com/metal3-io/metal3-docs/blob/main/design/baremetal-operator/host-live-updates.md (note that only firmware settings changes are currently supported; firmware update support will be added next).
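For context, the opt-in described above could be expressed roughly as follows. This is a simplified Go sketch based on the host-live-updates design proposal; the type, field, and constant names are illustrative stand-ins, not necessarily the final API:

```go
package main

import "fmt"

// HostUpdatePolicyType selects when an update may be applied. The value
// "onReboot" follows the host-live-updates design proposal.
type HostUpdatePolicyType string

const (
	// HostUpdatePolicyOnReboot applies changes on the next reboot of a
	// provisioned host (the only mode this PR implements).
	HostUpdatePolicyOnReboot HostUpdatePolicyType = "onReboot"
)

// HostUpdatePolicySpec is a simplified sketch of the opt-in policy object;
// only firmware settings changes are supported by this PR.
type HostUpdatePolicySpec struct {
	FirmwareSettings HostUpdatePolicyType
	FirmwareUpdates  HostUpdatePolicyType
}

// servicingAllowed reports whether firmware settings servicing is opted in:
// absent policy means no servicing, making the feature strictly opt-in.
func servicingAllowed(policy *HostUpdatePolicySpec) bool {
	return policy != nil && policy.FirmwareSettings == HostUpdatePolicyOnReboot
}

func main() {
	policy := &HostUpdatePolicySpec{FirmwareSettings: HostUpdatePolicyOnReboot}
	fmt.Println("servicing allowed:", servicingAllowed(policy))
	fmt.Println("servicing allowed:", servicingAllowed(nil))
}
```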

@metal3-io-bot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@metal3-io-bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Sep 20, 2024
@metal3-io-bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Sep 20, 2024
)
}

// TODO: Add service steps for firmware updates
Member:

TODO needs to be resolved, although you can do it in the next change.

Member:

Yeah, I will push a PR for this using this one as base.

@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 0d8b518 to 509027a (September 24, 2024)
@rhjanders changed the title from "Add reference to HostUpdatePolicy in Servicing." to "✨ Add reference to HostUpdatePolicy in Servicing." (October 15, 2024)
@rhjanders marked this pull request as ready for review (October 15, 2024)
@metal3-io-bot removed the do-not-merge/work-in-progress label (October 15, 2024)
@rhjanders force-pushed the servicing-hostupdatepolicy branch 3 times, most recently from 0f11a97 to 7f4b773 (October 17, 2024)
@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 87cb67e to 95aa70b (October 17, 2024)
@iurygregory (Member):

LGTM, thanks for working on it @rhjanders

if clearErrorWithStatus(info.host, metal3api.OperationalStatusOK) {
// We need to give the HostFirmwareSettings controller some time to
// catch up with the changes, otherwise we risk starting the same
// operation again.
Member:

I don't think this is adequate. We should avoid race conditions by actually preventing them, not just putting in sleeps until they don't seem to happen any more. In a distributed system, delays can be arbitrarily long.

Specifically, once we are successfully done, I don't think we should try again until the next reboot.

Member:

Then we need to rethink the whole architecture of having separate controllers for HFS, HFC, DataImage and so on (in fact, a similar race is reported for DataImages now).

Member:

The alternative, I guess, is to pass the HFS object all the way here and update its status directly. That is not great either (a responsibility violation), but I guess it is better than just rescheduling and hoping for the best.

@rhjanders (Author) commented Oct 23, 2024:

Discussion consensus: use the generation number to decide whether to start servicing. Error handling can follow the pattern of the manual cleaning / preparing states. Update the in-code comment so that this can be addressed in a follow-up.

Author:

Adding a FIXME note for a follow-up change.
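The generation-based check agreed in this thread could look roughly like the sketch below. The type and field names (`Generation`, `ObservedGeneration`) are illustrative stand-ins for the HFS status, not the PR's actual code:

```go
package main

import "fmt"

// hostFirmwareSettingsStatus is a stand-in for the relevant parts of the
// HostFirmwareSettings (HFS) status; field names here are illustrative.
type hostFirmwareSettingsStatus struct {
	Generation         int64 // metadata.generation of the HFS object
	ObservedGeneration int64 // generation the HFS controller last processed
}

// shouldStartServicing returns true only when the HFS controller has caught
// up with the latest spec. This avoids the race described above: re-running
// an operation that already succeeded but whose status update has not landed
// yet, without relying on sleeps of arbitrary length.
func shouldStartServicing(hfs hostFirmwareSettingsStatus, changesPending bool) bool {
	if hfs.ObservedGeneration < hfs.Generation {
		// The controller has not processed the latest change yet: wait.
		return false
	}
	return changesPending
}

func main() {
	stale := hostFirmwareSettingsStatus{Generation: 3, ObservedGeneration: 2}
	fresh := hostFirmwareSettingsStatus{Generation: 3, ObservedGeneration: 3}
	fmt.Println(shouldStartServicing(stale, true))
	fmt.Println(shouldStartServicing(fresh, true))
}
```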

return actionUpdate{}
}

desiredPowerOnState := info.host.Spec.Online

provState := info.host.Status.Provisioning.State
// Normal reboots only work in provisioned states, changing online is also possible for available hosts.
isProvisioned := provState == metal3api.StateProvisioned || provState == metal3api.StateExternallyProvisioned
Member:

I don't like that we're doing state-dependent stuff outside of the state machine here. Maybe we could pass this in as an argument?

Member:

$ grep metal3api.State controllers/metal3.io/baremetalhost_controller.go | wc -l
9

Also, this is a copy of an existing line (see below).

Author:

Consensus: potentially follow-up material.

Author:

Adding a FIXME note for this as well.
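The suggestion to pass the state-dependent information in as an argument could be sketched as below. The plain string states and the `manageHostPower` signature are hypothetical stand-ins for the metal3api constants and the real power-management code:

```go
package main

import "fmt"

// Illustrative stand-ins for the metal3api provisioning state constants.
const (
	stateProvisioned           = "provisioned"
	stateExternallyProvisioned = "externally provisioned"
	stateAvailable             = "available"
)

// isProvisioned mirrors the duplicated check from the snippet above. The
// idea from the review is that the state machine computes this once and
// passes it down, instead of the power code inspecting the state itself.
func isProvisioned(provState string) bool {
	return provState == stateProvisioned || provState == stateExternallyProvisioned
}

// manageHostPower shows the suggested shape: normal reboots are only
// honoured for provisioned hosts, while available hosts can still have
// their online state changed.
func manageHostPower(desiredPowerOn, provisioned bool) string {
	switch {
	case provisioned:
		return "handle reboots and power state"
	case desiredPowerOn:
		return "power on only"
	default:
		return "power off only"
	}
}

func main() {
	fmt.Println(manageHostPower(true, isProvisioned(stateAvailable)))
	fmt.Println(manageHostPower(true, isProvisioned(stateProvisioned)))
}
```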

pkg/provisioner/ironic/servicing.go (outdated thread, resolved)
if info.host.Status.OperationalStatus == metal3api.OperationalStatusServicing {
targetOperationalStatus = metal3api.OperationalStatusServicing
}
clearErrorWithStatus(info.host, targetOperationalStatus)
Member:

Clearing the error here is only the right thing to do if it is a power state error.
As written, this will have the effect of clearing ServicingErrors before the servicing code has a chance to see them.

Author:

Hopefully addressed together with #1969 (comment).

Author:

TODO(janders): reapply the patch.

result = recordActionFailure(info, metal3api.ServicingError, provResult.ErrorMessage)
return result, true
}
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) {
Member:

This only puts the host into the servicing status after servicing is confirmed to have started. I think it would be better to put it into this status as soon as we know we want it, so that if any errors occur, the status at the time reflects them. This is the approach we take with the provisioning states.

If writing this fails then I think we end up in a race with the power coming back on after a reboot.

Member:

If we do this, we cannot check for info.host.Status.ErrorType on line 1402.

Member:

Okay, we can probably cache restartOnFailure in a local variable before we clear the error. That will do, I guess?
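The caching idea from this exchange could be sketched as follows; the types and names are simplified stand-ins, not the controller's real ones:

```go
package main

import "fmt"

// hostStatus is a simplified stand-in for the BareMetalHost status fields
// involved in this discussion.
type hostStatus struct {
	ErrorType         string
	OperationalStatus string
}

const servicingError = "servicing error"

// startServicing sketches the suggestion above: cache whether we are
// restarting after a failure *before* clearing the error, so the
// information survives the status reset that enters the servicing status
// up front.
func startServicing(status *hostStatus) (restartOnFailure bool) {
	restartOnFailure = status.ErrorType == servicingError
	// Clear the error and enter the servicing status immediately, so a
	// failure at any later point is reflected against the right status.
	status.ErrorType = ""
	status.OperationalStatus = "servicing"
	return restartOnFailure
}

func main() {
	st := &hostStatus{ErrorType: servicingError}
	fmt.Println("restart on failure:", startServicing(st))
	fmt.Println("status now:", st.OperationalStatus)
}
```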

controllers/metal3.io/baremetalhost_controller.go (outdated thread, resolved)
controllers/metal3.io/baremetalhost_controller.go (outdated thread, resolved)
provResult, started, err := prov.Service(servicingData, dirty,
info.host.Status.ErrorType == metal3api.ServicingError)
if err != nil {
return actionError{fmt.Errorf("error servicing host: %w", err)}, false
Member:

Currently this is ignored as if you had returned nil.

Member:

If I understood correctly, we just need the return here

return actionError{fmt.Errorf("error servicing host: %w", err)}

Is this a correct assumption @zaneb ?

Member:

Yep, due to #1969 (comment) we no longer need to return multiple values from this function at all.

Author:

TODO(janders): should be resolved in the latest commit; make sure it stays this way after I reapply the other fix.
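With the function reduced to a single return value as agreed above, the error can no longer be dropped as if nil had been returned. A self-contained sketch with stub types (not the controller's real ones):

```go
package main

import (
	"errors"
	"fmt"
)

// result is a minimal stand-in for the controller's action result type.
type result interface{ isResult() }

type actionError struct{ err error }

func (actionError) isResult() {}

type actionUpdate struct{}

func (actionUpdate) isResult() {}

// serviceHost is a stub standing in for prov.Service.
func serviceHost(fail bool) error {
	if fail {
		return errors.New("ironic unavailable")
	}
	return nil
}

// doServicing sketches the corrected error path: the wrapped error is
// returned as an actionError instead of being ignored.
func doServicing(fail bool) result {
	if err := serviceHost(fail); err != nil {
		return actionError{fmt.Errorf("error servicing host: %w", err)}
	}
	return actionUpdate{}
}

func main() {
	fmt.Printf("%T\n", doServicing(true))
	fmt.Printf("%T\n", doServicing(false))
}
```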

}
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) {
if fwDirty {
info.host.Status.Provisioning.Firmware = info.host.Spec.Firmware.DeepCopy()
Member:

Updating as soon as we've started rather than when we've finished means that if we never actually succeed before leaving this state (e.g. by deprovisioning) we lose the signal that the update didn't actually happen.

Member:

This is how it works for the non-servicing case too. Unfortunately, with this older firmware approach, we have no way to learn from Ironic whether it was applied or not.

Author:

At minimum, document this in the code; it can be follow-up material.

dtantsur and others added 3 commits October 18, 2024 22:31
Signed-off-by: Dmitry Tantsur <[email protected]>
Servicing only runs when a host is powered off (either completely or
by rebooting it).

Signed-off-by: Dmitry Tantsur <[email protected]>
Signed-off-by: Jacob Anders <[email protected]>

Removed unused ServicingData fields.
@metal3-io-bot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhjanders force-pushed the servicing-hostupdatepolicy branch 2 times, most recently from 72270cf to cbe1eb9 (October 23, 2024)
@iurygregory (Member):

LGTM

Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants