-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Add reference to HostUpdatePolicy in Servicing. #1969
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
) | ||
} | ||
|
||
// TODO: Add service steps for firmware updates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO needs to be resolved, although you can do it in the next change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I will push a PR for this using this one as base.
0d8b518
to
509027a
Compare
a327bf3
to
d532cf4
Compare
f165c0b
to
7eda476
Compare
206eb33
to
939aace
Compare
939aace
to
fbc9f58
Compare
0f11a97
to
7f4b773
Compare
7f4b773
to
f7d8145
Compare
87cb67e
to
95aa70b
Compare
LGTM, thanks for working on it @rhjanders |
95aa70b
to
e97612c
Compare
if clearErrorWithStatus(info.host, metal3api.OperationalStatusOK) { | ||
// We need to give the HostFirmwareSettings controller some time to | ||
// catch up with the changes, otherwise we risk starting the same | ||
// operation again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is adequate. We should avoid race conditions by actually preventing them, not just putting in sleeps until they don't seem to happen any more. In a distributed system, delays can be arbitrarily long.
Specifically, once we are successfully done, I don't think we should try again until the next reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we need to rethink the whole architecture of having separate controllers for HFS, HFC, DataImage and so on (in fact, a similar race is reported for DataImages now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative, I guess, it to pass the HFS object all the way here and update its status directly. It's not great either (responsibility violation). But I guess better then just rescheduling and hoping for better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion consensus: use generation number to decide if we start servicing or not. Error handling can be followed from manual cleaning / preparing state etc. Update in-code comment so that it can be addressed in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding a FIXME note for a followup change.
return actionUpdate{} | ||
} | ||
|
||
desiredPowerOnState := info.host.Spec.Online | ||
|
||
provState := info.host.Status.Provisioning.State | ||
// Normal reboots only work in provisioned states, changing online is also possible for available hosts. | ||
isProvisioned := provState == metal3api.StateProvisioned || provState == metal3api.StateExternallyProvisioned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that we're doing state-dependent stuff outside of the state machine here. Maybe we could pass this in as an argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ grep metal3api.State controllers/metal3.io/baremetalhost_controller.go | wc -l
9
Also, this is a copy of an existing line (see below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consensus: follow up material potentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding a FIXME note for this also
if info.host.Status.OperationalStatus == metal3api.OperationalStatusServicing { | ||
targetOperationalStatus = metal3api.OperationalStatusServicing | ||
} | ||
clearErrorWithStatus(info.host, targetOperationalStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearing the error here is only the right thing to do if it is a power state error.
As written, this will have the effect of clearing ServicingErrors before the servicing code has a chance to see them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully addressed together with #1969 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
janders TODO: reapply the patch
result = recordActionFailure(info, metal3api.ServicingError, provResult.ErrorMessage) | ||
return result, true | ||
} | ||
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only putting the host into the servicing status after it has confirmed to have started. I think it would be better to put it into this status as soon as we know we want it, so that if any errors occur the status at the time reflects it. This is the approach we take with the provisioning states.
If writing this fails then I think we end up in a race with the power coming back on after a reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this, we cannot check for info.host.Status.ErrorType
on line 1402.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, we can probably cache restartOnFailure in a local variable before we clear the error. That will do, I guess?
provResult, started, err := prov.Service(servicingData, dirty, | ||
info.host.Status.ErrorType == metal3api.ServicingError) | ||
if err != nil { | ||
return actionError{fmt.Errorf("error servicing host: %w", err)}, false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is ignored as if you had returned nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood correctly, we just need the return here
return actionError{fmt.Errorf("error servicing host: %w", err)}
Is this a correct assumption @zaneb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, due to #1969 (comment) we no longer need to return multiple values from this function at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO(janders): should be resolved in latest commit, make sure it stays this way after I reapply the other fix
} | ||
if started && clearErrorWithStatus(info.host, metal3api.OperationalStatusServicing) { | ||
if fwDirty { | ||
info.host.Status.Provisioning.Firmware = info.host.Spec.Firmware.DeepCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating as soon as we've started rather than when we've finished means that if we never actually succeed before leaving this state (e.g. by deprovisioning) we lose the signal that the update didn't actually happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how it works for non-servicing case too. Unfortunately, for this older firmware approach, we have no way to learn from Ironic if it was applied or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at minimum, document at code, can be a follow-up material.
e97612c
to
b3a7d84
Compare
Signed-off-by: Dmitry Tantsur <[email protected]>
Servicing only runs when a host is powered off (either completely or by rebooting it). Signed-off-by: Dmitry Tantsur <[email protected]>
Signed-off-by: Jacob Anders <[email protected]> Removed unused ServicingData fields.
b3a7d84
to
da1c7ed
Compare
…ageHostPower. Signed-off-by: Jacob Anders <[email protected]>
da1c7ed
to
5dfc536
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
72270cf
to
cbe1eb9
Compare
Signed-off-by: Jacob Anders <[email protected]>
cbe1eb9
to
324e111
Compare
LGTM |
What this PR does / why we need it:
This PR enables BMO to run Ironic servicing operations (such as applying firmware settings changes - or in the future firmware updates to already provisioned nodes). Servicing is an opt-in feature and is controlled by creation of a HostUpdatePolicy for a node with attributes indicating the desire to make changes to firmware configuration onReboot.
This is a partial implementation of https://github.com/metal3-io/metal3-docs/blob/main/design/baremetal-operator/host-live-updates.md (please note only firmware settings changes are currently supported, firmware update support will be added next).