Component upgrade state information #1241
Comments
My current focus is around the use of this for transceivers...I may come back with more comments on the generality of this to other component types.
Thanks for accommodating my suggestions. Is there a reason you didn't add a |
It would be good to understand what the expected API is for initiating an image install. (In OC, we model config and state together. Agreed that gnoi would be the RPC path to invoke the install action.) The gnoi.OS service https://github.com/openconfig/gnoi/blob/main/os/os.proto#L40-L45 currently requires that any hardware component firmware installs are embedded within a device OS image. If we intend to use gnoi to invoke a component image install, we should propose an update to this text and a suitable RPC. I'd also recommend this be renamed to "install" instead of "upgrade". Upgrade implies an incremented version, but it's possible that the version could be decremented. Changing the term from "upgrade" to "install" makes this state data generic to whether it's an increment or decrement of firmware version.
@proberts2022 I think adding the units should be the best practice. I'm not sure it's done consistently, but I think we should. I've added that now.
@dplore I agree, 'install' is better and the PR is now updated with that change. Updating that gNOI text makes sense and I can do that. I was thinking I'd tackle the gNOI work in an upcoming PR as we might want more discussion on that and we can then decouple some PRs, but please let me know.
A few questions for better understanding some corner cases:
Thoughts?
I was under the impression that the proposed YANG extensions would apply during a SW upgrade to provide better visibility into the installation progress across the system. Perhaps I have misinterpreted the intention. The OIF proposal describes such a case, where a specific Host-SW-release comes with a bouquet of FW-images. A specific FW-image out of the bouquet is installed into the component once a matching HW is found. This would be a host pre-load of FW images and leaves it up to the implementation whether dual memory banks are used or not. Such a procedure appears quite pragmatic to me, and additional visibility into where the FW upgrade currently stands can be quite useful in case things 'hang'. In contrast, it would be a little touchy to modify a FW package underneath a SW release without properly updating the SW release information. In case of surprises, debugging and RCA can become quite challenging. Operationally, it is also hard to ensure that a specific FW image is not used when a particular combination of other factors (Host HW, SW, specific RI,...) applies, in particular if this FW image is already used in other combinations. Is it really worth going down this primrose path?
Yep, the intention for this specific PR is to provide visibility during the installation process. I agree there is complexity in decoupling these FW images with host SW images :) However, I know Google (and from my understanding also some other hyperscalers) feels pretty strongly that this is needed...and is already starting to happen. There are at least a couple reasons for this.
It seems this need has existed for a while. Is there something to learn from the present mode of operation? How is the FW update performed today while the transponder function is still in another chassis? IOW, how does the FW installation work when you are receiving custom FW images from transceiver vendors that are specific for Google, Meta, MSFT? ... figured out that oif2024.339.06 provides some insight into the state of the art and is a good starting point going forward.
If you plan to introduce a new RPC/API interface for this type of upgrades, do you still need all of the proposed yang exposed via gnmi? For example, if you are going to create a gRPC service, the below leafs seem to belong to the new service (explicitly or implicitly):
where the upgrade status can be streamed to the client which initiated the action.
In the case of the transponder function being in a separate chassis, when it comes to transceivers, we haven't actually faced this problem. In that scenario we are only using gray optics, for which optic firmware upgrades aren't much of an issue, because the piece that is iterating more quickly (and requiring upgrades) is the DWDM technology. When the transponder function is in a separate chassis, the DWDM technology lies within the cards of that chassis. This allows for a simpler scenario because the vendor of the OS and of the card firmware is guaranteed to be the same and everything is nicely tested/bundled together.
I view these as two separate things: one is an 'action' to initiate something on the device (i.e. start the install), whereas the other is a 'read' operation. Installs can take a really long time and involve restarts. Because of that, it isn't practical to maintain a long-running RPC that may last many tens of minutes (maybe multiple hours for an OS install), especially with restarts, where the RPC will have to terminate. We could have a gNOI RPC that returns these details and could be called ad hoc, but in that case it feels like a read-only RPC that replicates gNMI read-only functionality.
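To make the action/read split concrete, here is a minimal Python sketch of a client that kicks off an install elsewhere and then polls modeled state until it reaches a terminal status, tolerating failed reads while the device restarts. Everything named here is an assumption for illustration: `wait_for_install`, the `read_state` callable (standing in for whatever gNMI Get or Subscribe wrapper a client would use), the dictionary keys, and the status strings are hypothetical and not taken from any existing gNMI or gNOI library.

```python
import time
from typing import Callable, Dict

# Hypothetical terminal values for the proposed `status` leaf.
TERMINAL_STATUSES = {"complete", "fail"}


def wait_for_install(
    read_state: Callable[[], Dict[str, object]],
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Poll install state until it is terminal or the timeout budget is spent.

    `read_state` is an assumed stand-in for a gNMI read of the install
    state container; it may raise ConnectionError while the device restarts.
    """
    waited = 0.0
    while True:
        try:
            state = read_state()
            if state.get("status") in TERMINAL_STATUSES:
                return state
        except ConnectionError:
            # The device may be restarting mid-install; keep polling.
            pass
        if waited >= timeout_s:
            raise TimeoutError(f"install not finished after {waited:.0f}s")
        sleep(poll_interval_s)
        waited += poll_interval_s
```

The point of the sketch is only that no RPC stays open across the install: each read is short-lived, so a restart costs the client one failed poll rather than the whole session.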
Does a transceiver FW upgrade take hours? I think it is somewhat idiomatic to couple actions and related state in the same service, esp. when state is directly associated with the ongoing action (such as progress reporting). Doesn't have to be a single RPC.
Hours is probably unlikely, but I've been told that due to the sometimes very limited bus speed on some transceivers the transfer can take up to an hour in some cases. But it sounds like many are in the ~10 min range.
I agree there are some slightly blurry lines here regarding what belongs in a gNOI RPC vs a gNMI read (via modeled data). Generally (may not be perfect) I think we should aspire towards keeping gNOI as temporal actions that are performed on a device and gNMI as the place to read state data. @robshakir @dplore I would be curious about your thoughts as well.
I'd argue that OC yang/gNMI is a place for globally significant state data (unless you are using yang actions instead of gNOI; but that's a yang1.1 feature), and perhaps not all data in this scenario has this significance. If there's a system that is responsible for performing FW upgrades on transceivers, then it is probably interested in monitoring the progress and status of this operation. Do other systems care about the % of completeness of the upgrade? I also don't think that this model extends well to other component types (hence I left a suggestion to limit this to transceivers in the PR), which is also a contributing factor.
I'd like to propose adding upgrade state information for components. This will allow details regarding upgrades to be exposed to users as an upgrade process can take many minutes to hours. This would be read-only information.
This could support components such as transceivers (e.g. ZR transceivers) or the device operating system.
These could be placed under a new grouping such as:
openconfig/platform/component/state/upgrade/
Proposed leaves would be:
new-firmware-version: string, the new firmware package version.
new-firmware-version-service-impacting: bool, true if the firmware would require a service impacting upgrade to take effect.
status: identityref (sw loading, in-progress, fail, complete)
step: string, describing the step the device is attempting to do.
step-percent-complete: oc-types:percentage, the percent of the step that is completed.
total-percent-complete: oc-types:percentage, the percent of the total upgrade that is completed.
start-time: timeticks64, the time when the upgrade was initiated.
duration: yang:counter64, the time elapsed since the upgrade was initiated.
stop-time: timeticks64, the time when the upgrade stopped.
last-known-failure: string, describing the last known failure encountered.
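As a rough illustration of how a client might hold these leaves once read from the device, here is a minimal Python sketch. The class names, the enum value strings, and the choice to derive the duration from start and stop times are assumptions made for the example; none of it comes from a published OpenConfig model. The 0..100 range check mirrors oc-types:percentage, and the timestamps are assumed to be nanoseconds since the Unix epoch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InstallStatus(Enum):
    # Hypothetical identity names based on the values listed for `status`.
    SW_LOADING = "sw-loading"
    IN_PROGRESS = "in-progress"
    FAIL = "fail"
    COMPLETE = "complete"


@dataclass
class UpgradeState:
    """Client-side view of the proposed read-only upgrade leaves."""

    new_firmware_version: str
    new_firmware_version_service_impacting: bool
    status: InstallStatus
    step: str = ""
    step_percent_complete: int = 0
    total_percent_complete: int = 0
    start_time: int = 0  # assumed: nanoseconds since the Unix epoch
    stop_time: Optional[int] = None  # unset while the upgrade is running
    last_known_failure: str = ""

    def __post_init__(self) -> None:
        # Reject percentages outside the 0..100 range.
        for name in ("step_percent_complete", "total_percent_complete"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be in 0..100, got {value}")

    def duration(self, now: int) -> int:
        """Elapsed time since start, frozen at stop_time once it is set."""
        end = self.stop_time if self.stop_time is not None else now
        return end - self.start_time
```

For example, `UpgradeState("1.2.3", True, InstallStatus.IN_PROGRESS, start_time=1_000).duration(now=4_000)` evaluates to `3000`, and constructing an instance with `total_percent_complete=101` raises `ValueError`.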
@robshakir @ahsaanyousaf