Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design-proposal: improved handling of the VM FW ID #347

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
174 changes: 174 additions & 0 deletions design-proposals/persist-vmi-firmware-uuid.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
# Overview
This proposal introduces a mechanism to make the firmware UUID of a Virtual Machine in KubeVirt universally unique.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it clear that this is fixing existing behavior (bug fix). Which brings a question why have we opted for design proposal but I understand...

It ensures the UUID remains consistent across VM restarts, preserving stability and reliability.
This change addresses a bug in the current behavior, where the UUID is derived from the VM name,
leading to potential collisions and inconsistencies for workloads relying on a truly unique and stable UUID.

## Motivation
* Duplicate UUIDs cause burden with third-party automation frameworks such as Gitops tools, making it difficult to maintain unique VM UUID.
Such tools have to manually assign unique UUIDs to VMs, thus also manage them externally.
* The non-persistent nature of UUIDs may trigger redundant cloud-init setup upon VM restore.
* The non-persistent nature of UUIDs may trigger redundant license activation events during VM lifetime i.e. Windows key activation re-occur when power off and power on the VM.

## Goals
* Improve the uniqueness of the VMI firmware UUID to become independent of the VM name and namespace.
* Maintain the UUID persistence over the VM lifecycle.
* Maintain backward compatibility

## Non Goals
* Maintain UUID persistence with VM backup/restore.
* Change UUID of currently existing VMs

## Definition of Users
VM owners: who require consistent firmware UUIDs for their applications.
dasionov marked this conversation as resolved.
Show resolved Hide resolved
cluster-admin: to ensure VMs have universally unique firmware IDs

## User Stories

### Supported Cases

#### 1. Creating and Starting a New VM
**As a user**,
I want to create and start a new VM in KubeVirt,
**so that** it gets a unique, persistent firmware UUID for consistency across reboots, migrations, and clusters.

- **Scenario**: Creating VM `X` in namespace `Y` on Cluster `A` and `B` results in different UUIDs.
- **Expected**: UUIDs are unique universally to avoid conflicts.

---

#### 2. Starting a VM Created by an Older KubeVirt Version
**As a user**,
I want to start a VM from an older KubeVirt version,
**so that** it is assigned a persistent UUID if one wasn’t set before.

- **Scenario**: A VM without a UUID starts on an upgraded KubeVirt version.
- **Expected**: The old UUID is persisted in the VM spec to ensure consistency.


## Repos
Kubevirt/kubevirt


## Proposed Solution
dasionov marked this conversation as resolved.
Show resolved Hide resolved

### Description
The firmware UUID persistence mechanism will operate as follows:

1. **New VMs**:
- If the firmware UUID is not explicitly defined in `vm.spec.template.spec.firmware.uuid`, the mutator webhook will automatically set the firmware UUID on create operation.
- We can use what k8s uses to generate uuids via `"github.com/google/uuid"` package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about standalone VMIs? I assume uid will be used as well? The lack of uid in CREATE may be a bigger issue here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you elaborate please what do you mean the lack of uid in CREATE ?
in the design when a VM is being created we make sure the firmware uuid is populated by the mutator web-hook if the user didn't provide his own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Webhooks have an operation associated to them (CREATE, UPDATE, DELETE, or CONNECT). The CREATE webhook is called before metadata.uid is set. That field is set when the object is persisted to etcd (after webhooks are called). So your scheme of assigning firmware id to metadata.uid cannot work for CREATE.

You will have metadata.uid for other operations but if UPDATE is called for a VM with no fw id, how do you know if it is a "new" VM (gets metadata uid) or old VM (hash of name)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mhenriks! That's a good point.
One possible resolution to this is to handle the firmware UUID assignment from the controller instead of the webhooks.

FYI @vladikr

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarifying this point, @mhenriks. I now see your concern more clearly. This limitation indeed renders the webhook logic redundant, as it cannot reliably distinguish between new and existing VMs during the create operation due to the absence of the vm.meta.uid at this stage.

Implementing additional logic in the webhook to determine whether a VM is new feels out of scope for a mutation webhook's responsibilities. Instead, this kind of logic would be better suited within the controller, where we can manage such scenarios more effectively.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if we move the logic to the controller, and make sure we establish a way to distinguish between new VMs and existing VMs, we won't need to interfere with the restore controller (to try to patch the old legacy calculation for the firmware uuid, the controller will do that for us since the mutation web-hook is out the the scope).
it's 2 birds with one stone.

the only con is that the establishment of new VMs vs existing VMs in the controller might be a little bit complex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @mhenriks thank you for steering this in the right direction.
@xpivarc FYI

I'm worried about the controller-based solution since it will add latency when setting this field. There could be another controller that makes assumptions about this field and it may potentially be in a race with the VM controller. I prefer not to update the VM spec fields via a controller.

Perhaps we could just set a random UID?
Alternatively, we could fold back to the idea of storing the UID in the VM status.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think assigning a new random UUID is the best we can do for new VMs (CREATE mutating webhook would work). I'm still unclear how how existing VMs can best be be assigned the old style FW id

2. **Existing VMs**:
- For VMs created before the upgrade, the VM controller will patch the `vm.spec.template.spec.firmware.uuid` field,
with a value calculated using the legacy logic (based on the VM name).
This ensures backward compatibility.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have an annotation that stores the legacy uuid and then patch the old VMs with the new uuid on upgrade? What are the blockers in terms of backward-compatibility?


3. **Mitigation**:
- Backups created before the implementation of this mechanism will result in VMs receiving a new UUID upon restore, as the firmware UUID field will be absent in older backups.
- Users must be aware of this limitation and plan accordingly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we hint here what they should be doing?
It would be nice if we can have a script that adds the buggy firmware UUID on existing snapshots.


### Workflow
1. **Mutator Webhook**:
- During the creation of a new VM, if the `vm.spec.template.spec.firmware.uuid` field is not set, the webhook will set it to a random uuid based on the package `"github.com/google/uuid"`.

2. **Controller Logic**:
- For existing VMs:
- If the firmware UUID field is absent, the controller will calculate the UUID using the legacy logic and patch the VM template spec.
- This ensures backward compatibility without disrupting existing workflows.

3. **Backup and Restore**:
- Backups created before this mechanism will not include the firmware UUID, leading to new UUIDs being generated upon restore.
- Users must be aware of this limitation and plan accordingly.

---

## Previously Proposed Solutions

Here are a few alternatives that were previously considered:

### 1. Create a New Firmware UUID Field Under Status
- **Main Idea**: The UUID would be generated during the VM's first boot and saved to a dedicated firmware UUID field in the status.
- **Pros**:
- Avoids abusing the spec.
- Keeps the spec clean.
- **Cons**:
- Adds a new API field, which is difficult to remove and could increase load, hurt performance, and make the API less compact.

---

### 2. Introduce a Breaking Change
- **Main Idea**: The UUID would be generated during the VM's first boot but based on both the name and namespace, resulting in a breaking change. Users would be warned over several versions, and tooling would be provided to add the current firmware UUID to the spec to avoid breaking workloads.
- **Pros**:
- Avoids abusing the spec and keeps it clean.
- Simple to implement.
- **Cons**:
- Requires user intervention to avoid breaking existing workloads.

---

### 3. Upgrade-Specific Persistence of Firmware UUID
- **Main Idea**: Just before KubeVirt is upgraded, persist `Firmware.UUID` of existing VMs in `vm.spec`. From that point on, any VM without `Firmware.UUID` is considered new. For new VMs, use `vm.metadata.uid` if `vm.spec.template.spec.Firmware.UUID` is not defined.
- **Pros**:
- The upgrade-specific code can be removed after two or three releases.
- Simple logic for handling UUIDs post-upgrade.
- Limited disruption to GitOps, affecting only pre-existing (and somewhat buggy) VMs.
- **Cons**:
- Requires additional upgrade-specific code (possibly in `virt-operator`) to handle the persistence.

---

### 4. Dual UUID Logic with Upgrade Annotation
- **Main Idea**: During an upgrade, KubeVirt would use two UUID generation logics:
1. Fresh clusters would use the non-buggy `vm.metadata.uid`.
2. Upgraded clusters would have the KubeVirt CR annotated with `kubevirt.io/use-buggy-uuid`. Clusters with this annotation would continue using the buggy UUID logic.
- **Pros**:
- Prevents the bug from spreading to new clusters.
- Buys time to think of a more robust solution.
- **Cons**:
- Does not solve the problem for existing clusters.
- Adds complexity by requiring dual logic for UUID generation.

dasionov marked this conversation as resolved.
Show resolved Hide resolved

## Scalability
The proposed changes have no anticipated impact on scalability capabilities of the KubeVirt framework

## Update/Rollback Compatibility
Backups created before implementing the persistent firmware UUID mechanism will not include the firmware UUID in the VM's spec.
As a result, restoring such backups will generate a new UUID for the VM.
This change may lead to compatibility issues for workloads or systems that rely on consistent UUIDs, such as licensing servers or configuration management systems.
Users are advised to take this into consideration and plan backup and restore operations accordingly.
Comment on lines +136 to +140
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can tweak and restore controller's code such that if a backup did not consist a firmware UUID in the spec, the controller would know to set up the UUID in the old way. This will ensure that old backups are compatible.

FYI @akalenyu @ShellyKa13

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to trace the restore code to this point:
and it seems like this could be a potential area to manipulate the restored VM.
@akalenyu @ShellyKa13, could you please provide an estimate or insight on whether this approach is feasible? Your input would be greatly appreciated!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion with @mhenriks, it appears that the approach to tweak the missing uuid in snapshots is a bad practice, snapshots should be immutable once created, the snapshot can be base for restoring more then one VM which will result all those vms to have same firmware uuid (assuming those VMs are restored to different namespaces so their names wont collide).


## Testing

### Existing Tests
- Verify that a newly created VMI has a unique firmware UUID assigned.
- Ensure the UUID persists across VMI restarts.

### Scenarios Needing Coverage

1. **Old VM Patching**:
Validate that old VMs without a firmware UUID receive one calculated using the legacy logic.

2. **New VM Creation**:
Verify that the firmware UUID is set to via `"github.com/google/uuid"` when not explicitly defined.


# Implementation Phases

1. **Alpha Phase**
- Persist firmware UUID in the `VM` template spec field.
- Implement logic for:
- **New VMs**: Auto-generate and store random UUID if not explicitly provided.
- **Existing VMs**: Patch the `vm.spec.template.spec.firmware.uuid` field using the legacy calculation.
- Cover implementation with unit tests.

2. **Beta Phase**
- Ensure seamless upgrade compatibility for pre-existing VMs.
- Communicate changes through release notes and detailed documentation.

3. **GA Phase**
- in two versions we can assume that all clusters are already upgraded and the old code can be dropped.

4. **Feature Gate**
- **No Feature Gate Protection**: The behavior will be enabled by default upon implementation.