diff --git a/docs/proposals/scheduling/workload-rebalancer/workload-rebalancer.md b/docs/proposals/scheduling/workload-rebalancer/workload-rebalancer.md
new file mode 100644
index 000000000000..96f2584ed21c
--- /dev/null
+++ b/docs/proposals/scheduling/workload-rebalancer/workload-rebalancer.md
@@ -0,0 +1,456 @@
+---
+title: Introduce a rebalance mechanism to actively trigger rescheduling of resources.
+authors:
+  - "@chaosi-zju"
+reviewers:
+  - "@RainbowMango"
+  - "@chaunceyjiang"
+  - "TBD"
+approvers:
+  - "@RainbowMango"
+  - "TBD"
+
+creation-date: 2024-01-30
+---
+
+# Introduce a mechanism to actively trigger rescheduling
+
+## Background
+
+With the current Karmada scheduler, once the replicas of a workload are scheduled, the scheduling result is kept inert
+and the replica distribution will not change. Even if rescheduling is triggered by modifying replicas or placement,
+the scheduler maintains the existing replica distribution as closely as possible, only making minimal adjustments when
+necessary, which minimizes disruptions and preserves the balance across clusters.
+
+However, in some scenarios, users want an approach to actively trigger a fresh rescheduling, which disregards the
+previous assignment entirely and seeks to establish an entirely new replica distribution across clusters.
+
+### Motivation
+
+Assuming the user has propagated the workloads to member clusters, in some scenarios the current replica distribution
+is not the most desirable, for example:
+
+* replicas were migrated due to cluster failover, but the cluster has now recovered.
+* replicas were migrated due to application-level failover, but each cluster now has sufficient resources to run the replicas.
+* with the `Aggregated` schedule strategy, replicas were initially distributed across multiple clusters due to resource
+  constraints, but now one cluster is enough to accommodate all replicas.
+
+Therefore, the user wants an approach to trigger rescheduling so that the replica distribution can be rebalanced.
+
+### Goals
+
+Introduce a rebalance mechanism to actively trigger rescheduling of resources.
+
+## Proposal
+
+* **Introduce a configurable field into resource binding; when it changes, the scheduler will perform a `Fresh` mode
+  rescheduling.**
+
+> In contrast to the existing assignment mode of rescheduling, e.g., triggered by modification of replicas or
+> placement, which maintains the existing replica distribution as closely as possible, this rescheduling
+> disregards the previous assignment entirely and seeks to establish an entirely new replica distribution across clusters.
+>
+> We call the former assignment mode `Steady` mode and the latter `Fresh` mode.
+
+* **Introduce a new API, by which users can actively adjust workload balance.**
+
+> Since directly manipulating bindings is not a recommended or user-friendly way, it is better to design a new API
+> specifically for adjusting workload balance. Currently, it is mainly intended for the rescheduling scenario.
+> In the future, it may be extended to more workload rebalance scenarios, such as migration, rollback and so on,
+> with different assignment modes and rolling modes specified.
+
+### User story
+
+#### Story 1
+
+In a cluster failover scenario, replicas are distributed across two clusters, member1 and member2; however, they would
+all migrate to member2 if member1 fails.
+
+As a cluster administrator, I hope the replicas are redistributed to both clusters when member1 recovers, so that
+the resources of the member1 cluster are reutilized, and for the sake of high availability.
+
+#### Story 2
+
+In application-level failover, low-priority applications may be preempted, shrinking from multiple clusters
+to a single cluster because cluster resources are in short supply
+(refer to [Application-level Failover](https://karmada.io/docs/next/userguide/failover/application-failover#why-application-level-failover-is-required)).
+
+As a user, I hope the replicas of low-priority applications can be redistributed to multiple clusters when
+cluster resources are sufficient, to ensure the high availability of the application.
+
+#### Story 3
+
+With the `Aggregated` schedule strategy, replicas may still be distributed across multiple clusters due to resource constraints.
+
+As a user, I hope the replicas are redistributed in an aggregated manner when any single cluster has
+sufficient resources to accommodate all replicas, so that the application better meets actual business requirements.
+
+#### Story 4
+
+In a disaster-recovery scenario, replicas migrate from the primary cluster to the backup cluster when the primary cluster fails.
+
+As a cluster administrator, I hope that replicas can migrate back when the primary cluster is restored, so that:
+
+1. the disaster-recovery mode is restored, to ensure the reliability and stability of the cluster federation.
+2. the cost of the backup cluster is saved.
+
+### Notes/Constraints/Caveats
+
+This ability is limited to triggering a workload rebalance; the schedule result will be recalculated according to the
+`Placement` in the current ResourceBinding. That means:
+
+* Take [story 1](#story-1) as an example: rescheduling happens when the cluster recovers, but the new schedule result is not
+guaranteed to be exactly the same as before the cluster failure; it is only guaranteed that the new schedule result meets
+the current `Placement`.
+
+* Rebalancing is based on the `Placement` in the current ResourceBinding, not the PropagationPolicy. So if the activation
+preference of your PropagationPolicy is `Lazy`, the rescheduling is still based on the previous ResourceBinding even if the
+current Policy has been changed.
+
+## Design Details
+
+### API change
+
+* As for *Introduce a configurable field into resource binding*, the detailed description is as follows:
+
+```go
+// ResourceBindingSpec represents the expectation of ResourceBinding.
+type ResourceBindingSpec struct {
+	...
+	// RescheduleTriggeredAt is a timestamp representing when the referenced resource was triggered for rescheduling.
+	// When this field is updated, it means a rescheduling is manually triggered by the user, and the expected behavior
+	// of this action is to do a complete recalculation without referring to the last scheduling results.
+	// It works with the status.lastScheduledTime field: only when this timestamp is later than the timestamp in
+	// status.lastScheduledTime will the rescheduling actually execute; otherwise, it is ignored.
+	//
+	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
+	// +optional
+	RescheduleTriggeredAt *metav1.Time `json:"rescheduleTriggeredAt,omitempty"`
+	...
+}
+
+// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
+type ResourceBindingStatus struct {
+	...
+	// LastScheduledTime represents the latest timestamp when the scheduler successfully finished a scheduling.
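+	// It works with the spec.rescheduleTriggeredAt field: a rescheduling is actually executed only when
+	// spec.rescheduleTriggeredAt is later than this timestamp.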
+	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
+	// +optional
+	LastScheduledTime *metav1.Time `json:"lastScheduledTime,omitempty"`
+	...
+}
+```
+
+* As for *Introduce a new API, by which users can actively adjust workload balance*, we define a new API
+  named `WorkloadRebalancer` in a new apiGroup `apps.karmada.io/v1alpha1`:
+
+```go
+// +genclient
+// +genclient:nonNamespaced
+// +kubebuilder:resource:path=workloadrebalancers,scope="Cluster"
+// +kubebuilder:subresource:status
+// +kubebuilder:storageversion
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

+// WorkloadRebalancer represents the desired behavior and status of a job which enforces a resource rebalance.
+type WorkloadRebalancer struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	// Spec represents the specification of the desired behavior of WorkloadRebalancer.
+	// +required
+	Spec WorkloadRebalancerSpec `json:"spec"`
+
+	// Status represents the status of WorkloadRebalancer.
+	// +optional
+	Status WorkloadRebalancerStatus `json:"status,omitempty"`
+}
+
+// WorkloadRebalancerSpec represents the specification of the desired behavior of WorkloadRebalancer.
+type WorkloadRebalancerSpec struct {
+	// Workloads is used to specify the list of expected resources.
+	// Nil or an empty list is not allowed.
+	// +kubebuilder:validation:MinItems=1
+	// +required
+	Workloads []ObjectReference `json:"workloads"`
+
+	// TTLMinutesAfterFinished limits the lifetime of a WorkloadRebalancer that has finished execution (meaning each
+	// target workload is finished with a result of Successful or Failed).
+	// If this field is set, the WorkloadRebalancer becomes eligible to be automatically deleted
+	// ttlMinutesAfterFinished minutes after it finishes.
+	// If this field is unset, the WorkloadRebalancer won't be automatically deleted.
+	// If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
+	// +optional
+	TTLMinutesAfterFinished *int32 `json:"ttlMinutesAfterFinished,omitempty"`
+}
+
+// ObjectReference represents the expected resource.
+type ObjectReference struct {
+	// APIVersion represents the API version of the target resource.
+	// +required
+	APIVersion string `json:"apiVersion"`
+
+	// Kind represents the Kind of the target resource.
+	// +required
+	Kind string `json:"kind"`
+
+	// Name of the target resource.
+	// +required
+	Name string `json:"name"`
+
+	// Namespace of the target resource.
+	// Default is empty, which means the resource is cluster-scoped.
+	// +optional
+	Namespace string `json:"namespace,omitempty"`
+}
+
+// WorkloadRebalancerStatus contains information about the current status of a WorkloadRebalancer,
+// updated periodically by the schedule trigger controller.
+type WorkloadRebalancerStatus struct {
+	// ObservedWorkloads contains information about the execution states and messages of target resources.
+	// +optional
+	ObservedWorkloads []ObservedWorkload `json:"observedWorkloads,omitempty"`
+
+	// ObservedGeneration is the generation (.metadata.generation) observed by the controller.
+	// If ObservedGeneration is less than the generation in metadata, the controller hasn't confirmed
+	// the rebalance result or hasn't performed the rebalance yet.
+	// +optional
+	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
+
+	// LastUpdateTime represents the last update time of any field in WorkloadRebalancerStatus other than itself.
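+	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.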
+	// +optional
+	LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
+}
+
+// ObservedWorkload represents the observed resource.
+type ObservedWorkload struct {
+	// Workload is the observed resource.
+	// +required
+	Workload ObjectReference `json:"workload"`
+
+	// Result is the observed rebalance result of the resource.
+	// +optional
+	Result RebalanceResult `json:"result,omitempty"`
+
+	// Reason represents a machine-readable description of why the rebalance of this resource failed.
+	// +optional
+	Reason RebalanceFailedReason `json:"reason,omitempty"`
+}
+
+// RebalanceResult represents the extent to which the resource has been rebalanced.
+type RebalanceResult string
+
+const (
+	// RebalanceFailed means the rebalance of the resource failed.
+	RebalanceFailed RebalanceResult = "Failed"
+	// RebalanceSuccessful means the resource has been successfully rebalanced.
+	RebalanceSuccessful RebalanceResult = "Successful"
+)
+
+// RebalanceFailedReason represents a machine-readable description of why the rebalance of a resource failed.
+type RebalanceFailedReason string
+
+const (
+	// RebalanceObjectNotFound means the binding referenced by the resource was not found.
+	RebalanceObjectNotFound RebalanceFailedReason = "ReferencedBindingNotFound"
+)
+```
+
+### Interpretation of Realization by an Example
+
+#### Step 1: apply the WorkloadRebalancer resource YAML.
+
+Assume there are two Deployments named `demo-deploy-1` and `demo-deploy-2`, and a ClusterRole named `demo-role`.
+To trigger their rescheduling, the user just needs to apply the following YAML:
+
+```yaml
+apiVersion: apps.karmada.io/v1alpha1
+kind: WorkloadRebalancer
+metadata:
+  name: demo
+spec:
+  workloads:
+  - apiVersion: apps/v1
+    kind: Deployment
+    name: demo-deploy-1
+    namespace: default
+  - apiVersion: rbac.authorization.k8s.io/v1
+    kind: ClusterRole
+    name: demo-role
+  - apiVersion: apps/v1
+    kind: Deployment
+    name: demo-deploy-2
+    namespace: default
+```
+
+> Notes on the `workloads` field:
+> 1. the `name` sub-field is required;
+> 2. the `namespace` sub-field is required for a namespace-scoped resource, and must be empty for a cluster-scoped
+>    resource.
+
+This API specifies a batch of resources which need a rescheduling. The user will get a
+`workloadrebalancer.apps.karmada.io/demo created` result, which means the resource was created successfully.
+
+#### Step 2: the controller watches the new API resource and does the rescheduling work.
+
+The controller then triggers the rescheduling of each resource by writing the `CreationTimestamp` of the WorkloadRebalancer
+to the new field `spec.rescheduleTriggeredAt` of each ResourceBinding. Take `deployment/demo-deploy-1` as an example;
+you will see its ResourceBinding modified to:
+
+```yaml
+apiVersion: work.karmada.io/v1alpha2
+kind: ResourceBinding
+metadata:
+  name: demo-deploy-1-deployment
+  namespace: default
+spec:
+  rescheduleTriggeredAt: "2024-04-17T15:04:05Z" # this field is updated to the CreationTimestamp of the WorkloadRebalancer
+  ...
+status:
+  lastScheduledTime: "2024-04-17T15:00:05Z"
+```
+
+Since the `rescheduleTriggeredAt` field has been updated and is later than `lastScheduledTime`, rescheduling is triggered.
+If it succeeds, the `lastScheduledTime` field is updated again, which indicates the scheduler finished a rescheduling
+(on failure, the scheduler retries), as follows:
+
+```yaml
+apiVersion: work.karmada.io/v1alpha2
+kind: ResourceBinding
+metadata:
+  name: demo-deploy-1-deployment
+  namespace: default
+spec:
+  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
+  ...
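+  # spec is unchanged; after the rescheduling succeeds, the scheduler advances
+  # status.lastScheduledTime below to match rescheduleTriggeredAt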
+status:
+  lastScheduledTime: "2024-04-17T15:04:05Z"
+  conditions:
+  - ...
+  - lastTransitionTime: "2024-04-17T15:00:05Z"
+    message: Binding has been scheduled successfully.
+    reason: Success
+    status: "True"
+    type: Scheduled
+  - lastTransitionTime: "2024-04-17T15:04:05Z"
+    message: All works have been successfully applied
+    reason: FullyAppliedSuccess
+    status: "True"
+    type: FullyApplied
+```
+
+Finally, after all works have been successfully applied, the user will observe changes in the actual distribution of the
+resource template; the user can also see several recorded events on the resource template, like:
+
+```shell
+$ kubectl --context karmada-apiserver describe deployment demo-deploy-1
+...
+Events:
+  Type    Reason                  Age                From                                Message
+  ----    ------                  ----               ----                                -------
+  ...
+  Normal  ScheduleBindingSucceed  31s                default-scheduler                   Binding has been scheduled successfully. Result: {member2:2, member1:1}
+  Normal  GetDependenciesSucceed  31s                dependencies-distributor            Get dependencies([]) succeed.
+  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo-deploy-1) to cluster member1
+  Normal  AggregateStatusSucceed  31s (x4 over 31s)  resource-binding-status-controller  Update resourceBinding(default/demo-deploy-1-deployment) with AggregatedStatus successfully.
+  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo-deploy-1) to cluster member2
+```
+
+#### Step 3: check the status of the WorkloadRebalancer.
+
+The user can observe the rebalance result in `status.observedWorkloads` of `workloadrebalancer/demo`, like:
+
+```yaml
+apiVersion: apps.karmada.io/v1alpha1
+kind: WorkloadRebalancer
+metadata:
+  creationTimestamp: "2024-04-17T15:04:05Z"
+  name: demo
+spec:
+  workloads:
+  - apiVersion: apps/v1
+    kind: Deployment
+    name: demo-deploy-1
+    namespace: default
+  - apiVersion: rbac.authorization.k8s.io/v1
+    kind: ClusterRole
+    name: demo-role
+  - apiVersion: apps/v1
+    kind: Deployment
+    name: demo-deploy-2
+    namespace: default
+status:
+  observedWorkloads:
+  - result: Successful
+    workload:
+      apiVersion: apps/v1
+      kind: Deployment
+      name: demo-deploy-1
+      namespace: default
+  - reason: ReferencedBindingNotFound
+    result: Failed
+    workload:
+      apiVersion: apps/v1
+      kind: Deployment
+      name: demo-deploy-2
+      namespace: default
+  - result: Successful
+    workload:
+      apiVersion: rbac.authorization.k8s.io/v1
+      kind: ClusterRole
+      name: demo-role
+```
+
+> Notes:
+> 1. `observedWorkloads` is sorted in increasing dictionary order of the combined string `apiVersion/kind/namespace/name`.
+> 2. if the binding referenced by a workload is not found, the workload is marked as `Failed` without retry.
+> 3. if a workload rebalance fails due to an occasional network error, the controller retries, and the `result` and `reason`
+> fields are left empty until it succeeds.
+
+### How to update this resource
+
+When the `spec` field of a WorkloadRebalancer is updated, the controller refreshes the workload list in
+`status.observedWorkloads` as follows (see the sketch after this list):
+
+* a new workload is added to the spec list: add it to the status list too and do the rebalance.
+* a workload is deleted from the previous spec list: keep it in the status list if it already succeeded, and remove it if not.
+* a workload is modified: regard it as deleting the old one and inserting a new one.
+* the modification only adjusts the list order: no additional action is needed, since `observedWorkloads` is kept in
+increasing dictionary order.
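+
+A minimal sketch of this refresh logic, using the API types defined above; the helper names `syncObservedWorkloads`
+and `sortKey` are hypothetical, and the real controller implementation may differ:
+
+```go
+import "sort"
+
+// syncObservedWorkloads rebuilds status.observedWorkloads from an updated spec:
+// entries removed from spec are kept only if they already succeeded, newly added
+// entries get an empty result until their rebalance finishes, and the final list
+// is sorted so that a pure reordering of spec.workloads changes nothing.
+func syncObservedWorkloads(spec []ObjectReference, observed []ObservedWorkload) []ObservedWorkload {
+	inSpec := make(map[ObjectReference]struct{}, len(spec))
+	for _, w := range spec {
+		inSpec[w] = struct{}{}
+	}
+
+	var result []ObservedWorkload
+	kept := make(map[ObjectReference]struct{}, len(observed))
+	for _, o := range observed {
+		// Keep entries still present in spec; keep removed (or modified-away)
+		// entries only if they already finished successfully.
+		if _, ok := inSpec[o.Workload]; ok || o.Result == RebalanceSuccessful {
+			result = append(result, o)
+			kept[o.Workload] = struct{}{}
+		}
+	}
+	// Workloads newly added to spec enter the status list with an empty result.
+	for _, w := range spec {
+		if _, ok := kept[w]; !ok {
+			result = append(result, ObservedWorkload{Workload: w})
+		}
+	}
+
+	// Keep increasing dictionary order of apiVersion/kind/namespace/name.
+	sort.Slice(result, func(i, j int) bool {
+		return sortKey(result[i].Workload) < sortKey(result[j].Workload)
+	})
+	return result
+}
+
+func sortKey(w ObjectReference) string {
+	return w.APIVersion + "/" + w.Kind + "/" + w.Namespace + "/" + w.Name
+}
+```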
+
+### How to auto clean resource
+
+Refer to [Automatic Cleanup for Finished Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/).
+
+We introduce the field `ttlMinutesAfterFinished`, which limits the lifetime of a WorkloadRebalancer that has finished execution
+(finished execution means each target workload is finished with a result of `Successful` or `Failed`).
+
+* If this field is set, the WorkloadRebalancer becomes eligible to be automatically deleted `ttlMinutesAfterFinished` minutes after it finishes.
+* If this field is unset, the WorkloadRebalancer won't be automatically deleted.
+* If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
+
+Considering several corner cases:
+
+* case 1: if a new target workload is added to the `WorkloadRebalancer` before `ttlMinutesAfterFinished` expires,
+  the finish time of the `WorkloadRebalancer` is refreshed, so the `delete` action is deferred since the expiry time is refreshed too.
+* case 2: if `ttlMinutesAfterFinished` is modified before it expires,
+  the `delete` action should be performed according to the latest `ttlMinutesAfterFinished`.
+* case 3: after we have fetched and checked the latest `WorkloadRebalancer` object and try to delete it,
+  if the `WorkloadRebalancer` is modified right between those two points in time, the pending `delete` action should be interrupted.
+
+Several key implementation points (a sketch follows at the end of this document):
+* A `WorkloadRebalancer` is judged as finished only when it meets two requirements:
+  * all expected workloads are finished with a result of `Successful` or `Failed`.
+  * a new field named `ObservedGeneration` is introduced to the `Status` of WorkloadRebalancer, and it must be equal to
+    `.metadata.generation`, to guard against the case where the WorkloadRebalancer has been updated but the controller
+    hasn't yet refreshed its `Status`.
+* When a `WorkloadRebalancer` is `Created` or `Updated`, add it to the workqueue, calculate its expiry time, and
+  call the `workqueue.AddAfter()` function to re-enqueue it if it hasn't expired yet.
+* Before deleting the `WorkloadRebalancer`, do a final sanity check. Use the latest `WorkloadRebalancer` fetched directly
+  from the API server, rather than the object from the lister cache, to verify that the TTL has truly expired.
+* When deleting the `WorkloadRebalancer`, confirm that the `resourceVersion` of the object being deleted is as expected,
+  to guard against corner case 3 above.
+
+### How to prevent application from being out-of-service
+
+As for the disaster-recovery scenario mentioned in [story 4](#story-4) above, after the primary cluster recovers and
+rescheduling has been triggered, if the new replicas in the primary cluster become ready later than the old replicas are
+removed from the backup cluster, there may be no ready replicas in the cluster federation and the application will be
+out of service. So, how to prevent the application from being out of service?
+
+This will be discussed and implemented separately in another proposal.
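+
+As a sketch of the final sanity check and precondition delete described in
+[How to auto clean resource](#how-to-auto-clean-resource), assuming controller-runtime clients; the function names
+`cleanupIfExpired` and `ttlExpired` are hypothetical, and the API import path is illustrative:
+
+```go
+import (
+	"context"
+	"time"
+
+	"sigs.k8s.io/controller-runtime/pkg/client"
+
+	appsv1alpha1 "github.com/karmada-io/karmada/pkg/apis/apps/v1alpha1" // assumed import path
+)
+
+// cleanupIfExpired re-reads the WorkloadRebalancer directly from the API server
+// (not the lister cache) and deletes it only if the TTL has truly expired and the
+// object has not been modified in between (corner case 3).
+func cleanupIfExpired(ctx context.Context, apiReader client.Reader, c client.Client, name string) error {
+	fresh := &appsv1alpha1.WorkloadRebalancer{}
+	if err := apiReader.Get(ctx, client.ObjectKey{Name: name}, fresh); err != nil {
+		return client.IgnoreNotFound(err)
+	}
+	if !ttlExpired(fresh, time.Now()) {
+		return nil // not expired yet; re-enqueued later via workqueue.AddAfter()
+	}
+	// Delete with a resourceVersion precondition: if the object changed between
+	// the check above and this call, the API server rejects the request and the
+	// controller re-evaluates on the next reconcile.
+	rv := fresh.ResourceVersion
+	return c.Delete(ctx, fresh, client.Preconditions{ResourceVersion: &rv})
+}
+
+// ttlExpired assumes status.lastUpdateTime approximates when the rebalancer
+// finished; a full implementation would also verify observedGeneration and
+// that every observed workload has a Successful or Failed result.
+func ttlExpired(r *appsv1alpha1.WorkloadRebalancer, now time.Time) bool {
+	if r.Spec.TTLMinutesAfterFinished == nil || r.Status.LastUpdateTime == nil {
+		return false
+	}
+	ttl := time.Duration(*r.Spec.TTLMinutesAfterFinished) * time.Minute
+	return now.After(r.Status.LastUpdateTime.Time.Add(ttl))
+}
+```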