diff --git a/docs/proposals/scheduling/reschedule-task/reschedule-task.md b/docs/proposals/scheduling/reschedule-task/reschedule-task.md
new file mode 100644
index 000000000000..3e70fe7fd895
--- /dev/null
+++ b/docs/proposals/scheduling/reschedule-task/reschedule-task.md
@@ -0,0 +1,266 @@
---
title: Introduce a mechanism to actively trigger rescheduling
authors:
  - "@chaosi-zju"
reviewers:
  - "@RainbowMango"
  - "@chaunceyjiang"
  - "TBD"
approvers:
  - "@RainbowMango"
  - "TBD"

creation-date: 2024-01-30
---

# Introduce a mechanism to actively trigger rescheduling

## Background

In the current implementation, once the replicas of a workload are scheduled, the scheduling result stays inert and
the replicas distribution will not change on its own.

However, in some scenarios, users want a means to actively trigger rescheduling.

### Motivation

Assume a user has propagated workloads to member clusters, and replicas were migrated away because a member cluster
failed.

The user expects an approach to trigger rescheduling after the member cluster is restored, so that the replicas can
migrate back.

### Goals

Introduce a mechanism to actively trigger rescheduling of workload resources.

### Applicable scenario

This feature helps in scenarios where neither the `replicas` in the resource template nor the `placement` in the
policy has changed, but the user still wants to actively trigger rescheduling of replicas.

## Proposal

### Overview

This proposal aims to introduce a mechanism for actively triggering rescheduling, which is particularly valuable in
application failover scenarios. It is realized by introducing a new API: when the API is called, a new field is set
on the affected bindings so that the scheduler can perceive the need for rescheduling.

### User story

In application failover scenarios, replicas are migrated from the primary cluster to a backup cluster when the
primary cluster fails.

As a user, I want to trigger the replicas to migrate back once the primary cluster is restored, so that I can:

1. restore the disaster recovery setup, ensuring the reliability and stability of the clusters.
2. save the cost of the backup cluster.

### Notes/Constraints/Caveats

This ability is limited to triggering rescheduling. The scheduling result is recalculated according to the Placement
in the current ResourceBinding, and it is not guaranteed to be exactly the same as before the cluster failure.

> Note: the recalculation is based on the Placement in the current `ResourceBinding`, not the Policy. So if the
> activation preference of the Policy is `Lazy`, rescheduling is still based on the previous `ResourceBinding` even
> if the Policy has been changed in the meantime.

## Design Details

### API change

* Introduce a new API named `Reschedule` into a new apiGroup `command.karmada.io`:

```go
//revive:disable:exported

// +genclient
// +genclient:nonNamespaced
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// Reschedule represents the desired state and status of a task which enforces a rescheduling.
type Reschedule struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Spec represents the specification of the desired behavior of Reschedule.
	// +required
	Spec RescheduleSpec
}

// RescheduleSpec represents the specification of the desired behavior of Reschedule.
type RescheduleSpec struct {
	// TargetRefPolicy is used to select batches of resources managed by certain policies.
	// +optional
	TargetRefPolicy []PolicySelector

	// TargetRefResource is used to select resources directly.
	// +optional
	TargetRefResource []ResourceSelector
}

// PolicySelector selects policies; the resources bound to the selected policies will be rescheduled.
type PolicySelector struct {
	// Namespace of the target policy.
	// Default is empty, which means inherit from the parent object scope.
	// +optional
	Namespace string

	// Name of the target policy.
	// Default is empty, which means selecting all policies.
	// +optional
	Name string
}

// ResourceSelector selects the resources to be rescheduled.
type ResourceSelector struct {
	// APIVersion represents the API version of the target resources.
	// +required
	APIVersion string

	// Kind represents the Kind of the target resources.
	// +required
	Kind string

	// Namespace of the target resource.
	// Default is empty, which means inherit from the parent object scope.
	// +optional
	Namespace string

	// Name of the target resource.
	// Default is empty, which means selecting all resources.
	// +optional
	Name string

	// A label query over a set of resources.
	// If name is not empty, labelSelector will be ignored.
	// +optional
	LabelSelector *metav1.LabelSelector
}

//revive:enable:exported
```
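
To make the selector semantics concrete (an empty `Name` selects everything in scope, and a non-empty `Name` takes
precedence over `LabelSelector`), here is a minimal, hypothetical sketch of how a handler might match one resource
against a `ResourceSelector`. It assumes the `ResourceSelector` type above is in scope; the helper name
`matchesResource` is invented for illustration and is not part of the proposal:

```go
package command

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/labels"
)

// matchesResource reports whether obj is selected by sel, following the field
// comments above: APIVersion/Kind must match exactly, an empty Namespace or
// Name matches everything, and Name takes precedence over LabelSelector.
func matchesResource(sel ResourceSelector, obj *unstructured.Unstructured) (bool, error) {
	if sel.APIVersion != obj.GetAPIVersion() || sel.Kind != obj.GetKind() {
		return false, nil
	}
	if sel.Namespace != "" && sel.Namespace != obj.GetNamespace() {
		return false, nil
	}
	if sel.Name != "" {
		// If name is not empty, labelSelector is ignored.
		return sel.Name == obj.GetName(), nil
	}
	if sel.LabelSelector == nil {
		// Neither name nor label selector given: select all resources of this kind.
		return true, nil
	}
	selector, err := metav1.LabelSelectorAsSelector(sel.LabelSelector)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(obj.GetLabels())), nil
}
```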

* Add two new fields, `spec.rescheduleTriggeredAt` and `status.rescheduledAt`, to ResourceBinding/ClusterResourceBinding:

```go
// ResourceBindingSpec represents the expectation of ResourceBinding.
type ResourceBindingSpec struct {
	...
	// RescheduleTriggeredAt is a timestamp representing when rescheduling of the referenced
	// resource was triggered. Rescheduling actually executes only when this timestamp is
	// later than the timestamp in status.rescheduledAt.
	//
	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
	// It is recommended to be populated by the REST handler of the command.karmada.io/Reschedule API.
	// +optional
	RescheduleTriggeredAt metav1.Time `json:"rescheduleTriggeredAt,omitempty"`
	...
}

// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
type ResourceBindingStatus struct {
	...
	// RescheduledAt is a timestamp representing when the scheduler finished a rescheduling.
	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
	// +optional
	RescheduledAt metav1.Time `json:"rescheduledAt,omitempty"`
	...
}
```
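
The interplay of the two timestamps can be expressed as a single predicate. The following is a minimal sketch based
on the proposed fields above; `needsReschedule` is a hypothetical helper name, not an existing Karmada function:

```go
package command

import (
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// needsReschedule reports whether an actively triggered rescheduling is still
// pending: the trigger timestamp in spec must be set and strictly later than
// the time the scheduler last finished a rescheduling, recorded in status.
func needsReschedule(rb *workv1alpha2.ResourceBinding) bool {
	triggered := rb.Spec.RescheduleTriggeredAt // proposed field, not yet in workv1alpha2
	finished := rb.Status.RescheduledAt        // proposed field, not yet in workv1alpha2
	return !triggered.IsZero() && triggered.After(finished.Time)
}
```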

### Example

Assuming there is a Deployment named `nginx` whose rescheduling the user wants to trigger, they just need to apply
the following YAML:

```yaml
apiVersion: command.karmada.io/v1alpha1
kind: Reschedule
metadata:
  name: demo-command
spec:
  targetRefResource:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
      namespace: default
  targetRefPolicy:
    - name: default-pp
      namespace: default
```

The user will then get a `reschedule.command.karmada.io/demo-command created` response, which means the task has
started, but not necessarily finished. Simultaneously, the new field `spec.rescheduleTriggeredAt` in the binding of
each selected resource is set to the current timestamp.

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: nginx-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
  ...
```

Rescheduling is then in progress. If it succeeds, the `status.rescheduledAt` field of the binding is updated, which
indicates that the scheduler finished a rescheduling; if it fails, the scheduler will retry.

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: nginx-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
  ...
status:
  rescheduledAt: "2024-04-17T15:04:06Z"
  conditions:
    - ...
    - lastTransitionTime: "2024-03-08T08:53:03Z"
      message: Binding has been scheduled successfully.
      reason: Success
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-03-08T08:53:03Z"
      message: All works have been successfully applied
      reason: FullyAppliedSuccess
      status: "True"
      type: FullyApplied
```

Finally, once all works have been successfully applied, the user will observe changes in the actual distribution of
the resource template; the user can also see several recorded events on the resource template, like:

```shell
$ kubectl --context karmada-apiserver describe deployment demo
...
Events:
  Type    Reason                  Age                From                                Message
  ----    ------                  ----               ----                                -------
  ...
  Normal  ScheduleBindingSucceed  31s                default-scheduler                   Binding has been scheduled successfully.
  Normal  GetDependenciesSucceed  31s                dependencies-distributor            Get dependencies([]) succeed.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo) to cluster member1
  Normal  AggregateStatusSucceed  31s (x4 over 31s)  resource-binding-status-controller  Update resourceBinding(default/demo-deployment) with AggregatedStatus successfully.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo1) to cluster member2
```

### Implementation logic

1) Add an aggregated API to karmada-aggregated-apiserver, as detailed above.

2) Add an aggregated-API handler to karmada-aggregated-apiserver which only implements the `Create` method. It
fetches all resources referred to directly by `targetRefResource` or indirectly by `targetRefPolicy`, and then sets
the `spec.rescheduleTriggeredAt` field of the corresponding ResourceBindings to the current timestamp (a sketch of
this handler follows this list).

> This API is a non-resource, command-style API: it is not persisted, keeps no state, has no idempotency
> requirements, and implements no `Update` or `Delete` method. This is also why we did not choose a CRD-based API.

3) In the scheduling process, add a trigger condition: even if the `Placement` and `Replicas` of a binding are
unchanged, scheduling is triggered when `spec.rescheduleTriggeredAt` is later than `status.rescheduledAt` (the
timestamp comparison sketched after the API change section). After scheduling finishes, the scheduler updates
`status.rescheduledAt` when writing the binding back.
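
As a rough illustration of step 2, here is a minimal, hypothetical sketch of the handler's `Create` logic, assuming
a controller-runtime style client. `handleCreate` and `resolveBindings` are invented names, the actual resolution of
`targetRefResource`/`targetRefPolicy` to bindings is elided, and `RescheduleTriggeredAt` is the proposed field above:

```go
package command

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// resolveBindings maps a Reschedule spec to the affected ResourceBindings,
// either directly via targetRefResource or indirectly via targetRefPolicy.
// The actual lookup is elided in this sketch.
func resolveBindings(ctx context.Context, c client.Client, spec RescheduleSpec) ([]workv1alpha2.ResourceBinding, error) {
	// ... list ResourceBindings and filter them with the selectors in spec ...
	return nil, nil
}

// handleCreate stamps every matched binding with the current UTC time; the
// scheduler later compares this value against status.rescheduledAt to decide
// whether a rescheduling still needs to run.
func handleCreate(ctx context.Context, c client.Client, spec RescheduleSpec) error {
	bindings, err := resolveBindings(ctx, c, spec)
	if err != nil {
		return err
	}
	now := metav1.NewTime(time.Now().UTC())
	for i := range bindings {
		bindings[i].Spec.RescheduleTriggeredAt = now // proposed field
		if err := c.Update(ctx, &bindings[i]); err != nil {
			return err
		}
	}
	return nil
}
```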