Introduce a mechanism to actively trigger rescheduling
Signed-off-by: chaosi-zju <[email protected]>
chaosi-zju committed Apr 17, 2024
1 parent dca5c1a commit c57f463
docs/proposals/scheduling/reschedule-task/reschedule-task.md: 266 additions, 0 deletions

---
title: Introduce a mechanism to actively trigger rescheduling
authors:
- "@chaosi-zju"
reviewers:
- "@RainbowMango"
- "@chaunceyjiang"
- "TBD"
approvers:
- "@RainbowMango"
- "TBD"

creation-date: 2024-01-30
---

# Introduce a mechanism to actively trigger rescheduling

## Background

According to the current implementation, once the replicas of a workload are scheduled, the scheduling result stays
inert and the replicas distribution will not change.

However, in some scenarios, users would like a way to actively trigger rescheduling.

### Motivation

Assume the user has propagated workloads to member clusters and replicas were migrated because a member cluster failed.

The user expects a way to trigger rescheduling after the member cluster is restored, so that the replicas can
migrate back.

### Goals

Introduce a mechanism to actively trigger rescheduling of workload resources.

### Applicable scenario

This feature might help in a scenario where neither the `replicas` in the resource template nor the `placement` in the
policy has changed, but the user still wants to actively trigger rescheduling of replicas.

## Proposal

### Overview

This proposal introduces a mechanism for actively triggering rescheduling, which is especially useful in application
failover scenarios. It is realized by introducing a new API; when this API is called, a new field is set on the
binding so that the scheduler can perceive the need for rescheduling.

### User story

In application failover scenarios, replicas are migrated from the primary cluster to a backup cluster when the primary cluster fails.

As a user, I want to trigger the replicas to migrate back once the primary cluster is restored, so that I can:

1. restore the disaster recovery mode to ensure the reliability and stability of the cluster.
2. save the cost of the backup cluster.

### Notes/Constraints/Caveats

This ability is limited to triggering rescheduling. The scheduling result will be recalculated according to the
Placement in the current ResourceBinding, and it is not guaranteed to be exactly the same as it was before the
cluster failure.

> Note: the recalculation is based on the Placement in the current `ResourceBinding`, not the Policy. So if the
> activation preference of your Policy is `Lazy`, rescheduling is still based on the previous `ResourceBinding` even if
> the Policy has been changed since then.

## Design Details

### API change

* Introduce a new API named `Reschedule` into a new apiGroup `command.karmada.io`:

```go
//revive:disable:exported

// +genclient
// +genclient:nonNamespaced
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// Reschedule represents the desired state and status of a task which enforces a rescheduling.
type Reschedule struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    // Spec represents the specification of the desired behavior of Reschedule.
    // +required
    Spec RescheduleSpec
}

// RescheduleSpec represents the specification of the desired behavior of Reschedule.
type RescheduleSpec struct {
    // TargetRefPolicy is used to select a batch of resources managed by certain policies.
    // +optional
    TargetRefPolicy []PolicySelector

    // TargetRefResource is used to select resources.
    // +optional
    TargetRefResource []ResourceSelector
}

// PolicySelector selects the resources bound to the referenced policy.
type PolicySelector struct {
    // Namespace of the target policy.
    // Default is empty, which means inherit from the parent object scope.
    // +optional
    Namespace string

    // Name of the target policy.
    // Default is empty, which means selecting all policies.
    // +optional
    Name string
}

// ResourceSelector selects the target resources.
type ResourceSelector struct {
    // APIVersion represents the API version of the target resources.
    // +required
    APIVersion string

    // Kind represents the Kind of the target resources.
    // +required
    Kind string

    // Namespace of the target resource.
    // Default is empty, which means inherit from the parent object scope.
    // +optional
    Namespace string

    // Name of the target resource.
    // Default is empty, which means selecting all resources.
    // +optional
    Name string

    // A label query over a set of resources.
    // If name is not empty, labelSelector will be ignored.
    // +optional
    LabelSelector *metav1.LabelSelector
}

//revive:enable:exported
```

* Add two new fields, `RescheduleTriggeredAt` (in spec) and `RescheduledAt` (in status), to ResourceBinding/ClusterResourceBinding:

```go
// ResourceBindingSpec represents the expectation of ResourceBinding.
type ResourceBindingSpec struct {
    ...
    // RescheduleTriggeredAt is a timestamp representing when the referenced resource was triggered for rescheduling.
    // Only when this timestamp is later than the timestamp in status.rescheduledAt will the rescheduling actually execute.
    //
    // It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
    // It is recommended to be populated by the REST handler of the command.karmada.io/Reschedule API.
    // +optional
    RescheduleTriggeredAt metav1.Time `json:"rescheduleTriggeredAt,omitempty"`
    ...
}

// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
type ResourceBindingStatus struct {
    ...
    // RescheduledAt is a timestamp representing when the scheduler finished a rescheduling.
    // It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
    // +optional
    RescheduledAt metav1.Time `json:"rescheduledAt,omitempty"`
    ...
}
```

### Example

Assuming there is a Deployment named `nginx` and the user wants to trigger its rescheduling, they just need to apply
the following YAML:

```yaml
apiVersion: command.karmada.io/v1alpha1
kind: Reschedule
metadata:
  name: demo-command
spec:
  targetRefResource:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
      namespace: default
  targetRefPolicy:
    - name: default-pp
      namespace: default
```

Then the user will get a `reschedule.command.karmada.io/demo-command created` response, which means the task has
started (note: started, not yet finished). At the same time, the new field `spec.rescheduleTriggeredAt` in the binding
of each selected resource is set to the current timestamp.

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: nginx-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
  ...
```

Then, rescheduling is in progress. If it succeeds, the `status.rescheduledAt` field of the binding will be updated,
which indicates the scheduler has finished a rescheduling; if it fails, the scheduler will retry.

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: nginx-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
  ...
status:
  rescheduledAt: "2024-04-17T15:04:06Z"
  conditions:
    - ...
    - lastTransitionTime: "2024-03-08T08:53:03Z"
      message: Binding has been scheduled successfully.
      reason: Success
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-03-08T08:53:03Z"
      message: All works have been successfully applied
      reason: FullyAppliedSuccess
      status: "True"
      type: FullyApplied
```

Finally, once all works have been successfully applied, the user will observe changes in the actual distribution of the
resource template; the user can also see several recorded events on the resource template, such as:

```shell
$ kubectl --context karmada-apiserver describe deployment demo
...
Events:
  Type    Reason                  Age                From                                Message
  ----    ------                  ----               ----                                -------
  ...
  Normal  ScheduleBindingSucceed  31s                default-scheduler                   Binding has been scheduled successfully.
  Normal  GetDependenciesSucceed  31s                dependencies-distributor            Get dependencies([]) succeed.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo) to cluster member1
  Normal  AggregateStatusSucceed  31s (x4 over 31s)  resource-binding-status-controller  Update resourceBinding(default/demo-deployment) with AggregatedStatus successfully.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo1) to cluster member2
```

### Implementation logic

1) Add an aggregated API to karmada-aggregated-apiserver, as described in the API change above.

2) Add an aggregated-API handler to karmada-aggregated-apiserver which implements only the `Create` method. It fetches
all resources referred to directly by `targetRefResource` or indirectly by `targetRefPolicy`, and then sets the
`spec.rescheduleTriggeredAt` field of the corresponding ResourceBindings to the current timestamp (a sketch of this
step follows this list).

> This API is not backed by a stored resource: there is no object to persist, no state, no idempotency concern, and no
> `Update` or `Delete` method to implement. This is also why we do not choose a CRD-based API.

3) In the scheduling process, add a trigger condition: even if the `Placement` and `Replicas` of a binding are
unchanged, scheduling will be triggered if `spec.rescheduleTriggeredAt` is later than `status.rescheduledAt`. After
scheduling finishes, the scheduler updates `status.rescheduledAt` when writing the binding back (see the second sketch
after this list).
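
To make step 2 above more concrete, here is a minimal, hypothetical sketch of the handler's core step. The helper
`resolveBindings`, the function `triggerReschedule`, the package name, and the `RescheduleTriggeredAt` field are
illustrations of this proposal rather than existing Karmada code; only the `work.karmada.io/v1alpha2` ResourceBinding
type and the controller-runtime client are existing APIs.

```go
package reschedule

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"

    workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// resolveBindings is a hypothetical helper that expands targetRefResource and
// targetRefPolicy from the Reschedule spec into the set of affected ResourceBindings.
func resolveBindings(ctx context.Context, c client.Client, spec RescheduleSpec) ([]workv1alpha2.ResourceBinding, error) {
    // Placeholder: list ResourceBindings and filter them against the selectors.
    return nil, nil
}

// triggerReschedule shows what the Create handler would do after resolving the
// targets: stamp spec.rescheduleTriggeredAt with the current time on each
// binding so the scheduler can notice the request.
func triggerReschedule(ctx context.Context, c client.Client, reschedule *Reschedule) error {
    bindings, err := resolveBindings(ctx, c, reschedule.Spec)
    if err != nil {
        return err
    }
    now := metav1.Now()
    for i := range bindings {
        rb := &bindings[i]
        rb.Spec.RescheduleTriggeredAt = now // field proposed in the API change above
        if err := c.Update(ctx, rb); err != nil {
            return err
        }
    }
    return nil
}
```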
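
And here is a minimal sketch of the additional trigger condition from step 3, continuing in the same assumed package;
the function name `rescheduleRequested` is illustrative, and both timestamp fields are the ones proposed above, not
fields that exist in Karmada today.

```go
// rescheduleRequested reports whether an actively triggered reschedule is still
// pending: the trigger timestamp is set and is later than the last time the
// scheduler completed a rescheduling. The scheduler would evaluate this check in
// addition to the existing placement/replicas change detection, and refresh
// status.rescheduledAt after writing the new scheduling result back.
func rescheduleRequested(rb *workv1alpha2.ResourceBinding) bool {
    triggered := rb.Spec.RescheduleTriggeredAt // proposed spec field
    done := rb.Status.RescheduledAt            // proposed status field
    if triggered.IsZero() {
        return false
    }
    return done.IsZero() || triggered.After(done.Time)
}
```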
