Proposal of introducing a rebalance mechanism to actively trigger rescheduling of resource #4698

Merged: 1 commit, May 24, 2024

chaosi-zju
Member

@chaosi-zju chaosi-zju commented Mar 12, 2024

What type of PR is this?

/kind design
/kind documentation

What this PR does / why we need it:

Proposal to introduce a rebalance mechanism that actively triggers rescheduling of resources.

Assuming the user has propagated workloads to member clusters, in some scenarios the current distribution of replicas
is no longer the desired one, for example:

  • replicas were migrated due to cluster failover, and the failed cluster has since recovered.
  • replicas were migrated due to application-level failover, and each cluster now has sufficient resources to run the replicas.
  • with the Aggregated scheduling strategy, replicas were initially distributed across multiple clusters due to resource
    constraints, but one cluster is now enough to accommodate all replicas.

Therefore, users want a way to trigger rescheduling so that the replica distribution can be rebalanced.
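
To make the Aggregated scenario concrete, here is a minimal sketch in Python. It is a toy model, not Karmada's scheduler: the function `aggregated_assign`, the cluster names, and the capacity numbers are all hypothetical, and "capacity" simply means how many replicas a cluster could host. It only illustrates why recomputing the placement from scratch after resources free up can give a different, more compact result than the one made under the original constraints.

```python
from typing import Dict


def aggregated_assign(replicas: int, capacity: Dict[str, int]) -> Dict[str, int]:
    """Toy 'Aggregated'-style placement: pack replicas into as few clusters as possible.

    This is an illustrative assumption, not the real Karmada algorithm.
    """
    assignment: Dict[str, int] = {}
    remaining = replicas
    # Visit clusters with the most capacity first; break ties by name for determinism.
    for name, cap in sorted(capacity.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(cap, remaining)
        if take > 0:
            assignment[name] = take
            remaining -= take
    if remaining > 0:
        raise ValueError("not enough capacity for all replicas")
    return assignment


# Initial scheduling: no single cluster can hold all 6 replicas, so they are split.
initial = aggregated_assign(6, {"member1": 4, "member2": 3})

# Later, member1 has more room. The existing split stays in place until
# rescheduling is triggered; recomputing from scratch uses one cluster.
rebalanced = aggregated_assign(6, {"member1": 10, "member2": 3})

print(initial)      # {'member1': 4, 'member2': 2}
print(rebalanced)   # {'member1': 6}
```

In this toy model the second call plays the role of the rebalance: nothing about the workload changed, only the cluster state, so a new placement appears only when something actively asks for it.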

Which issue(s) this PR fixes:

Fixes part of #4840

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Mar 12, 2024
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 12, 2024
@codecov-commenter

codecov-commenter commented Mar 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.33%. Comparing base (5bc8c54) to head (0e1922c).
Report is 113 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4698      +/-   ##
==========================================
+ Coverage   53.12%   53.33%   +0.20%     
==========================================
  Files         251      252       +1     
  Lines       20417    20482      +65     
==========================================
+ Hits        10847    10924      +77     
+ Misses       8856     8836      -20     
- Partials      714      722       +8     
Flag Coverage Δ
unittests 53.33% <ø> (+0.20%) ⬆️

@wu0407
Contributor

wu0407 commented Mar 12, 2024

This PR mixes fault self-healing and rescheduling. I think fault self-healing includes rescheduling: when a node crashes, for example, the workload that owns the pods on that node recreates them. This is done by multiple controllers working together, including the scheduler. If the goal is self-healing, then coordination among multiple components needs to be considered. If it is only rescheduling, then only the eviction targets and the conditions for stopping eviction need to be considered. Could we consider the design concept of the community's Descheduler project?

Member

@RainbowMango RainbowMango left a comment

/assign

@chaosi-zju chaosi-zju changed the title from "Introduce a mechanism to actively trigger rescheduling" to "Proposal of introducing a rebalance mechanism to actively trigger rescheduling of resource" May 9, 2024
@chaosi-zju
Member Author

I put a lot of work into thoroughly improving this proposal. Everyone is welcome to go through it again; I look forward to your suggestions~

@chaosi-zju
Member Author

This PR mixes fault self-healing and rescheduling.

@wu0407 Hello, I have updated this proposal. Actually, this proposal is entirely about rescheduling; cluster failover is only one of its user stories. For more information, please see the latest proposal. Thank you for your comments~

@chaosi-zju chaosi-zju force-pushed the reschedule branch 2 times, most recently from 1e4b127 to e7aff2a May 22, 2024 09:51
Member

@RainbowMango RainbowMango left a comment

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 24, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2024

6 participants