simulator: record events on prod cluster and replay them on a fake cluster any time #395

Open
saza-ku opened this issue Nov 25, 2024 · 6 comments · May be fixed by #403
Labels
area/simulator Issues or PRs related to the simulator. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@saza-ku
Contributor

saza-ku commented Nov 25, 2024

/kind feature


This issue proposes a new feature to record events on a prod cluster and replay them on a fake cluster at any time.

Background

Debugging customized schedulers is hard, partly because some bugs occur only on the prod cluster. Such bugs are hard to reproduce on a fake cluster because the fake cluster does not have the same load as the prod cluster.

We have the syncer feature, which makes it easy to simulate a real load on a fake cluster. However, for debugging in particular, it would be more helpful to save the series of events on the prod cluster that caused the issue and replay them on a fake cluster at any time.

Goals

Users can run a process that watches events on the prod cluster and saves them in some way (e.g. a JSON file). Then they can run the simulator and replay the events on a fake cluster.

User Stories

Story 1

An organization has implemented their own scheduler plugins. The plugins cause an issue only on the prod cluster. They want to reproduce the issue on a fake cluster.

Solution

They can record the events that cause the issue on the prod cluster. Then they can replay the events on a fake cluster to reproduce the issue.

Story 2

They have implemented a new plugin. They want to test and evaluate it with a real load before deploying it to the prod cluster.

Solution

They can record the events on the prod cluster and save them. When they implement a new plugin, they can use the recorded events to test and evaluate the plugin.

Note

This might be a fairly large feature, so please let me know if we need a KEP.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 25, 2024
@sanposhiho
Member

+1, thanks for the proposal!

> This might be a fairly large feature, so please let me know if we need a KEP.

No need. Please just go ahead :)

@sanposhiho
Member

/area simulator

@k8s-ci-robot k8s-ci-robot added the area/simulator Issues or PRs related to the simulator. label Nov 26, 2024
@saza-ku
Contributor Author

saza-ku commented Dec 5, 2024

Thanks:)

When replaying the events, the logic for applying resources will be the same as in syncer, so I'm going to make that logic reusable by moving it into a new package.

But oneshotimporter also has code for applying resources. So how about modularizing the code in oneshotimporter and syncer before implementing this feature?

@sanposhiho
Member

Sounds good. I think the resource-applying logic can live in its own package, and oneshot/syncer/replayer can all use it.
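To make the shared-package idea concrete, here is a minimal sketch of one applier that a replayer (and, in the same way, a syncer or one-shot importer) could call. Everything in it is hypothetical: the `Applier` interface, the `Event` type, `Replay`, and `LogApplier` are illustration-only names, not the API of the simulator's resourceapplier package.

```go
package main

import "fmt"

// Applier is a hypothetical interface for the shared resource-applying logic.
type Applier interface {
	Create(kind, name string) error
	Update(kind, name string) error
	Delete(kind, name string) error
}

// Event is a hypothetical, simplified recorded event.
type Event struct {
	Type string // "ADDED", "MODIFIED", or "DELETED"
	Kind string
	Name string
}

// Replay applies events in recorded order and stops at the first failure.
func Replay(a Applier, events []Event) error {
	for i, e := range events {
		var err error
		switch e.Type {
		case "ADDED":
			err = a.Create(e.Kind, e.Name)
		case "MODIFIED":
			err = a.Update(e.Kind, e.Name)
		case "DELETED":
			err = a.Delete(e.Kind, e.Name)
		default:
			err = fmt.Errorf("unknown event type %q", e.Type)
		}
		if err != nil {
			return fmt.Errorf("event %d: %w", i, err)
		}
	}
	return nil
}

// LogApplier is a stand-in Applier that only records the calls it receives.
// A real implementation would talk to the fake cluster's API server instead.
type LogApplier struct{ Calls []string }

func (l *LogApplier) Create(kind, name string) error {
	l.Calls = append(l.Calls, "create "+kind+"/"+name)
	return nil
}

func (l *LogApplier) Update(kind, name string) error {
	l.Calls = append(l.Calls, "update "+kind+"/"+name)
	return nil
}

func (l *LogApplier) Delete(kind, name string) error {
	l.Calls = append(l.Calls, "delete "+kind+"/"+name)
	return nil
}
```

Because the callers depend only on the interface, the three features could share one implementation and be tested against a stand-in like `LogApplier`.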

@saza-ku
Contributor Author

saza-ku commented Dec 5, 2024

Okay, I'll do it first and make a PR.

@saza-ku
Contributor Author

saza-ku commented Dec 16, 2024

#376 is going to change oneshotimporter, so I first separated resourceapplier from syncer (#400).

Next I'll implement the replaying feature using it before fixing oneshotimporter.

@saza-ku saza-ku linked a pull request Jan 21, 2025 that will close this issue
3 participants