Update README.md #19

Open
wants to merge 1 commit into
base: feature/updateDocs
4 changes: 2 additions & 2 deletions README.md
@@ -36,14 +36,14 @@ Firstly, it will update the host config of `TrainingJob`,
In horovod elastic mode, it needs a script that return the host's topology , the change of hosts will notify the launcher, then and it will shutdown the worker process not in hosts gracefully.

After the hostFile updated, `et-operator` start to detect whether the launch process exist,
-when `et-operator` confirm that the scalein worker's launch process not exit, it will delete the worker's resource.
+when `et-operator` confirm that the scalein worker's launch process not exist, it will delete the worker's resource.

![ScaleIn](./docs/images/scalein.png)


#### ScaleOut
In `ScaleOut` CR, we can specify the trainingJob's name and the count that we want to scaleout.
-When `et-operator` start to execute the scalein operation,
+When `et-operator` start to execute the scaleout operation,
different from scaleIn, it will firstly create the new worker's resources.
After worker's resources ready, then update the hostFile.
