Add a dev doc for resuming SDG runs #136
Draft: bbrowning wants to merge 5 commits into instructlab:main from bbrowning:sdg-resume
Commits (5):
fd47916 Add a dev doc for resuming SDG runs (bbrowning)
236b087 Make linter and spellcheck happy (bbrowning)
d9cca57 Add some open questions to sdg-resume.md (bbrowning)
49d6f73 Add a diagram of proposed sdg resume flow (bbrowning)
507ab4a Forgot to end my mermaid diagram block... (bbrowning)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
# Resuming Interrupted Synthetic Data Generation Runs | ||
|
||
| | | | ||
| -------------- | ---------- | | ||
| Date | 2024-09-19 | | ||
| Scope | Synthetic Data Generation | | ||
| Status | Draft | | ||
| Authors | [Ben Browning](@bbrowning) | | ||
|
||
## What | ||
|
||
Synthetic Data Generation (SDG) should track its progress and allow | ||
users to resume interrupted runs without needing to start over from | ||
the beginning. | ||
|
||
## Why | ||
|
||
SDG jobs are long-running and may occasionally get interrupted. A job | ||
might be running on a cloud spot instance, or on a person's laptop or | ||
machine that needs to shut down for the day. It might also hit a GPU | ||
hardware failure, an out-of-storage condition, or any of numerous other | ||
problems that interrupt a run and cause the Python processes or the | ||
entire machine to die. | ||
|
||
Because these are long-running jobs that can take hours (or even days, | ||
in some cases), being able to resume from where we left off is | ||
important from a usability standpoint. Otherwise, an interrupted SDG | ||
run must start over, which hurts usability and wastes wall-clock time | ||
on rework. | ||
|
||
## Goals | ||
|
||
The user should be able to resume the generation process from where it | ||
left off, at any long-running point in the process. This includes resuming | ||
after individual blocks of a pipeline, even if the entire pipeline has | ||
not finished yet. | ||
|
||
The user should have a command like `ilab data generate --resume` to | ||
explicitly indicate they want to resume an interrupted SDG run. This | ||
is just an example, and the exact flag/CLI invocation to trigger this | ||
should be discussed. | ||
|
||
The following types of interruption should be handled: | ||
|
||
* user-triggered interruptions (e.g. CTRL-c) | ||
* SIGINT interruptions (e.g. when running on a spot instance or Pod | ||
which receives SIGINT) | ||
* system hardware failure (GPU failure, power loss) | ||
|
||
## Non-Goals | ||
|
||
This is not replacing checkpoints. Checkpoints are how we save each | ||
batch of generated samples after the entire pipeline has finished for | ||
that batch. This issue is about allowing resumption at any point after | ||
interruption, instead of only recovering entirely finished batches | ||
which get saved as checkpoints. | ||
|
||
## How | ||
|
||
The primary way we'll enable this is with additional metadata tracking | ||
and writing of intermediate state to disk instead of retaining it only | ||
in-memory during a pipeline's execution. | ||
|
||
Every run already has a `node_datasets_<datetime>` subdirectory where | ||
we store the generated samples for every taxonomy leaf node. We'll | ||
start by writing a new YAML file here containing the metadata of this | ||
generation run. This will include anything we need to know to resume | ||
the run, which may non-exhaustively include the taxonomy-path, | ||
taxonomy-base, model, chunk-word-count, sdg-scale-factor, pipeline | ||
(simple vs full vs path), batch-size, leaf nodes being generated, and | ||
perhaps even an entire copy of the pipeline config in use. We need | ||
enough metadata so that an `ilab data generate --resume` can resume | ||
everything exactly as it was before, without the need to pass in each | ||
set of original parameters. We will retain this metadata file even | ||
after the generation run completes. | ||
|
||
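As a rough illustration of the metadata file described above, here is a minimal sketch of writing and loading the run parameters. Everything named here is an assumption for illustration: the file name `generation_metadata.json`, the `GenerationMetadata` class, and its field names are hypothetical, and the format question is still open in this doc. JSON is used instead of the proposed YAML only to keep the sketch standard-library-only.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

# Hypothetical file name; the doc does not name the metadata file.
METADATA_FILENAME = "generation_metadata.json"


@dataclass
class GenerationMetadata:
    """Parameters needed to resume a run (field names are illustrative)."""

    taxonomy_path: str
    taxonomy_base: str
    model: str
    chunk_word_count: int
    sdg_scale_factor: int
    pipeline: str
    batch_size: int
    leaf_nodes: list[str] = field(default_factory=list)


def write_metadata(run_dir: Path, metadata: GenerationMetadata) -> Path:
    """Write the metadata file into the run's node_datasets_<datetime> directory."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / METADATA_FILENAME
    path.write_text(json.dumps(asdict(metadata), indent=2), encoding="utf-8")
    return path


def load_metadata(run_dir: Path) -> GenerationMetadata:
    """Load the saved parameters so a resume needs no original CLI arguments."""
    data = json.loads((run_dir / METADATA_FILENAME).read_text(encoding="utf-8"))
    return GenerationMetadata(**data)
```

A resume command could call `load_metadata` on the chosen run directory and rebuild its configuration from the result rather than from the command line.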
Next, we'll create a subdirectory under `node_datasets_<datetime>` | ||
called `intermediate_state` where we'll write all of our intermediate | ||
state files. After a successful generation run, we'll delete this | ||
`intermediate_state` subdirectory. | ||
|
||
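The lifecycle of the `intermediate_state` subdirectory could look roughly like the sketch below. The helper names and the JSON file format are hypothetical. The write-to-a-temporary-file-then-rename step is an assumption not stated in this doc; it is included because a power loss or killed process in the middle of a write could otherwise leave a truncated file that a later resume would mistake for finished output.

```python
import json
import os
import shutil
from pathlib import Path

INTERMEDIATE_DIRNAME = "intermediate_state"  # directory name taken from the doc


def intermediate_dir(run_dir: Path) -> Path:
    """Create (if needed) and return the intermediate state directory."""
    state_dir = run_dir / INTERMEDIATE_DIRNAME
    state_dir.mkdir(parents=True, exist_ok=True)
    return state_dir


def write_state_atomically(state_dir: Path, name: str, samples: list) -> Path:
    """Write samples to a temp file, then rename it into place.

    An interrupted write leaves only the .tmp file behind, never a
    partial file under the final name.
    """
    final_path = state_dir / name
    tmp_path = state_dir / (name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(samples, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, final_path)
    return final_path


def clean_intermediate_state(run_dir: Path) -> None:
    """Delete intermediate state after a successful generation run."""
    shutil.rmtree(run_dir / INTERMEDIATE_DIRNAME, ignore_errors=True)
```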
A diagram of the proposed flow for a pipeline that contains two | ||
blocks: | ||
|
||
```mermaid | ||
flowchart TD | ||
datagen[Data Generation] --> is_resuming{Resuming?} | ||
is_resuming -- No --> write_gen_config[Write Generation Config] | ||
is_resuming -- Yes --> load_gen_config[Load Generation Config] | ||
write_gen_config --> start_all_nodes[/For Each Leaf Node\] | ||
load_gen_config --> start_all_nodes | ||
start_all_nodes --> found_leaf_samples{Found Taxonomy Node Input Samples?} | ||
subgraph For Each Leaf Node | ||
found_leaf_samples -- No --> write_leaf_samples[Write Taxonomy Node Input Samples] | ||
found_leaf_samples -- Yes --> load_leaf_samples[Load Taxonomy Node Input Samples] | ||
write_leaf_samples --> found_block1_samples{Found Pipeline Block 1 Samples?} | ||
load_leaf_samples --> found_block1_samples | ||
found_block1_samples -- No --> write_block1_samples[Generate Pipeline Block 1 Samples] | ||
found_block1_samples -- Yes --> load_block1_samples[Load Pipeline Block 1 Samples] | ||
write_block1_samples --> found_block2_samples{Found Pipeline Block 2 Samples?} | ||
load_block1_samples --> found_block2_samples | ||
found_block2_samples -- No --> write_block2_samples[Generate Pipeline Block 2 Samples] | ||
found_block2_samples -- Yes --> load_block2_samples[Load Pipeline Block 2 Samples] | ||
write_block2_samples --> found_eval_task_data{Found Eval Task Data?} | ||
load_block2_samples --> found_eval_task_data | ||
found_eval_task_data -- No --> write_eval_task_data[Generate Eval Task Data] | ||
found_eval_task_data -- Yes --> load_eval_task_data[Load Eval Task Data] | ||
end | ||
write_eval_task_data --> wait_all_nodes[\Wait For All Leaf Nodes to Finish/] | ||
load_eval_task_data --> wait_all_nodes | ||
wait_all_nodes --> clean_state[Clean Intermediate State] | ||
clean_state --> finished[Generation Complete] | ||
``` | ||
|
||
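The "found it? load it, otherwise generate and write it" decision that repeats through the diagram can be sketched as one small helper applied to the input samples and then to each block in turn. This is only an illustration under stated assumptions: `load_or_generate`, `run_leaf_node`, the file naming scheme, and the JSON format are all hypothetical, blocks are modeled as plain functions from samples to samples, and the node name is assumed to be safe to use in a file name. The eval task data step from the diagram is omitted but would follow the same pattern.

```python
import json
from pathlib import Path
from typing import Callable

Samples = list[dict]
Block = Callable[[Samples], Samples]


def load_or_generate(path: Path, generate: Callable[[], Samples]) -> Samples:
    """Load saved samples if present; otherwise generate and save them."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    samples = generate()
    path.write_text(json.dumps(samples), encoding="utf-8")
    return samples


def run_leaf_node(
    state_dir: Path,
    node: str,
    read_inputs: Callable[[], Samples],
    blocks: list[Block],
) -> Samples:
    """Run one leaf node, skipping any step whose output is already on disk."""
    state_dir.mkdir(parents=True, exist_ok=True)
    samples = load_or_generate(state_dir / f"{node}_input.json", read_inputs)
    for i, block in enumerate(blocks, start=1):
        # Bind the current samples and block as defaults so the lambda
        # runs exactly this step if its output file is missing.
        samples = load_or_generate(
            state_dir / f"{node}_block{i}.json",
            lambda s=samples, b=block: b(s),
        )
    return samples
```

With this shape, an interrupted run that is started again redoes only the first step whose output file is missing, plus everything after it.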
## Open Questions | ||
|
||
* What's the format of the metadata config file? | ||
* Can a user resume any interrupted run? Or only the latest? | ||
* What do we do if the user explicitly passes in some CLI args that | ||
don't match what they initially used to start the generation run the | ||
first time? Log a warning? Error out? | ||
* What are all the intermediate state files we need? | ||
  * The list of all taxonomy leaf nodes in this run and their content? | ||
  * Initial samples from each leaf node (not generated samples, but | ||
    input samples) | ||
  * The output from each pipeline step | ||
* Should resume be explicit? Or implicit if we detect you have an | ||
unfinished generate run that used the same set of parameters? | ||
(Thinking about a K8s pod that gets killed and restarted by | ||
Kubernetes - it will get restarted with the exact same parameters vs | ||
knowing it explicitly needs to "resume" the second time) | ||
|
||
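For the question about CLI args that differ from the original run, both candidate behaviors (log a warning, or error out) can be sketched behind a single switch. This does not settle the question; the function names and the `strict` flag are hypothetical, and the saved parameters and CLI args are modeled as plain dictionaries where `None` means "not passed".

```python
import logging


def find_mismatches(saved: dict, cli_args: dict) -> dict:
    """Return {name: (saved_value, cli_value)} for explicitly passed args that differ."""
    return {
        name: (saved[name], value)
        for name, value in cli_args.items()
        if value is not None and name in saved and saved[name] != value
    }


def resolve_resume_args(saved: dict, cli_args: dict, strict: bool = False) -> dict:
    """Pick the parameters for a resumed run.

    strict=True errors out on any mismatch; strict=False logs a warning
    and keeps the values the interrupted run was started with.
    """
    mismatches = find_mismatches(saved, cli_args)
    if mismatches and strict:
        raise ValueError(f"CLI args differ from the interrupted run: {mismatches}")
    for name, (old, new) in mismatches.items():
        logging.warning("Ignoring %s=%r; resuming with saved value %r", name, new, old)
    return dict(saved)
```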
## Alternatives | ||
|
||
## Risks | ||
|
||
## Stakeholder Impacts | ||
|
||
| Group | Key Contact | Date | Impacted? | | ||
| ----------------------------- | --------------- | ---------- | --------- | | ||
| InstructLab CLI Team | ? | ? | ? | | ||
|
||
## References | ||
|
||
* [SDG Issue 267](https://github.com/instructlab/sdg/issues/267) | ||
|
||
## Reviews |
It might be helpful to give the context of how long a batch might take for comparison.
Yes, that's a good idea. And I also need to talk about how checkpoints persist across individual data generation runs, while this is focused on the ability to resume a specific interrupted SDG run.