# veScale Pipeline Parallel (PP)

## TLDR

<img src="../../public/guide/pp.png" alt="PP" width="700"/>

## What is PP?

`Pipeline Parallel` (`PP`) partitions the layers of a model across multiple devices to form a pipelined execution of training.
`PP` takes as input a list of microbatches of data per iteration and performs pipelined training execution (forward, backward, and optimizer update) on each microbatch, while overlapping communication with computation on each device.
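
To make the micro-batch mechanics concrete, below is a minimal single-process sketch in plain PyTorch (not veScale API): the `stages` list stands in for model partitions that would each live on a different device, and gradients accumulate across micro-batches before one optimizer update per iteration.

```python
import torch
import torch.nn as nn

# two toy pipeline "stages"; in real PP, stage k lives on device k
stages = [nn.Linear(16, 16), nn.Linear(16, 16)]
opt = torch.optim.SGD([p for s in stages for p in s.parameters()], lr=1e-2)

minibatch = torch.randn(8, 16)
microbatches = minibatch.chunk(4)  # 4 micro-batches per iteration

opt.zero_grad()
for mb in microbatches:        # each micro-batch flows through the pipeline
    act = mb
    for stage in stages:       # on real hardware, stages run concurrently and
        act = stage(act)       # overlap send/recv communication with compute
    loss = act.pow(2).mean() / len(microbatches)
    loss.backward()            # gradients accumulate across micro-batches
opt.step()                     # one optimizer update per iteration
```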

## Why veScale PP?

Existing `PP` systems suffer from multiple drawbacks, listed below, which prevent productization within a company:

- _Complex API_: assuming that model developers are also systems experts in `PP`

- _Hacking model code_: requiring manual rewrites of the model code to run `PP`

- _Lacking single device abstraction_: requiring the training script to be manually rewritten to be `PP` device-specific

- _Lacking options of pipeline construction_: relying on a single option only, whether graph tracing (sometimes requiring a perfect trace) or purely manual construction of the pipeline

- _Lacking customizability of pipeline schedule_: deeply coupling the entire runtime (e.g., compute, communication) with a specific `PP` schedule (e.g., `1F1B`)

- _Lacking diverse model support_: supporting only sequential model architectures without branching, or only pipeline stages with a single input and a single output

## What is veScale PP?

`veScale PP` offers a new `PP` framework that is both _**Easy-to-Use**_ and _**Easy-to-Customize**_; as such, it is used internally in our production.
In particular, `veScale PP` provides:

- _Easy API_: hiding the complexity of `PP` systems and runtimes from model developers

- _Zero model code change_: keeping the original torch model code as-is for transparent model pipelining

- _Single device abstraction_: keeping the single-device training script as-is for transparent pipelined training on multiple devices

- _Multiple options of pipeline construction_: users can flexibly choose between two modes:

  - `GRAPH_EAGER` mode automatically traces and parses the model into a graph, splits the graph into pipeline stages, and constructs each stage for pipeline execution

    - the graph tracer to use can also be chosen by the user

  - `MANUAL_EAGER` mode manually constructs each pipeline stage for pipeline execution, without graph tracing, parsing, or splitting

- _Customizable pipeline schedule_: empowering users to define their own pipeline schedules beyond the built-in schedules below (see the `1F1B` sketch after this list):

  - `1F1B`

  - `Interleaved 1F1B`

  - `Zero Bubble`

- _Diverse model support_: supporting comprehensive model architectures, including non-sequential models and stages with multiple inputs and outputs
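
For intuition, here is a minimal sketch (not veScale's implementation) of the operation order produced by the classic `1F1B` schedule for a single stage: a warm-up phase of forwards, a steady one-forward-one-backward phase, then a cool-down phase of backwards.

```python
# A toy generator (illustrative only) of the per-stage operation order
# under the classic 1F1B schedule. ("F", i) / ("B", i) denote the forward
# / backward pass of micro-batch i on this stage.
def one_f_one_b_order(stage_id: int, num_stages: int, num_microbatches: int):
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops = [("F", i) for i in range(warmup)]              # warm-up forwards
    for i in range(steady):                              # steady state: 1F1B
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    ops += [("B", steady + i) for i in range(warmup)]    # cool-down backwards
    return ops

# stage 0 of a 4-stage pipeline with 8 micro-batches:
# F F F F B F B F B F B F B B B B
print([op for op, _ in one_f_one_b_order(0, 4, 8)])
```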

## Why is veScale PP a better option than its counterparts?

- Compared with Megatron-LM's `PP`, `veScale PP` offers not only a better __Ease-of-Use__ experience in all aspects (easy API, zero model code change, single device abstraction, options of pipeline construction) but also added __Customizability__, allowing users to conveniently define new pipeline schedules.

- Compared with DeepSpeed, `veScale PP` requires no modification of model code. It further supports multi-stage scheduling for non-sequential, multimodal architectures and multi-input settings, rather than being constrained by `nn.Sequential`'s syntax.

- Compared with the pre-release torchtitan, `veScale PP` provides: i) a single device abstraction of the training script, ii) wider graph tracer support, iii) wider model architecture support, and iv) guaranteed bitwise accuracy alignment between `PP` and single-device code.

## How does veScale PP work?

Spinning up a `PP` job typically requires three steps: i) trace and parse the model graph, ii) construct the pipeline stages, and iii) execute the pipeline schedule. These steps are handled by `PipeParser`, `PipeModule`, and `PipeEngine`, respectively. Upon receiving the model definition, `PipeParser` (under `GRAPH_EAGER` mode) breaks the model code down into an intermediate representation of low-level modules and operators, up to a granularity of your choice. Under `MANUAL_EAGER` mode, users instead assign stage modules and their communication relationships manually. `PipeModule` collects the parameters, operators, and optimizer states belonging to the same stage, and resolves the communication topology among devices. `PipeEngine` then schedules and executes the training steps according to the chosen pipeline schedule.
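
Schematically, these steps map onto the public API used in the example below; under `GRAPH_EAGER` mode, `PipeParser` is invoked internally on your behalf:

```python
from vescale.pipe import construct_pipeline_stage
from vescale.engine import PipeEngine

# steps i + ii: parse and split the model graph, then construct the pipeline stage
pipe_stage = construct_pipeline_stage(model, pipe_plan, device_mesh)
# step iii: drive the chosen pipeline schedule over micro-batches
engine = PipeEngine(pipe_stage, device_mesh, pipe_plan)
```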

## How to use veScale PP?

- Example of using `GRAPH_EAGER` mode:

```python
# zero model code change
class EightMLP(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.mlp1 = MLP(...)
        ...
        self.mlp8 = MLP(...)

    def forward(...):
        ...

# An EightMLP is composed of 8 submodules called MLP
model = EightMLP()
# or model = deferred_init(EightMLP)

from vescale.plan import PipelineParallelPlan, PipelineScheduleType, PipelineSplitMethodType, ModeType
from vescale.pipe import construct_pipeline_stage
from vescale.engine import PipeEngine
from vescale.dtensor.device_mesh import DeviceMesh

# create a 3-dim DeviceMesh: 4 PP stages, each holding a 1x1 (DP x TP) submesh
device_mesh = DeviceMesh("cuda", [[[0]], [[1]], [[2]], [[3]]], mesh_dim_names=("PP", "DP", "TP"))

# prepare a plan for pipeline parallelism
pipe_plan = PipelineParallelPlan(
    mode=ModeType.GRAPH_EAGER,
    split_method=PipelineSplitMethodType.MANUAL,
    num_stages=4,
    virtual_chunks=2,
    smallest_unsplittable_units=[f"mlp{i + 1}" for i in range(8)],  # maintain the hierarchy of each MLP module
    split_points=["mlp2", "mlp4", "mlp6", "mlp8"],  # pipeline split points, given by fully qualified names
    overlap_p2p_comm=True,  # speedup option
    schedule_type=PipelineScheduleType.INTERLEAVED_1F1B,
)

# parse the model graph, split it, and construct the pipeline stage
pipe_stage = construct_pipeline_stage(model, pipe_plan, device_mesh)

# prepare the pipeline schedule and execution engine
engine = PipeEngine(pipe_stage, device_mesh, pipe_plan)

# train the PP model as if on a single device
for minibatch_data in dataloader:
    minibatch_loss, microbatch_outputs = engine(minibatch_data)
    minibatch_loss.backward()
    ...
```

- Example of using `MANUAL_EAGER` mode: Coming Soon.

- APIs can be found in `<repo>/vescale/pipe/pipe_stage.py` and `<repo>/vescale/pipe/pipe.py`

- More examples can be found in `<repo>/test/parallel/pipeline/api/test_simple_api.py`