Seq1F1B: Efficient Pipeline Parallelism for Long Sequence Training

Welcome to the Seq1F1B repository! Seq1F1B is an efficient sequence-level 1F1B (one-forward-one-backward) pipeline scheduling method designed to enhance the distributed training of large language models (LLMs) on long sequences. This method builds upon the Megatron-LM framework and introduces novel strategies to optimize memory usage and reduce pipeline bubbles.

Introduction

In recent years, pipeline parallelism has become essential for the distributed training of LLMs. However, as training sequences grow to 32k and even 128k tokens, existing pipeline parallel methods often suffer from high memory footprints and large pipeline bubbles, which hurt training efficiency. Seq1F1B addresses these challenges by decomposing batch-level schedulable units into finer-grained sequence-level units, thereby improving workload balance and reducing memory usage.
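To make the idea of sequence-level units concrete, here is a minimal sketch in Python. It is not the scheduler shipped in this repository: the function names are hypothetical, and the even token split is an assumption made for simplicity, so the partitioning Seq1F1B actually uses may differ.

```python
from typing import List, Tuple


def split_sequence(seq_len: int, num_segments: int) -> List[Tuple[int, int]]:
    """Return [start, end) token ranges that cut one sequence into contiguous segments."""
    if num_segments < 1 or num_segments > seq_len:
        raise ValueError("num_segments must be between 1 and seq_len")
    base, extra = divmod(seq_len, num_segments)
    bounds, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments take one additional token each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def schedulable_units(num_micro_batches: int, num_segments: int) -> List[Tuple[int, int]]:
    """List forward units as (micro_batch, segment) pairs in forward order.

    With causal attention, segment j attends to the keys/values of segments < j,
    so the segments of one micro-batch run forward in increasing order.
    """
    return [(mb, seg) for mb in range(num_micro_batches) for seg in range(num_segments)]


# Illustrative values only: one 32k-token sequence cut into 4 segments.
print(split_sequence(seq_len=32768, num_segments=4))
print(len(schedulable_units(num_micro_batches=2, num_segments=4)))
```

With 2 micro-batches and 4 segments, the pipeline has 8 smaller units to schedule instead of 2 large ones, which is what gives the scheduler finer control over workload balance and activation memory.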

Seq1F1B supports efficient training of a 30B parameter LLM on sequences of up to 64k tokens using 64 NVIDIA A100 GPUs without requiring recomputation strategies. This level of performance is unattainable with current pipeline parallel methods.

Key Features

Fine-Grained Schedulable Units: Decomposes batch-level units into sequence-level units to enhance workload balance.
Reduced Memory Footprint: Optimized memory usage allows for training on longer sequences.
Smaller Pipeline Bubbles: Efficient scheduling reduces idle time during pipeline stages.
Scalable Training: Capable of training LLMs with up to 30B parameters on sequences of up to 64k tokens using 64 NVIDIA A100 GPUs.
Based on Megatron-LM: Built upon the robust and widely used Megatron-LM.
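
The "Smaller Pipeline Bubbles" point can be illustrated with a back-of-the-envelope model. The sketch below is a textbook idealization, not a measurement from this repository: it assumes every schedulable unit takes the same time and that splitting a micro-batch into `num_segments` pieces produces that many equal units. The function name and parameters are hypothetical.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int, num_segments: int = 1) -> float:
    """Idle fraction of an idealized 1F1B-style pipeline.

    Assumes all schedulable units take equal time, so a pipeline with
    `num_stages` stages and `units` units is idle for (num_stages - 1)
    unit-slots out of (units + num_stages - 1) in total.
    """
    if min(num_stages, num_micro_batches, num_segments) < 1:
        raise ValueError("all arguments must be positive")
    units = num_micro_batches * num_segments
    return (num_stages - 1) / (units + num_stages - 1)


# Illustrative values only: 8 pipeline stages, 16 micro-batches.
for k in (1, 2, 4):
    print(k, round(bubble_fraction(num_stages=8, num_micro_batches=16, num_segments=k), 3))
```

Under these assumptions the idle fraction shrinks as the number of segments per micro-batch grows, because the same work is spread over more, shorter units. Real schedules deviate from this model, for example because later segments of a causal sequence cost more attention compute than earlier ones.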


Figures: Seq1F1B timeline; Seq1F1B-I timeline; Seq1F1B-ZB timeline.

Installation

To get started with Seq1F1B, follow these steps:

Clone the repository:

```
git clone https://github.com/thunlp/Seq1F1B.git
cd Seq1F1B
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Seq1F1B integrates seamlessly with Megatron-LM. Here's an example of how to run training with Seq1F1B:

Prepare your dataset and configuration files. Here, we take codeparrot as an example:

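```
from datasets import load_dataset

# Export the codeparrot training split as JSON Lines for preprocessing.
train_data = load_dataset('codeparrot/codeparrot-clean-train', split='train')
train_data.to_json("codeparrot_data.json", lines=True)
```

Then tokenize the data with the preprocessing script. This assumes the GPT-2 vocab.json and merges.txt files are in the working directory:
```
pip install nltk
python tools/preprocess_data.py \
    --input codeparrot_data.json \
    --output-prefix codeparrot \
    --vocab vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file merges.txt \
    --json-keys content \
    --workers 32 \
    --chunk-size 25 \
    --append-eod
```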
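
Before running the preprocessing step, it can help to confirm that the exported JSON Lines file contains the field named by `--json-keys`. The helper below is a hypothetical convenience, not part of this repository, and uses only the Python standard library.

```python
import json


def check_jsonl_key(path: str, key: str = "content", max_lines: int = 100) -> int:
    """Check that the first `max_lines` records in a JSON Lines file contain `key`.

    Returns the number of records inspected; raises KeyError on the first miss.
    """
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if checked >= max_lines:
                break
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if key not in record:
                raise KeyError(f"line {line_no} of {path} has no '{key}' field")
            checked += 1
    return checked
```

For example, `check_jsonl_key("codeparrot_data.json")` inspects the first 100 records of the file produced above and returns how many it checked.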

Run the training script with Seq1F1B pipeline parallelism:
```
bash exp.sh 
```

For detailed usage instructions and configuration options, please refer to our documentation.

Results

Seq1F1B demonstrates significant improvements over existing methods:

Memory Efficiency: Reduced memory footprint allows training on longer sequences without recomputation.
Performance: Achieves smaller pipeline bubbles, resulting in faster training times.
Scalability: Supports large-scale training on modern GPU clusters.


Figure: Memory usage comparison between Seq1F1B and existing methods.

Contributing

We welcome contributions from the community! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.

License

Seq1F1B is released under the original Megatron-LM license. See the LICENSE file for more details.

Acknowledgments

Seq1F1B builds upon the Megatron-LM framework. We thank the Megatron-LM development team for their excellent work and support.

Future Work

We plan to release our code and further improvements to advance the training of LLMs on long sequences. Stay tuned for updates!

Citation

If you use this codebase or otherwise find our work valuable, please cite:

@article{ao2024seq1f1b,
  title={Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training},
  author={Ao, Sun and Zhao, Weilin and Han, Xu and Yang, Cheng and Liu, Zhiyuan and Shi, Chuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2406.03488},
  year={2024}
}
