cleaned pages and added specInfer (#5)

yingyee0111 authored Jun 26, 2023
1 parent 97a968b commit 889a925
Showing 11 changed files with 163 additions and 100 deletions.
7 changes: 6 additions & 1 deletion Gemfile
@@ -1,2 +1,7 @@
 source "https://rubygems.org"
-gemspec
+gemspec
+gem 'jekyll-include-cache'
+gem 'jekyll-feed'
+gem 'jekyll-gist'
+gem 'jekyll-sitemap'
+gem 'jekyll-paginate'
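The new gems are picked up the next time the site's dependencies are resolved. A typical local preview, assuming Ruby and Bundler are already installed, looks like:

```
bundle install            # pull in the newly added jekyll-* plugins
bundle exec jekyll serve  # build and serve the site locally
```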
7 changes: 5 additions & 2 deletions _config.yml
@@ -108,12 +108,15 @@ author:
   #- label: "Twitter"
   #  icon: "fab fa-fw fa-twitter-square"
   #  url: "https://twitter.com/"
-  #- label: "Facebook"
+  # - label: "Facebook"
   #  icon: "fab fa-fw fa-facebook-square"
-  #  url: "https://facebook.com/"
+  #  url: "https://facebook.com/"
+  - label: "FlexFlow GitHub"
+    icon: "fab fa-fw fa-github"
+    url: "https://github.com/flexflow/flexflow"
   - label: "FlexFlow Documentation"
     icon: "fas fa-fw fa-sticky-note"
     url: "http://flexflow.readthedocs.io/"
   #- label: "Instagram"
   #  icon: "fab fa-fw fa-instagram"
   #  url: "https://instagram.com/"
11 changes: 2 additions & 9 deletions _data/navigation.yml
@@ -2,14 +2,7 @@
 main:
   - title: "About"
     url: /about/
-  - title: "Bootcamp 2020"
-    url: /bootcamp/
   - title: "Get Started"
     url: /start/
-  - title: "Autotuning"
-    url: /search/
-  - title: "Keras Support"
-    url: /keras/
-  - title: "GNN"
-    url: /gnn/
-
+  - title: "SpecInfer"
+    url: /specInfer/
9 changes: 3 additions & 6 deletions _pages/about.md
@@ -19,10 +19,7 @@ FlexFlow provides the following key features:
 
 * **Flexible Parallelization**. FlexFlow supports parallelizing DNN training through combinations of the [Sample, Operator, Attribute, and Parameter](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf) dimensions, and guarantees that different parallelization strategies maintain the same model accuracy by design.
 
-* **Performance Autotuning**. To accelerate DNN training on a specific parallel machine, FlexFlow uses guided randomized search to automatically find fast parallelization strategies while requiring no manual effort.
-
-* **Keras Support**. FlexFlow offers a drop-in replacement for TensorFlow Keras and transparently accelerates existing Keras programs by discovering faster parallelization strategies.
-
-* **Large-Scale GNNs**. FlexFlow enables fast graph neural network (GNN) training on large graphs (e.g., billion-edge) by distributing GNN computations across multiple GPUs (potentially on multiple compute nodes) using [attribute parallelism](https://cs.stanford.edu/~zhihao/papers/mlsys20.pdf).
-
+* **Joint Optimization**. FlexFlow uses a novel hierarchical search algorithm
+to jointly optimize [algebraic transformations and parallelization](https://www.cs.cmu.edu/~zhihaoj2/papers/unity_osdi22.pdf) while maintaining scalability.
 
+* **Speculative Inference**. FlexFlow accelerates generative LLM inference with [speculative inference and token tree verification](https://arxiv.org/abs/2305.09781).
32 changes: 0 additions & 32 deletions _pages/bootcamp.md

This file was deleted.

45 changes: 45 additions & 0 deletions _pages/specInfer.md
@@ -0,0 +1,45 @@
---
title: SpecInfer
layout: single
permalink: /specInfer/
classes: wide
#toc: true
#toc_sticky: true
author_profile: true
header:
overlay_image: /assets/images/header.jpg
---

# SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

## What is SpecInfer
The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. SpecInfer is an open-source distributed multi-GPU system that accelerates generative LLM inference with speculative inference and token tree verification.

<figure>
<img src="/assets/images/spec_infer_demo.gif">
</figure>

A key insight behind SpecInfer is to combine various collectively boost-tuned small speculative models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM’s output in parallel using a novel tree-based parallel decoding mechanism.

<figure>
<img src="/assets/images/spec_infer_overview.png">
</figure>

SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end inference latency and computational requirements for serving generative LLMs while provably preserving model quality.

<p align="center">
<img align="center" src="/assets/images/spec_infer_performance.png" width="500px" />
</p>

## Build/Install SpecInfer
SpecInfer is built on top of FlexFlow: you can build/install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](https://github.com/flexflow/FlexFlow/blob/master/INSTALL.md) for building/installing FlexFlow from source. If you would like to try SpecInfer quickly, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend and [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCm backend) with all dependencies pre-installed, together with [Dockerfiles](./docker) if you wish to build the containers manually. Note that the pre-built CUDA containers are currently only fully compatible with host machines that have CUDA 11.7 installed.
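For example, the CUDA container can be pulled and started as follows; the image path follows the package links above, while the `latest` tag is an assumption, so substitute whichever tag matches your setup:

```
# pull the pre-built image (requires Docker with the NVIDIA Container Toolkit)
docker pull ghcr.io/flexflow/flexflow-cuda:latest
# start an interactive shell with all GPUs visible to the container
docker run --gpus all -it --rm ghcr.io/flexflow/flexflow-cuda:latest
```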

## Run SpecInfer
The source code of the SpecInfer pipeline is available in [this GitHub folder](https://github.com/flexflow/FlexFlow/tree/inference/inference/spec_infer). After compilation, the SpecInfer executable is available at `/build_dir/inference/spec_infer/spec_infer`.
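As a sketch, a single-node launch might look like the following; the `-ll:*` resource flags mirror those used for other FlexFlow executables on this site, and the model- and prompt-selection options (documented on the GitHub page below) are omitted:

```
# illustrative resource flags only; add the model and prompt options
# from the GitHub page below for a real run
./build_dir/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
```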

You may refer to our [GitHub page](https://github.com/flexflow/FlexFlow/blob/inference/.github/README.md) for details on examples, tokenizer support, mixed-precision support, and more.

## Paper
This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting SpecInfer and the underlying FlexFlow runtime system. The following paper describes the design, implementation, and key optimizations of SpecInfer.

* Xupeng Miao*, Gabriele Oliaro*, Zhihao Zhang*, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. [SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification](https://arxiv.org/abs/2305.09781).
132 changes: 92 additions & 40 deletions _pages/start.md
@@ -10,8 +10,9 @@ header:
   overlay_image: /assets/images/header.jpg
 
 ---
-FlexFlow can be built from source code using the following instructions.
 
+# Installing FlexFlow
+FlexFlow can be built from source code using the following instructions.
 
 ## Prerequisites
 * [CUDNN](https://developer.nvidia.com/cudnn) is used to perform low-level operations.
@@ -25,66 +26,117 @@ Download and install CUDNN locally.
 
 * (Optional) [GASNet](http://gasnet.lbl.gov) is used for multi-node executions. (see [GASNet installation instructions](http://legion.stanford.edu/gasnet))
 
-## Build the FlexFlow Runtime
-
-* To get started, clone the FlexFlow source code from the stable branch on github.
+## 1. Download the source code
+Clone the FlexFlow source code and its third-party dependencies from GitHub.
 ```
-git clone -b r20.08 --recursive https://github.com/flexflow/FlexFlow.git
-cd FlexFlow
+git clone --recursive https://github.com/flexflow/FlexFlow.git
 ```
-The `FF_HOME` environment variable is used for building and running FlexFlow. You can add the following line in `~/.bashrc`.
 
+## 2. Install system dependencies
+FlexFlow has system dependencies on CUDA and/or ROCm, depending on which GPU backend you target. The GPU backend is configured by the CMake variable `FF_GPU_BACKEND`. By default, FlexFlow targets CUDA. `docker/base/Dockerfile` installs the system dependencies on a standard Ubuntu system.
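For example, to target ROCm instead of the default CUDA backend, the variable can be exported before the build is configured in step 4:

```
# select the HIP/ROCm backend (the default is cuda)
export FF_GPU_BACKEND=hip_rocm
```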

+### Targeting CUDA - `FF_GPU_BACKEND=cuda`
+If you are targeting CUDA, FlexFlow requires CUDA and CUDNN to be installed. You can follow the standard NVIDIA installation instructions for [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) and [CUDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html).
+
+Disclaimer: CUDA architectures < 60 (Maxwell and older) are no longer supported.
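A quick sanity check of the CUDA toolchain before building, assuming CUDA's `bin` directory is on your `PATH`:

```
# print the installed CUDA compiler version
nvcc --version
# confirm the driver can see your GPU(s)
nvidia-smi
```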

+### Targeting ROCM - `FF_GPU_BACKEND=hip_rocm`
+If you are targeting ROCM, FlexFlow requires a ROCM and HIP installation with a few additional packages. Note that this can be done on a system with or without an AMD GPU. You can follow the standard installation instructions for [ROCM](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.3/page/Introduction_to_ROCm_Installation_Guide_for_Linux.html) and [HIP](https://docs.amd.com/bundle/HIP-Installation-Guide-v5.3/page/Introduction_to_HIP_Installation_Guide.html). When running `amdgpu-install`, install the `hip` and `rocm` use cases. You can skip installing the kernel drivers (not necessary on systems without an AMD graphics card) with `--no-dkms`, i.e. `amdgpu-install --usecase=hip,rocm --no-dkms`. Additionally, install the packages `hip-dev`, `hipblas`, `miopen-hip`, and `rocm-hip-sdk`.
+
+See `./docker/base/Dockerfile` for an example ROCM install.
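On Ubuntu, for instance, the steps described above might look like:

```
# install the hip and rocm use cases, skipping the kernel drivers
sudo amdgpu-install --usecase=hip,rocm --no-dkms
# additional packages required by FlexFlow
sudo apt-get install hip-dev hipblas miopen-hip rocm-hip-sdk
```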

+### Targeting CUDA through HIP - `FF_GPU_BACKEND=hip_cuda`
+This is not currently supported.
+
+## 3. Install the Python dependencies
+If you are planning to build the Python interface, you will need to install several additional Python libraries; please check [this list](https://github.com/flexflow/FlexFlow/blob/master/requirements.txt) for details. If you are only looking to use the C++ interface, you can skip to the next section.
+
+**We recommend that you create your own `conda` environment and then install the Python dependencies, to avoid version mismatches with your system's pre-installed libraries.**
+
+The `conda` environment can be created and activated as:
 ```
-export FF_HOME=/path/to/FlexFlow
+conda env create -f conda/environment.yml
+conda activate flexflow
 ```
 
-* Build the Protocol Buffer library.
-Skip this step if the Protocol Buffer library is already installed.
+## 4. Configuring the FlexFlow build
+You can configure a FlexFlow build by running the `config/config.linux` file in the build folder. If you do not want to build with the default options, you can set your configuration by passing (or exporting) the relevant environment variables. We recommend that you spend some time familiarizing yourself with the available options by scanning the `config/config.linux` file. In particular, the main parameters are:
+
+1. `CUDA_DIR` is used to specify the directory of CUDA. It is only required when CMake cannot automatically detect the installation directory of CUDA.
+2. `CUDNN_DIR` is used to specify the directory of CUDNN. It is only required when CUDNN is not installed in the CUDA directory.
+3. `FF_CUDA_ARCH` is used to set the architecture of the targeted GPUs; for example, the value can be 60 if the GPU architecture is Pascal. To build for more than one architecture, pass a list of comma-separated values (e.g. `FF_CUDA_ARCH=70,75`). To compile FlexFlow for all GPU architectures that are detected on the machine, pass `FF_CUDA_ARCH=autodetect` (this is the default value, so you can also leave `FF_CUDA_ARCH` unset). If you want to build for all GPU architectures compatible with FlexFlow, pass `FF_CUDA_ARCH=all`. **If your machine does not have any GPU, you have to set FF_CUDA_ARCH to at least one valid architecture code (or `all`)**, since the compiler won't be able to detect the architecture(s) automatically.
+4. `FF_USE_PYTHON` controls whether to build the FlexFlow Python interface.
+5. `FF_USE_NCCL` controls whether to build FlexFlow with NCCL support. By default, it is set to ON.
+6. `FF_LEGION_NETWORKS` is used to enable distributed runs of FlexFlow. If you want to run FlexFlow on multiple nodes, follow the instructions in [MULTI-NODE.md](MULTI-NODE.md) and set the corresponding parameters as follows:
+   * To build FlexFlow with GASNet, set `FF_LEGION_NETWORKS=gasnet` and `FF_GASNET_CONDUIT` to a specific conduit (e.g. `ibv`, `mpi`, `udp`, `ucx`) in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
+   * To build FlexFlow with native UCX, set `FF_LEGION_NETWORKS=ucx` in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
+7. `FF_BUILD_EXAMPLES` controls whether to build all C++ example programs.
+8. `FF_MAX_DIM` is used to set the maximum dimension of tensors; by default it is set to 4.
+9. `FF_USE_{NCCL,LEGION,ALL}_PRECOMPILED_LIBRARY` controls whether to build FlexFlow using a pre-compiled version of the Legion library, the NCCL library (if `FF_USE_NCCL` is `ON`), or both. By default, `FF_USE_NCCL_PRECOMPILED_LIBRARY` and `FF_USE_LEGION_PRECOMPILED_LIBRARY` are both set to `ON`, allowing you to build FlexFlow faster. If you want to build Legion and NCCL from source, set them to `OFF`.
+
+More options are available in CMake; please run `ccmake` and search for options starting with FF.
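For instance, a Python-enabled build for Volta and Turing GPUs that uses native UCX networking could be configured as follows (the values are illustrative, not required):

```
mkdir -p build && cd build
# these variables are read by config/config.linux; see the list above
FF_CUDA_ARCH=70,75 FF_USE_PYTHON=ON FF_USE_NCCL=ON FF_LEGION_NETWORKS=ucx ../config/config.linux
```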

+## 5. Build FlexFlow
+You can build FlexFlow in three ways: with CMake, with Make, and with `pip`. We recommend the CMake build system, as it will automatically build all C++ dependencies, including NCCL and Legion.
+
+### Building FlexFlow with CMake
+To build FlexFlow with CMake, go to the FlexFlow home directory, and run
 ```
-cd protobuf
-./autogen.sh
-./configure
-make
+mkdir build
+cd build
+../config/config.linux
+make -j N
 ```
-* Build the NCCL library. (If using NCCL for parameter synchronization.)
+where N is the desired number of threads to use for the build.
 
+### Building FlexFlow with pip
+To build FlexFlow with `pip`, run `pip install .` from the FlexFlow home directory. This command will build FlexFlow and also install the Python interface as a Python module.
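For example, from the FlexFlow home directory (ideally inside the `conda` environment created in step 3):

```
cd "$FF_HOME"
pip install .
```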

+### Building FlexFlow with Make
+The Makefile we provide is mainly for development purposes, and may not be fully up to date. To use it, run:
 ```
-cd nccl
-make -j src.build NVCC_GENCODE="-gencode=arch=compute_XX,code=sm_XX"
+cd python
+make -j N
 ```
-Replace XX with the compute capability of your GPU devices (e.g., 70 for Volta GPUs and 60 for Pascal GPUs).
 
-* For users interested in using the FlexFlow C++ interface, the following command line builds a DNN model (e.g., InceptionV3).
-See the [examples](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp) folders for more FlexFlow applications implemented using the C++ interface.
+## 6. Test FlexFlow
+After building FlexFlow, you can test it to ensure that the build completed without issue and that your system is ready to run FlexFlow.
 
+### Set the `FF_HOME` environment variable before running FlexFlow. To make it permanent, you can add the following line to `~/.bashrc`.
 ```
-./ffcompile.sh examples/cpp/InceptionV3
+export FF_HOME=/path/to/FlexFlow
 ```
 
-## Build the FlexFlow Keras Frontend
+### Run FlexFlow Python examples
+The Python examples are in [examples/python](https://github.com/flexflow/FlexFlow/tree/master/examples/python). The native, Keras-integration, and PyTorch-integration examples are in the `native`, `keras`, and `pytorch` subfolders, respectively.
 
+To run the Python examples, you have two options: you can use the `flexflow_python` interpreter, available in the `python` folder, or you can use the native Python interpreter. If you choose to use the native Python interpreter, you should either install FlexFlow or, if you prefer to build without installing, export the following flags (a combined example follows the list):
 
-Alternatively, FlexFlow also supports the Keras Python interface. The following instructions build the FlexFlow Python executable.
+* `export PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/python"`
+* `export FF_USE_NATIVE_PYTHON=1`
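Putting the two flags together, a native-Python run of the MNIST example might look like this (buffer sizes are in MB, as explained below):

```
export PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/python"
export FF_USE_NATIVE_PYTHON=1
# a sketch: the runtime flags mirror the flexflow_python invocation below
python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 8000 -ll:zsize 8000
```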

-* Get the FlexFlow source code using the same instruction as above.
+**We recommend that you run the** `mnist_mlp` **test under** `native` **using the following command to check if FlexFlow has been installed correctly:**
 
-* Set the following environment variables
 ```
-export FF_HOME=/path/to/FlexFlow
-export CUDNN_DIR=/path/to/cudnn
-export CUDA_DIR=/path/to/cuda
-export PROTOBUF_DIR=/path/to/protobuf
-export LG_RT_DIR=/path/to/Legion
+cd "$FF_HOME"
+./python/flexflow_python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
 ```
-To expedite the compilation, you can also set the `GPU_ARCH` environment variable.
+A script to run all the Python examples is available at `tests/multi_gpu_tests.sh`.
 
+### Run FlexFlow C++ examples
+
+The C++ examples are in [examples/cpp](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp).
+For example, AlexNet can be run as:
 ```
-export GPU_ARCH=your_gpu_arch
+./alexnet -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
 ```
-If Legion cannot automatically detect your Python installation, you need to tell Legion manually by setting `PYTHON_EXE`, `PYTHON_LIB` and `PYTHON_VERSION_MAJOR`; please refer to `python/Makefile` for more details.
 
-* Build the FlexFlow Python executable using the following command lines.
-```
-cd python
-make
-```
+Buffer sizes are in MB; e.g., for an 8 GB GPU, use `-ll:fsize 8000`.
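For example, on a single 8 GB GPU one might run:

```
# fsize follows the 8 GB guideline above; zsize is workload-dependent (illustrative)
./alexnet -ll:gpu 1 -ll:fsize 8000 -ll:zsize 2048
```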

* To run a DNN model, use the following command line.
## 7. Install FlexFlow
If you built/installed FlexFlow using `pip`, this step is not required. If you built using Make or CMake, install FlexFlow with:
```
./flexflow_python examples/python/keras/xxx.py -ll:py 1 -ll:gpu 1 -ll:fsize size of gpu buffer -ll:zsize size of zero buffer
```
cd build
make install
```
Binary file added assets/images/spec_infer_demo.gif
Binary file added assets/images/spec_infer_overview.png
Binary file added assets/images/spec_infer_performance.png
