cleaned pages and added specInfer (#5)

yingyee0111 authored Jun 26, 2023
1 parent 97a968b commit 889a925
Showing 11 changed files with 163 additions and 100 deletions.
7 changes: 6 additions & 1 deletion Gemfile
@@ -1,2 +1,7 @@
 source "https://rubygems.org"
-gemspec
+gemspec
+gem 'jekyll-include-cache'
+gem 'jekyll-feed'
+gem 'jekyll-gist'
+gem 'jekyll-sitemap'
+gem 'jekyll-paginate'
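The new gems are picked up the next time the site's dependencies are resolved. A typical local preview, assuming Ruby and Bundler are already installed, looks like:

```
bundle install            # pull in the newly added jekyll-* plugins
bundle exec jekyll serve  # build and serve the site locally
```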
7 changes: 5 additions & 2 deletions _config.yml
@@ -108,12 +108,15 @@ author:
   #- label: "Twitter"
   #  icon: "fab fa-fw fa-twitter-square"
   #  url: "https://twitter.com/"
-  #- label: "Facebook"
+  # - label: "Facebook"
   #  icon: "fab fa-fw fa-facebook-square"
-  #  url: "https://facebook.com/"
+  #  url: "https://facebook.com/"
+  - label: "FlexFlow GitHub"
+    icon: "fab fa-fw fa-github"
+    url: "https://github.com/flexflow/flexflow"
   - label: "FlexFlow Documentation"
     icon: "fas fa-fw fa-sticky-note"
     url: "http://flexflow.readthedocs.io/"
   #- label: "Instagram"
   #  icon: "fab fa-fw fa-instagram"
   #  url: "https://instagram.com/"
11 changes: 2 additions & 9 deletions _data/navigation.yml
@@ -2,14 +2,7 @@
 main:
   - title: "About"
     url: /about/
-  - title: "Bootcamp 2020"
-    url: /bootcamp/
   - title: "Get Started"
     url: /start/
-  - title: "Autotuning"
-    url: /search/
-  - title: "Keras Support"
-    url: /keras/
-  - title: "GNN"
-    url: /gnn/
-
+  - title: "SpecInfer"
+    url: /specInfer/
9 changes: 3 additions & 6 deletions _pages/about.md
@@ -19,10 +19,7 @@ FlexFlow provides the following key features:
 
 * **Flexible Parallelization**. FlexFlow supports parallelizing DNN training through combinations of the [Sample, Operator, Attribute, and Parameter](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf) dimensions, and guarantees that different parallelization strategies maintain the same model accuracy by design.
 
-* **Performance Autotuning**. To accelerate DNN training on a specific parallel machine, FlexFlow uses guided randomized search to automatically find fast parallelization strategies while requiring no manual effort.
-
-* **Keras Support**. FlexFlow offers a drop-in replacement for TensorFlow Keras and transparently accelerates existing Keras programs by discovering faster parallelization strategies.
-
-* **Large-Scale GNNs**. FlexFlow enables fast graph neural network (GNN) training on large graphs (e.g., billion-edge) by distributing GNN computations across multiple GPUs (potentially on multiple compute nodes) using [attribute parallelism](https://cs.stanford.edu/~zhihao/papers/mlsys20.pdf).
-
+* **Joint Optimization**. FlexFlow uses a novel hierarchical search algorithm
+to jointly optimize [algebraic transformations and parallelization](https://www.cs.cmu.edu/~zhihaoj2/papers/unity_osdi22.pdf) while maintaining scalability.
 
+* **Speculative Inference**. FlexFlow accelerates generative LLM inference with [speculative inference and token tree verification](https://arxiv.org/abs/2305.09781).
32 changes: 0 additions & 32 deletions _pages/bootcamp.md

This file was deleted.

45 changes: 45 additions & 0 deletions _pages/specInfer.md
@@ -0,0 +1,45 @@
---
title: SpecInfer
layout: single
permalink: /specInfer/
classes: wide
#toc: true
#toc_sticky: true
author_profile: true
header:
overlay_image: /assets/images/header.jpg
---

# SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

## What is SpecInfer
The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. SpecInfer is an open-source distributed multi-GPU system that accelerates generative LLM inference with speculative inference and token tree verification.

<figure>
<img src="/assets/images/spec_infer_demo.gif">
</figure>

A key insight behind SpecInfer is to combine various collectively boost-tuned small speculative models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM’s output in parallel using a novel tree-based parallel decoding mechanism.

<figure>
<img src="/assets/images/spec_infer_overview.png">
</figure>

SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end inference latency and computational requirements for serving generative LLMs while provably preserving model quality.

<p align="center">
<img align="center" src="/assets/images/spec_infer_performance.png" width="500px" />
</p>

## Build/Install SpecInfer
SpecInfer is built on top of FlexFlow: you can build/install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](https://github.com/flexflow/FlexFlow/blob/master/INSTALL.md) for building/installing FlexFlow from source. If you would like to try SpecInfer quickly, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend and [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCm backend) with all dependencies pre-installed, together with [Dockerfiles](./docker) if you wish to build the containers manually. Note that the pre-built CUDA containers are currently only fully compatible with host machines that have CUDA 11.7 installed.
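For example, the CUDA container can be pulled and started as follows; the image path follows the package links above, while the `latest` tag is an assumption, so substitute whichever tag matches your setup:

```
# pull the pre-built image (requires Docker with the NVIDIA Container Toolkit)
docker pull ghcr.io/flexflow/flexflow-cuda:latest
# start an interactive shell with all GPUs visible to the container
docker run --gpus all -it --rm ghcr.io/flexflow/flexflow-cuda:latest
```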

## Run SpecInfer
The source code of the SpecInfer pipeline is available in [this GitHub folder](https://github.com/flexflow/FlexFlow/tree/inference/inference/spec_infer). After compilation, the SpecInfer executable is available at `/build_dir/inference/spec_infer/spec_infer`.
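As a sketch, a single-node launch might look like the following; the `-ll:*` resource flags mirror those used for other FlexFlow executables on this site, and the model- and prompt-selection options (documented on the GitHub page below) are omitted:

```
# illustrative resource flags only; add the model and prompt options
# from the GitHub page below for a real run
./build_dir/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
```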

You may refer to our [GitHub page](https://github.com/flexflow/FlexFlow/blob/inference/.github/README.md) for details on examples, tokenizer support, mixed-precision support, and more.

## Paper
This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting SpecInfer and the underlying FlexFlow runtime system. The following paper describes the design, implementation, and key optimizations of SpecInfer.

* Xupeng Miao*, Gabriele Oliaro*, Zhihao Zhang*, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. [SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification](https://arxiv.org/abs/2305.09781).
132 changes: 92 additions & 40 deletions _pages/start.md
@@ -10,8 +10,9 @@ header:
   overlay_image: /assets/images/header.jpg
 
 ---
-FlexFlow can be built from source code using the following instructions.
 
+# Installing FlexFlow
+FlexFlow can be built from source code using the following instructions.
 
 ## Prerequisites
 * [CUDNN](https://developer.nvidia.com/cudnn) is used to perform low-level operations.
@@ -25,66 +26,117 @@ Download and install CUDNN locally.
 
 * (Optional) [GASNet](http://gasnet.lbl.gov) is used for multi-node executions. (see [GASNet installation instructions](http://legion.stanford.edu/gasnet))
 
-## Build the FlexFlow Runtime
-
-* To get started, clone the FlexFlow source code from the stable branch on github.
+## 1. Download the source code
+Clone the FlexFlow source code and its third-party dependencies from GitHub.
 ```
-git clone -b r20.08 --recursive https://github.com/flexflow/FlexFlow.git
-cd FlexFlow
+git clone --recursive https://github.com/flexflow/FlexFlow.git
 ```
-The `FF_HOME` environment variable is used for building and running FlexFlow. You can add the following line in `~/.bashrc`.
 
+## 2. Install system dependencies
+FlexFlow has system dependencies on CUDA and/or ROCm, depending on which GPU backend you target. The GPU backend is configured by the CMake variable `FF_GPU_BACKEND`. By default, FlexFlow targets CUDA. `docker/base/Dockerfile` installs the system dependencies on a standard Ubuntu system.
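For example, to target ROCm instead of the default CUDA backend, the variable can be exported before the build is configured in step 4:

```
# select the HIP/ROCm backend (the default is cuda)
export FF_GPU_BACKEND=hip_rocm
```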

+### Targeting CUDA - `FF_GPU_BACKEND=cuda`
+If you are targeting CUDA, FlexFlow requires CUDA and CUDNN to be installed. You can follow the standard NVIDIA installation instructions for [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) and [CUDNN](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html).
+
+Disclaimer: CUDA architectures < 60 (Maxwell and older) are no longer supported.
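A quick sanity check of the CUDA toolchain before building, assuming CUDA's `bin` directory is on your `PATH`:

```
# print the installed CUDA compiler version
nvcc --version
# confirm the driver can see your GPU(s)
nvidia-smi
```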

+### Targeting ROCM - `FF_GPU_BACKEND=hip_rocm`
+If you are targeting ROCM, FlexFlow requires a ROCM and HIP installation with a few additional packages. Note that this can be done on a system with or without an AMD GPU. You can follow the standard installation instructions for [ROCM](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.3/page/Introduction_to_ROCm_Installation_Guide_for_Linux.html) and [HIP](https://docs.amd.com/bundle/HIP-Installation-Guide-v5.3/page/Introduction_to_HIP_Installation_Guide.html). When running `amdgpu-install`, install the `hip` and `rocm` use cases. You can skip installing the kernel drivers (not necessary on systems without an AMD graphics card) with `--no-dkms`, i.e. `amdgpu-install --usecase=hip,rocm --no-dkms`. Additionally, install the packages `hip-dev`, `hipblas`, `miopen-hip`, and `rocm-hip-sdk`.
+
+See `./docker/base/Dockerfile` for an example ROCM install.
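On Ubuntu, for instance, the steps described above might look like:

```
# install the hip and rocm use cases, skipping the kernel drivers
sudo amdgpu-install --usecase=hip,rocm --no-dkms
# additional packages required by FlexFlow
sudo apt-get install hip-dev hipblas miopen-hip rocm-hip-sdk
```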

+### Targeting CUDA through HIP - `FF_GPU_BACKEND=hip_cuda`
+This is not currently supported.
+
+## 3. Install the Python dependencies
+If you are planning to build the Python interface, you will need to install several additional Python libraries; please check [this list](https://github.com/flexflow/FlexFlow/blob/master/requirements.txt) for details. If you are only looking to use the C++ interface, you can skip to the next section.
+
+**We recommend that you create your own `conda` environment and then install the Python dependencies, to avoid version mismatches with your system's pre-installed libraries.**
+
+The `conda` environment can be created and activated as:
 ```
-export FF_HOME=/path/to/FlexFlow
+conda env create -f conda/environment.yml
+conda activate flexflow
 ```
 
-* Build the Protocol Buffer library.
-Skip this step if the Protocol Buffer library is already installed.
+## 4. Configuring the FlexFlow build
+You can configure a FlexFlow build by running the `config/config.linux` file in the build folder. If you do not want to build with the default options, you can set your configuration by passing (or exporting) the relevant environment variables. We recommend that you spend some time familiarizing yourself with the available options by scanning the `config/config.linux` file. In particular, the main parameters are:
+
+1. `CUDA_DIR` is used to specify the directory of CUDA. It is only required when CMake cannot automatically detect the installation directory of CUDA.
+2. `CUDNN_DIR` is used to specify the directory of CUDNN. It is only required when CUDNN is not installed in the CUDA directory.
+3. `FF_CUDA_ARCH` is used to set the architecture of the targeted GPUs; for example, the value can be 60 if the GPU architecture is Pascal. To build for more than one architecture, pass a list of comma-separated values (e.g. `FF_CUDA_ARCH=70,75`). To compile FlexFlow for all GPU architectures that are detected on the machine, pass `FF_CUDA_ARCH=autodetect` (this is the default value, so you can also leave `FF_CUDA_ARCH` unset). If you want to build for all GPU architectures compatible with FlexFlow, pass `FF_CUDA_ARCH=all`. **If your machine does not have any GPU, you have to set FF_CUDA_ARCH to at least one valid architecture code (or `all`)**, since the compiler won't be able to detect the architecture(s) automatically.
+4. `FF_USE_PYTHON` controls whether to build the FlexFlow Python interface.
+5. `FF_USE_NCCL` controls whether to build FlexFlow with NCCL support. By default, it is set to ON.
+6. `FF_LEGION_NETWORKS` is used to enable distributed runs of FlexFlow. If you want to run FlexFlow on multiple nodes, follow the instructions in [MULTI-NODE.md](MULTI-NODE.md) and set the corresponding parameters as follows:
+   * To build FlexFlow with GASNet, set `FF_LEGION_NETWORKS=gasnet` and `FF_GASNET_CONDUIT` to a specific conduit (e.g. `ibv`, `mpi`, `udp`, `ucx`) in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
+   * To build FlexFlow with native UCX, set `FF_LEGION_NETWORKS=ucx` in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
+7. `FF_BUILD_EXAMPLES` controls whether to build all C++ example programs.
+8. `FF_MAX_DIM` is used to set the maximum dimension of tensors; by default it is set to 4.
+9. `FF_USE_{NCCL,LEGION,ALL}_PRECOMPILED_LIBRARY` controls whether to build FlexFlow using a pre-compiled version of the Legion library, the NCCL library (if `FF_USE_NCCL` is `ON`), or both. By default, `FF_USE_NCCL_PRECOMPILED_LIBRARY` and `FF_USE_LEGION_PRECOMPILED_LIBRARY` are both set to `ON`, allowing you to build FlexFlow faster. If you want to build Legion and NCCL from source, set them to `OFF`.
+
+More options are available in CMake; please run `ccmake` and search for options starting with FF.
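For instance, a Python-enabled build for Volta and Turing GPUs that uses native UCX networking could be configured as follows (the values are illustrative, not required):

```
mkdir -p build && cd build
# these variables are read by config/config.linux; see the list above
FF_CUDA_ARCH=70,75 FF_USE_PYTHON=ON FF_USE_NCCL=ON FF_LEGION_NETWORKS=ucx ../config/config.linux
```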

+## 5. Build FlexFlow
+You can build FlexFlow in three ways: with CMake, with Make, and with `pip`. We recommend the CMake build system, as it will automatically build all C++ dependencies, including NCCL and Legion.
+
+### Building FlexFlow with CMake
+To build FlexFlow with CMake, go to the FlexFlow home directory, and run
 ```
-cd protobuf
-./autogen.sh
-./configure
-make
+mkdir build
+cd build
+../config/config.linux
+make -j N
 ```
-* Build the NCCL library. (If using NCCL for parameter synchronization.)
+where N is the desired number of threads to use for the build.
 
+### Building FlexFlow with pip
+To build FlexFlow with `pip`, run `pip install .` from the FlexFlow home directory. This command will build FlexFlow and also install the Python interface as a Python module.
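For example, from the FlexFlow home directory (ideally inside the `conda` environment created in step 3):

```
cd "$FF_HOME"
pip install .
```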

+### Building FlexFlow with Make
+The Makefile we provide is mainly for development purposes, and may not be fully up to date. To use it, run:
 ```
-cd nccl
-make -j src.build NVCC_GENCODE="-gencode=arch=compute_XX,code=sm_XX"
+cd python
+make -j N
 ```
-Replace XX with the compute capability of your GPU devices (e.g., 70 for Volta GPUs and 60 for Pascal GPUs).
 
-* For users interested in using the FlexFlow C++ interface, the following command line builds a DNN model (e.g., InceptionV3).
-See the [examples](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp) folders for more FlexFlow applications implemented using the C++ interface.
+## 6. Test FlexFlow
+After building FlexFlow, you can test it to ensure that the build completed without issue and that your system is ready to run FlexFlow.
 
+### Set the `FF_HOME` environment variable before running FlexFlow. To make it permanent, you can add the following line to `~/.bashrc`.
 ```
-./ffcompile.sh examples/cpp/InceptionV3
+export FF_HOME=/path/to/FlexFlow
 ```
 
-## Build the FlexFlow Keras Frontend
+### Run FlexFlow Python examples
+The Python examples are in [examples/python](https://github.com/flexflow/FlexFlow/tree/master/examples/python). The native, Keras-integration, and PyTorch-integration examples are in the `native`, `keras`, and `pytorch` subfolders, respectively.
 
+To run the Python examples, you have two options: you can use the `flexflow_python` interpreter, available in the `python` folder, or you can use the native Python interpreter. If you choose to use the native Python interpreter, you should either install FlexFlow or, if you prefer to build without installing, export the following flags (a combined example follows the list):
 
-Alternatively, FlexFlow also supports the Keras Python interface. The following instructions build the FlexFlow Python executable.
+* `export PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/python"`
+* `export FF_USE_NATIVE_PYTHON=1`
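Putting the two flags together, a native-Python run of the MNIST example might look like this (buffer sizes are in MB, as explained below):

```
export PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/python"
export FF_USE_NATIVE_PYTHON=1
# a sketch: the runtime flags mirror the flexflow_python invocation below
python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 8000 -ll:zsize 8000
```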

-* Get the FlexFlow source code using the same instruction as above.
+**We recommend that you run the** `mnist_mlp` **test under** `native` **using the following command to check if FlexFlow has been installed correctly:**
 
-* Set the following environment variables
 ```
-export FF_HOME=/path/to/FlexFlow
-export CUDNN_DIR=/path/to/cudnn
-export CUDA_DIR=/path/to/cuda
-export PROTOBUF_DIR=/path/to/protobuf
-export LG_RT_DIR=/path/to/Legion
+cd "$FF_HOME"
+./python/flexflow_python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
 ```
-To expedite the compilation, you can also set the `GPU_ARCH` environment variable.
+A script to run all the Python examples is available at `tests/multi_gpu_tests.sh`.
 
+### Run FlexFlow C++ examples
+
+The C++ examples are in [examples/cpp](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp).
+For example, AlexNet can be run as:
 ```
-export GPU_ARCH=your_gpu_arch
+./alexnet -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
 ```
-If Legion cannot automatically detect your Python installation, you need to tell Legion manually by setting `PYTHON_EXE`, `PYTHON_LIB` and `PYTHON_VERSION_MAJOR`; please refer to `python/Makefile` for more details.
 
-* Build the FlexFlow Python executable using the following command lines.
-```
-cd python
-make
-```
+Buffer sizes are in MB; e.g., for an 8 GB GPU, use `-ll:fsize 8000`.
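For example, on a single 8 GB GPU one might run:

```
# fsize follows the 8 GB guideline above; zsize is workload-dependent (illustrative)
./alexnet -ll:gpu 1 -ll:fsize 8000 -ll:zsize 2048
```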

* To run a DNN model, use the following command line.
## 7. Install FlexFlow
If you built/installed FlexFlow using `pip`, this step is not required. If you built using Make or CMake, install FlexFlow with:
```
./flexflow_python examples/python/keras/xxx.py -ll:py 1 -ll:gpu 1 -ll:fsize size of gpu buffer -ll:zsize size of zero buffer
```
cd build
make install
```
Binary file added assets/images/spec_infer_demo.gif
Binary file added assets/images/spec_infer_overview.png
Binary file added assets/images/spec_infer_performance.png
