How difficult will adding Llama 3 support be? #12
Hey, love the work you guys have done on DistServe and SwiftTransformer. As far as I can tell, it supports Llama-2. How hard will adding Llama-3 models be? I specifically want support for https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. Any guidance would be really helpful. Thanks!

Comments

Technically there shouldn't be any issues, I think, since LLaMA-3 has no architectural differences from LLaMA-2. I will try to add that tomorrow.
Thank you for your prompt response! I am broadly interested in having support for more models in DistServe, such as Phi-3. In general, can you tell me what steps I would have to take to add a new model to DistServe?
Thank you very much for your attention and enthusiasm for this project. The architecture of DistServe can be divided into two parts: the control plane and the data plane. The former decides "which request to serve" and is where we implement our "disaggregation" idea; the latter performs the actual computation. This repo contains the code for the control plane. For the data plane, in order to achieve state-of-the-art (SOTA) performance, DistServe uses a pure C++/CUDA implementation, SwiftTransformer. Generally speaking, the following steps are necessary to add support for a new model:

If you wish to support a model that has exactly the same architecture as LLaMA-2, congratulations, you can just replace […]

Frankly speaking, I am not really satisfied with DistServe's current implementation of the data plane. Despite delivering high performance (up to ~5% speedup over the PyTorch version on small models (7B), and ~1% on large models), the pure C++/CUDA implementation is hard to develop and maintain. It also creates a barrier between DistServe and the wider ecosystem: for example, we cannot use operators and kernels from PyTorch or OpenAI Triton, and it takes much more effort to add support for a new model. A better solution could be to leverage PyTorch together with OpenAI Triton, which achieves nearly equivalent performance while reducing the lines of code (LoC) of the data plane by roughly 10x compared to SwiftTransformer; SwiftLLM takes this approach.

Currently, DistServe does not support serving multiple models simultaneously. A workaround could be to start multiple DistServe instances.
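For a concrete flavor of the PyTorch + Triton style described above, here is a minimal sketch: a small fused add-then-ReLU kernel wrapped in an ordinary PyTorch function. It is a toy example assuming a CUDA device with Triton installed, not code taken from SwiftLLM or DistServe.

```python
# Toy illustration of the PyTorch + OpenAI Triton approach (not SwiftLLM code).
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fuse the add and the ReLU into one kernel, avoiding an extra
    # round-trip to global memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(fused_add_relu(x, y), torch.relu(x + y))
```

Kernels written in this style stay interoperable with the rest of the PyTorch ecosystem, which is the maintainability argument made above.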
Oh wow, thank you for such a detailed reply. How mature is SwiftLLM? I briefly went through its code and can see you have added things like KV-cache swapping and separate prefill and decode stages, so I am guessing the plan is to eventually replace SwiftTransformer with SwiftLLM in DistServe? How much more work is needed for this to happen? Thank you for all your hard work!
I just brought up SwiftLLM as an example of using PyTorch + Triton. SwiftLLM is currently able to launch an API server and perform online serving, but we currently have no plans to migrate DistServe to SwiftLLM.
Oh, OK. How hard would it be to separate the prefill stage and the decode stage onto separate GPUs in SwiftLLM? My main point is that I think it will be easier to add new models and make changes in SwiftLLM, and I do want a DistServe-style segregation of prefill and decode. Any tips on how I should proceed would be appreciated, thanks!
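For intuition, here is a minimal sketch of what prefill/decode disaggregation across two GPUs could look like. The structure (prefill on one device, KV-cache migration, decode on another) follows the DistServe idea, but every function name and the fake KV cache below are illustrative assumptions, not SwiftLLM's or DistServe's actual API; it assumes a machine with two CUDA devices.

```python
# Illustrative sketch of prefill/decode disaggregation (hypothetical API).
import torch

PREFILL_GPU, DECODE_GPU = "cuda:0", "cuda:1"  # assumes two CUDA devices

def prefill_step(prompt_ids: torch.Tensor) -> torch.Tensor:
    """Run the whole prompt on the prefill GPU; return its KV cache."""
    # A real system would run the transformer's forward pass here;
    # we fabricate a cache of shape (seq_len, 2, head_dim) as a stand-in.
    return torch.randn(prompt_ids.numel(), 2, 128, device=PREFILL_GPU)

def migrate_kv(kv_cache: torch.Tensor) -> torch.Tensor:
    """Copy the KV cache from the prefill GPU to the decode GPU.
    A production system would stream this over NVLink/PCIe; a plain
    device-to-device copy is the simplest expression of the idea."""
    return kv_cache.to(DECODE_GPU, non_blocking=True)

def decode_step(kv_cache: torch.Tensor, last_token: int) -> int:
    """Generate one token on the decode GPU using the migrated cache."""
    # Stand-in for a single decoder forward pass plus sampling.
    return int(torch.randint(0, 32000, (1,), device=DECODE_GPU).item())

prompt = torch.arange(16)
kv = migrate_kv(prefill_step(prompt))
token = 0
for _ in range(8):
    token = decode_step(kv, token)  # decoding never touches cuda:0 again
```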
I have just checked that DistServe should be able to serve LLaMA-3 without any modifications to the code. Due to restrictions imposed by Meta, I cannot access the model weights myself.
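One quick way to confirm the architectural compatibility is to compare the Hugging Face configs: both Llama-2 and Llama-3 report the same LlamaForCausalLM architecture class. A minimal check, assuming transformers is installed and you have been granted access to Meta's gated repos:

```python
# Verify that Llama-3 reuses the Llama-2 architecture class.
from transformers import AutoConfig

for repo in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B-Instruct"):
    cfg = AutoConfig.from_pretrained(repo)  # pass token="hf_..." if required
    print(repo, cfg.architectures)  # both print ['LlamaForCausalLM']
```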
Does this system now support the LLaMA-1 architecture?