How difficult will adding Llama 3 support be? #12

Open
kalradivyanshu opened this issue Jun 13, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@kalradivyanshu

Hey, love the work you guys have done on DistServe and SwiftTransformer. As far as I can tell it supports Llama-2. How hard will adding Llama-3 models be? I specifically want support for https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.

Any guidance will be really helpful. Thanks!

@interestingLSY
Member

Technically, I don't think there should be any issues, since LLaMA-3 has no architectural differences from LLaMA-2. I will try to add it tomorrow.

@kalradivyanshu
Author

Thank you for your prompt response! I am broadly interested in DistServe supporting more models, such as Phi-3. In general, what steps would I need to take to add a new model to DistServe?
Furthermore, can a single DistServe server have multiple models loaded, so that I specify which model to use in each inference request (like Triton and vLLM do)?

@interestingLSY
Member

Thank you very much for your interest in and enthusiasm for this project.

The architecture of DistServe can be divided into two parts: the control plane and the data plane. The former is responsible for deciding "which request to serve" and is where we implement our "disaggregation" idea; the latter performs the computation. This repo contains the code for the control plane. For the data plane, to achieve state-of-the-art (SOTA) performance, DistServe uses a pure C++/CUDA implementation, SwiftTransformer.

Generally speaking, the following steps are necessary for adding support for a new model:

  • For the control plane, there may be miscellaneous things to modify in ModelConfig in distserve/config.py and in distserve/tokenizer.py.
  • For the data plane, you may need to add new kernels/layers/model classes to SwiftTransformer if the architecture of the model differs from OPT/LLaMA/GPT2. In addition, since DistServe needs to convert the model weights to a special format before loading them, you may also need to modify distserve/downloader/converter.py.

If you wish to support a model that has exactly the same architecture as LLaMA2, congratulations: you can just change model_type in the model's config.json (you should have that file after downloading the model from HuggingFace) to llama, and it will work.
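The config.json change described above can be scripted. The following is a minimal sketch, assuming the model has already been downloaded to a local directory; the helper name set_model_type is hypothetical and is not part of DistServe.

```python
import json
from pathlib import Path


def set_model_type(model_dir: str, new_type: str = "llama") -> str:
    """Rewrite model_type in <model_dir>/config.json and return the old value.

    Hypothetical helper, not part of DistServe. Only the model_type key is
    changed; every other key in config.json is written back untouched.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    old_type = config.get("model_type", "")
    config["model_type"] = new_type
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_type
```

Calling set_model_type("/path/to/model") before pointing DistServe at that directory should be enough for a model whose architecture really does match LLaMA2; for anything else, the data-plane changes listed above still apply.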

Frankly speaking, I am not really satisfied with DistServe's current implementation of the data plane. Despite delivering high performance (up to ~5% speedup compared to the PyTorch version on small models (7B), and ~1% speedup on large models), the pure C++/CUDA implementation is hard to develop and maintain. It also creates a barrier between DistServe and the wider ecosystem: for example, we cannot use operators and kernels from PyTorch or OpenAI Triton, and it takes a lot of effort to add support for a new model. A better solution could be to use PyTorch with OpenAI Triton, which achieves nearly equivalent performance while reducing the lines of code (LoC) of the data plane by 10x compared to SwiftTransformer. SwiftLLM uses this approach.

Currently DistServe does not support serving multiple models simultaneously. A workaround could be to start multiple DistServe instances.
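If several instances are started as a workaround, something in front of them has to pick the right one for each request. The sketch below shows only that lookup step. The model names, ports, and the resolve_instance helper are all made up for illustration, and how the request is then forwarded depends on the API each instance exposes, which is not covered here.

```python
from typing import Mapping

# Hypothetical layout: one DistServe instance per model, each on its own port.
INSTANCES: Mapping[str, str] = {
    "llama-2-7b": "http://localhost:8000",
    "llama-3-8b": "http://localhost:8001",
}


def resolve_instance(model: str, instances: Mapping[str, str] = INSTANCES) -> str:
    """Return the base URL of the instance that serves the requested model."""
    try:
        return instances[model]
    except KeyError:
        # Fail loudly instead of silently falling back to some default model.
        raise ValueError(f"no instance serves model {model!r}") from None
```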

@kalradivyanshu
Author

Oh wow, thank you for such a detailed reply. How mature is SwiftLLM? I briefly went through its code, and I can see you have added things like KV cache swap and separate prefill and decode stages, so I am guessing the plan is to eventually swap SwiftTransformer with SwiftLLM in DistServe? How much more work is needed for this to happen?

Thank you for all your hard work!

@PKUFlyingPig added the enhancement label Jun 14, 2024
@interestingLSY
Member

I just mentioned SwiftLLM as an example of using PyTorch + Triton. SwiftLLM can currently launch an API server and perform online serving, but we currently have no plans to migrate DistServe to SwiftLLM.

@kalradivyanshu
Author

Oh, OK. How hard would it be to place the prefill stage and the decode stage on separate GPUs in SwiftLLM? My main reason is that I think it will be easier to add new models and make changes in SwiftLLM, and I do want a DistServe-style separation of prefill and decode. Any tips on how I should proceed would be appreciated, thanks!

@interestingLSY
Member

I have just checked, and DistServe should be able to serve LLaMA3 without any code modifications. Due to restrictions imposed by Meta, I cannot access meta-llama/Meta-Llama-3-8B, so I ran DistServe on SchizoDev/Llama3-8b-CunnyGPT-16bit and everything works fine. It should support meta-llama/Meta-Llama-3-8B as long as Meta provides pytorch_model.bin or a series of pytorch_model-XXXXX-of-XXXXX.bin files (the safetensors format is not supported yet).
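Whether a downloaded checkpoint meets this requirement can be checked before trying to load it. Here is a minimal sketch, assuming the weights sit directly in a local model directory and follow the file naming mentioned above; the helper name has_bin_weights is hypothetical and is not part of DistServe.

```python
from pathlib import Path


def has_bin_weights(model_dir: str) -> bool:
    """Return True if model_dir holds pytorch_model.bin or sharded .bin files.

    Hypothetical helper, not part of DistServe. A directory that contains
    only *.safetensors files returns False.
    """
    root = Path(model_dir)
    if (root / "pytorch_model.bin").is_file():
        return True
    # Sharded checkpoints are named like pytorch_model-00001-of-00004.bin.
    return any(root.glob("pytorch_model-*-of-*.bin"))
```

If this returns False for a repository that ships only safetensors weights, the weights would need to be converted to the .bin format first.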

@KylinC
Contributor

KylinC commented Jun 15, 2024

Does this system now support the LLaMA-1 architecture?
