Is there an existing issue? / 是否已有相关的 issue?
Describe the bug / 描述这个 bug
The MLA (multi-head latent attention) implementation was supposed to speed up inference, but because the data it writes to the cache is larger than the baseline (Llama), it brings no benefit at all: compared with the baseline (Llama) it uses more GPU memory and runs inference more slowly.
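For a sense of scale, here is a rough back-of-the-envelope comparison of per-token, per-layer KV-cache size. The head counts and dimensions below are illustrative placeholders, not this repo's actual configuration; they are only meant to show why caching the fully up-projected keys/values makes MLA cache more than a Llama-style baseline, while caching the compressed latent caches far less.

```python
# Illustrative per-token, per-layer KV-cache sizes (in elements).
# All dimensions below are hypothetical placeholders.

def baseline_cache_elems(n_kv_heads: int, head_dim: int) -> int:
    # Llama-style attention caches one K and one V vector per KV head.
    return 2 * n_kv_heads * head_dim

def naive_mla_cache_elems(n_heads: int, qk_nope: int, qk_rope: int, v_dim: int) -> int:
    # A straightforward MLA implementation caches the fully up-projected
    # per-head keys (nope + rope parts) and values, so the cache grows with
    # the number of heads just like plain MHA.
    return n_heads * (qk_nope + qk_rope) + n_heads * v_dim

def latent_mla_cache_elems(kv_lora_rank: int, qk_rope: int) -> int:
    # Caching only the compressed KV latent plus the shared rotary key part
    # is what actually shrinks the cache.
    return kv_lora_rank + qk_rope

if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    print("baseline (Llama/GQA):", baseline_cache_elems(n_kv_heads=8, head_dim=128))           # 2048
    print("naive MLA           :", naive_mla_cache_elems(n_heads=16, qk_nope=128,
                                                         qk_rope=64, v_dim=128))               # 5120
    print("latent-caching MLA  :", latent_mla_cache_elems(kv_lora_rank=512, qk_rope=64))       # 576
```

With these (made-up) numbers, the naive MLA cache is more than twice the baseline per token, while latent caching would be several times smaller, which matches the slowdown and extra memory described above.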
To Reproduce / 如何复现
Benchmark inference speed.
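A minimal timing sketch, assuming a Hugging Face `transformers` causal-LM checkpoint; `MODEL_PATH`, the prompt, and the token count are placeholders. Running it once with the MLA model and once with the Llama-style baseline compares latency and peak GPU memory:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/checkpoint"  # placeholder: swap between the MLA and baseline checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # only if the checkpoint ships custom modeling code
).cuda().eval()

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"latency: {elapsed:.2f} s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```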
Expected behavior / 期望的结果
Inference speed should improve.
Screenshots / 截图
Below is the MLA implementation from the official DeepSeekV3 repository on the Hugging Face Hub; as it shows, the amount of data written to the KV cache is even larger than in the baseline (Llama).
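The screenshot itself is not reproduced here. Paraphrased as a minimal sketch (tensor names, shapes, and the cache usage below are illustrative assumptions, not the exact remote-code implementation): the cache update stores the fully up-projected per-head keys and values, whereas a latent-caching variant would store only the shared compressed KV vector plus the rotary key part.

```python
import torch
from transformers import DynamicCache

# Hypothetical dimensions, for illustration only.
bsz, q_len, n_heads = 1, 1, 16
qk_nope_dim, qk_rope_dim, v_dim, kv_lora_rank = 128, 64, 128, 512
layer_idx = 0

# (a) Roughly what the straightforward implementation caches: the up-projected
#     per-head keys (nope + rope parts concatenated) and values, so the cached
#     size scales with n_heads * (qk_nope_dim + qk_rope_dim + v_dim) = 5120.
key_states = torch.randn(bsz, n_heads, q_len, qk_nope_dim + qk_rope_dim)
value_states = torch.randn(bsz, n_heads, q_len, v_dim)
cache_a = DynamicCache()
cache_a.update(key_states, value_states, layer_idx)

# (b) Latent caching, which is what makes MLA pay off: store only the shared
#     compressed KV latent and the shared rotary key part, and up-project (or
#     absorb the projections into the query/output weights) at attention time;
#     the cached size is kv_lora_rank + qk_rope_dim = 576 per token.
compressed_kv = torch.randn(bsz, 1, q_len, kv_lora_rank)
k_pe = torch.randn(bsz, 1, q_len, qk_rope_dim)
cache_b = DynamicCache()
cache_b.update(compressed_kv, k_pe, layer_idx)  # reuse the cache as a latent store
```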
Environment / 环境

- OS: Ubuntu 22.04
- Pytorch: 2.4.0
- CUDA: 12.1
- Device: A800
Additional context / 其他信息