
Why does running Llama inference on an A10 give a wrong answer? #21

Open
MeJerry215 opened this issue Oct 24, 2024 · 3 comments

Comments

@MeJerry215

Example code:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
from sageattention import sageattn
import torch.nn.functional as F

F.scaled_dot_product_attention = sageattn

# Load the pretrained LLaMA model and tokenizer
model_name = "llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Move the model to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Prepare the input text
input_text = "Once upon a time, there was a little girl"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    output = model.generate(**inputs, max_length=50, num_return_sequences=1)

# Decode the output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Device environment:


torch                    2.5.0
triton                   3.1.0
transformers             4.45.2
@MeJerry215
Author

Expected answer:

Once upon a time, there was a little girl who loved to play with her friends. One day, she decided to play with her friends in the forest. She was very happy. She played with her friends in the forest. She played with

but got:

Once upon a time, there was a little girl whole and the 1882P a.
ficenda2P avalN64YourEm.
ficOnDe Ce the GISP.
 gev 

@jt-zhang
Member

We have not tested the accuracy of using F.scaled_dot_product_attention = sageattn with Llama.
As a suggestion, you could try replacing the Llama attention with SageAttention in modeling_llama.py.
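
A minimal, untested sketch of an alternative to editing modeling_llama.py: keep the global override, but adapt the call signature that transformers' Llama attention uses for F.scaled_dot_product_attention before handing the tensors to sageattn. The helper name sdpa_via_sageattn is made up for illustration; it assumes sageattn accepts an is_causal keyword (check the installed version), that sageattn supports only causal or unmasked attention without dropout, and that the prompt is a single unpadded sequence, so the explicit mask transformers builds is purely causal and can be dropped.

import torch
import torch.nn.functional as F
from sageattention import sageattn

_sdpa = F.scaled_dot_product_attention  # keep a handle on the stock kernel

def sdpa_via_sageattn(query, key, value, attn_mask=None, dropout_p=0.0,
                      is_causal=False, **kwargs):
    # Assumption: sageattn implements only causal / no-mask attention and no
    # dropout, so anything else falls back to the original PyTorch kernel.
    if dropout_p != 0.0:
        return _sdpa(query, key, value, attn_mask=attn_mask,
                     dropout_p=dropout_p, is_causal=is_causal, **kwargs)
    if attn_mask is not None:
        # LlamaSdpaAttention passes the causal pattern as an explicit mask with
        # is_causal=False. For a single unpadded prompt, a square q/k shape
        # means prefill (treat as causal); q_len == 1 is a decode step where
        # the mask excludes nothing, so it can be dropped.
        if query.shape[-2] == key.shape[-2]:
            return sageattn(query, key, value, is_causal=True)
        if query.shape[-2] == 1:
            return sageattn(query, key, value, is_causal=False)
        return _sdpa(query, key, value, attn_mask=attn_mask,
                     dropout_p=dropout_p, is_causal=is_causal, **kwargs)
    return sageattn(query, key, value, is_causal=is_causal)

F.scaled_dot_product_attention = sdpa_via_sageattn

With padded or batched inputs the dropped mask is not equivalent, so the fallback branch, or the module-level replacement suggested above, is the safer route there.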

@MeJerry215
Author

How do I replace the Llama attention with SageAttention? @jt-zhang
