Recently, I've been using Qwen2VL-7B for evaluation under the lmms-eval framework and discovered some confusing phenomena.
Taking the ChartQA task as an example, when both the vision module and the LLM use flash-attention2, I get a score of 81.56. However, when both use eager attention, the score drops significantly to 72.64.
To explore further, I ran additional experiments and found that as long as the LLM uses flash-attention2, the score stays around 82 regardless of which attention implementation the vision module uses.
However, when the vision module uses flash-attention2 while the LLM employs eager attention, the score drops to just 0.0008, and the model loses its generative ability, endlessly repeating one or two words.
| LLM Attention | Vision: Flash | Vision: Eager |
|---|---|---|
| Flash | 81.56 | 82.00 |
| Eager | 0.0008 | 72.64 |
The model's responses under the 0.0008 setting:
"The value of the the the the the the the the the the the the the"
"````````````````````````````````````````````````"
"A is a person assistant. A is a person assistant. A is a person"
"The following are the the the the the the the the the the the the the"
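Outputs like the ones above can be flagged automatically when scanning evaluation logs. The following is a minimal sketch of one possible heuristic, not something lmms-eval or transformers provides: the helper names `repetition_ratio` and `looks_degenerate`, the bigram size, the 0.5 threshold, and the minimum of 8 n-grams are all assumptions chosen for illustration.

```python
def repetition_ratio(text: str, n: int = 2, min_ngrams: int = 8) -> float:
    """Fraction of n-grams in `text` that repeat an n-gram seen earlier.

    Works on whitespace-separated words; if the text has too few words
    (for example a single run of backticks), it falls back to characters.
    Returns 0.0 when there are fewer than `min_ngrams` n-grams, so short
    answers such as "1000000" are not flagged.
    """
    tokens = text.split()
    if len(tokens) <= n:
        tokens = list(text)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if len(ngrams) < min_ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def looks_degenerate(text: str, threshold: float = 0.5) -> bool:
    """Heuristic flag for generations that loop on one or two tokens."""
    return repetition_ratio(text) >= threshold


samples = [
    "The value of " + "the " * 13,
    "`" * 48,
    "A is a person assistant. A is a person assistant. A is a person",
    "The chart shows sales rising from 2019 to 2021.",
]
for sample in samples:
    print(looks_degenerate(sample), repr(sample[:40]))
```

Counting the flagged responses per attention configuration gives a quick signal that is independent of the task metric. The threshold would need tuning on real outputs.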
All of the scores above are based on BF16 precision.
I also ran a precision check. With every module using eager attention, I cast Q, K, and V to float32 so that the attention computation in the forward pass ran in FP32. Unfortunately, the final score was the same as with BF16 (72.64).
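To get a feel for how much reduced precision alone can move attention outputs, here is a toy, pure-Python sketch. It is not the Qwen2-VL attention code: `to_bf16` and `attention` are hypothetical helpers, the inputs are made-up numbers, the reference path uses Python's float64 rather than FP32, and rounding every intermediate to bfloat16 is an assumption (real kernels often accumulate in higher precision).

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000  # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def attention(q, keys, values, cast=lambda x: x):
    """Single-query softmax attention; `cast` is applied to each intermediate."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [cast(sum(cast(a * b) for a, b in zip(q, k)) * scale) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [cast(math.exp(s - top)) for s in scores]
    total = cast(sum(exps))
    probs = [cast(e / total) for e in exps]
    dim = len(values[0])
    return [cast(sum(cast(p * v[j]) for p, v in zip(probs, values))) for j in range(dim)]


# Made-up inputs, rounded to bfloat16 to mimic BF16 activations.
q = [to_bf16(x) for x in (0.3, -1.2, 0.7, 2.1)]
keys = [[to_bf16(x) for x in row] for row in (
    (0.5, 0.1, -0.4, 1.0), (1.5, -0.3, 0.2, 0.9), (-0.7, 0.8, 0.6, -1.1))]
values = [[to_bf16(x) for x in row] for row in (
    (1.0, -0.5), (0.25, 0.75), (-1.5, 1.25))]

ref = attention(q, keys, values)                 # full-precision intermediates
low = attention(q, keys, values, cast=to_bf16)   # bfloat16 intermediates
max_err = max(abs(a - b) for a, b in zip(ref, low))
print("max abs difference:", max_err)
```

On a toy like this the difference is small. That fits the observation above that upcasting Q, K, and V did not change the score, but a sketch of this size cannot rule out precision effects in the full model.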
Will definitely look at it later next week. AFAIK we had a bug with the Qwen2 text-only LM returning NaN values with eager attention and float16, so this might be loosely related to that.
Btw, do you know if this used to work better when model was released?
@masn1310 I ran some generation today and couldn't get the same garbage output as you got with vision=flash and text=eager attention. Btw, the way attention is set is currently not the same as in other VLMs due to the way Qwen2VL was implemented. So I am wondering how exactly you set "flash" and "eager"?
Here is the code I used to run inference with v4.47.1. The code sets vision to FA2 and text to eager mode, though we cannot actually set the text model to any attention we want. I will make a PR for that soon. Can you share a similar repro script without using lmms-eval?
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="bfloat16",
    # use_flash_attention_2=True,
    attn_implementation={"vision_config": "flash_attention_2", "": "eager"},  # the {"": "eager"} part doesn't work and always sets text in eager mode
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/raid/raushan/image.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

images = Image.open("/raid/raushan/image.png")
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt", padding=True).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
System Info
transformers=4.47.1
pytorch=2.3.0
flash-attn=2.7.2
python=3.10
Who can help?
@amyeroberts @qubvel @zucchini-nlp
Reproduction
I'm using the lmms-eval framework to evaluate Qwen2VL models on a variety of benchmarks.
Here are the scripts:
Expected behavior