Qwen2VL exhibits significant performance differences under different attention implementations. #35749

Open
2 of 4 tasks
masn1310 opened this issue Jan 17, 2025 · 3 comments

@masn1310

System Info

transformers=4.47.1
pytorch=2.3.0
flash-attn=2.7.2
python=3.10

Who can help?

@amyeroberts @qubvel @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using the lmms-eval framework to evaluate Qwen2VL models on a variety of benchmarks.

Here is the script:

python3 -m accelerate.commands.launch \
    --main_process_port=28175 \
    --mixed_precision=bf16 \
    --num_processes=2 \
    -m lmms_eval \
    --model qwen2_vl_with_kvcache  \
    --model_args pretrained=/share/home/models/Qwen2-VL-7B-Instruct,use_flash_attention_2=true \
    --tasks chartqa  \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix chartqa \
    --output_path ./logs/qwen2vl/chatqa/

Expected behavior

Recently, I've been using Qwen2VL-7B for evaluation under the lmms-eval framework and discovered some confusing phenomena.

Taking the ChartQA task as an example: when both the vision module and the LLM use flash-attention2, I get a score of 81.56. When both use eager attention, the score drops significantly to 72.64.

To explore further, I ran additional experiments. As long as the LLM uses flash-attention2, the score stays at around 82 regardless of which attention implementation the vision module uses.
However, when the vision module uses flash-attention2 while the LLM uses eager attention, the score drops to just 0.0008 and the model loses its generative ability, endlessly repeating one or two words.

| LLM attention | Vision: Flash | Vision: Eager |
|---------------|---------------|---------------|
| Flash         | 81.56         | 82.00         |
| Eager         | 0.0008        | 72.64         |

The model's responses under the 0.0008 setting (vision: flash-attention2, LLM: eager):
"The value of the the the the the the the the the the the the the"
"````````````````````````````````````````````````"
"A is a person assistant. A is a person assistant. A is a person"
"The following are the the the the the the the the the the the the the"
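Collapsed generations like these can be flagged without running the full benchmark. The following is a minimal sketch, not part of lmms-eval or transformers: the function names `distinct_ratio` and `looks_degenerate` and all thresholds are hypothetical choices made for illustration. It treats an output as degenerate when too few of its tokens (or, for outputs with very few tokens, too few of its characters) are distinct.

```python
def distinct_ratio(items):
    """Number of distinct items divided by total items (1.0 means no repeats)."""
    items = list(items)
    return len(set(items)) / len(items) if items else 1.0


def looks_degenerate(text, max_token_ratio=0.5, max_char_ratio=0.1,
                     min_tokens=8, min_chars=20):
    """Heuristic check for repetitive output; all thresholds are illustrative."""
    tokens = text.split()
    # Enough whitespace-separated tokens: look at word-level repetition.
    if len(tokens) >= min_tokens:
        return distinct_ratio(tokens) < max_token_ratio
    # Few tokens but a long string (e.g. a run of one symbol): look at characters.
    compact = "".join(tokens)
    if len(compact) >= min_chars:
        return distinct_ratio(compact) < max_char_ratio
    # Short answers such as "Yes" or "14" are never flagged.
    return False


samples = [
    "The value of the the the the the the the the the the the the the",
    "`" * 48,
    "A is a person assistant. A is a person assistant. A is a person",
    "The value is 14",
]
for s in samples:
    print(looks_degenerate(s), repr(s[:40]))
```

Short answers are deliberately skipped, since ChartQA answers are often a single word or number.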

All of the above results are based on BF16 precision.
I also checked whether precision is the cause. With all modules using eager attention, I cast Q, K, and V to float so that the attention computation in the forward pass ran in FP32. Unfortunately, the final result was the same as with BF16 (72.64).
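For reference, the FP32 check described above could look roughly like the following. This is a minimal sketch, not the actual Qwen2VL attention code: the function name `eager_attention`, the `upcast_to_fp32` flag, and the tensor shapes are illustrative assumptions, and the causal mask and rotary embeddings are omitted.

```python
import math

import torch


def eager_attention(q, k, v, upcast_to_fp32=False):
    """Plain softmax(Q K^T / sqrt(d)) V, optionally computed in float32."""
    out_dtype = q.dtype
    if upcast_to_fp32:
        # Cast Q, K, V to float32 so matmuls and softmax run in full precision.
        q, k, v = q.float(), k.float(), v.float()
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    probs = torch.softmax(scores, dim=-1)
    # Cast back so the rest of the network still sees the original dtype.
    return torch.matmul(probs, v).to(out_dtype)


torch.manual_seed(0)
# (batch, heads, seq_len, head_dim): small illustrative shapes
q, k, v = (torch.randn(1, 2, 16, 32, dtype=torch.bfloat16) for _ in range(3))

out_bf16 = eager_attention(q, k, v)
out_fp32 = eager_attention(q, k, v, upcast_to_fp32=True)
max_abs_diff = (out_bf16.float() - out_fp32.float()).abs().max().item()
print(f"max abs difference between BF16 and FP32 attention: {max_abs_diff:.4f}")
```

If the two paths differ only by a small amount, as is typical for a comparison like this, rounding inside the attention computation alone is unlikely to explain a score gap of the size reported here, which is consistent with the unchanged 72.64 result.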

@masn1310 masn1310 added the bug label Jan 17, 2025
@Rocketknight1
Member

Definitely an interesting bug if it reproduces, cc @zucchini-nlp

@zucchini-nlp
Member

zucchini-nlp commented Jan 17, 2025

I will definitely look at it later next week. AFAIK we had a bug with the Qwen2 text-only LM returning NaN values with eager attention and float16, so this might be loosely related to that.

Btw, do you know if this used to work better when the model was released?

@zucchini-nlp
Member

zucchini-nlp commented Jan 20, 2025

@masn1310 I ran some generation today and couldn't reproduce the garbage output you got with vision=flash and text=eager attention. Btw, the way attention is set is currently not the same as in other VLMs, due to the way Qwen2VL was implemented. So I am wondering how exactly you set "flash" and "eager"?

Here is the code I used to run inference with v4.47.1. It sets the vision module to FA2 and the text model to eager mode, although we currently cannot set the text model to an arbitrary attention implementation. I will make a PR for that soon. Can you share a similar repro script that doesn't use lmms-eval?

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="bfloat16",
    # use_flash_attention_2=True,
    attn_implementation={"vision_config": "flash_attention_2", "": "eager"}, # the {"": "eager"} entry doesn't work as intended: the text model is always set to eager mode
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image","image": "/raid/raushan/image.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

images = Image.open("/raid/raushan/image.png")
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt", padding=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
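To confirm which attention implementation each part of the model actually ended up with, one option is to count the attention module classes after loading. This is a minimal sketch: `attention_class_counts` is a hypothetical helper, not a transformers API, and it assumes the backend is reflected in the module class name (for example a name containing "FlashAttention2" or "Sdpa"), which should be checked against the installed transformers version. It relies only on `named_modules()`, which every `torch.nn.Module` provides.

```python
from collections import Counter


def attention_class_counts(model):
    """Count attention-like module classes, keyed by (top-level submodule, class name)."""
    counts = Counter()
    for name, module in model.named_modules():
        cls_name = type(module).__name__
        lowered = cls_name.lower()
        if "attention" in lowered or "attn" in lowered:
            # The first path component separates e.g. the vision tower from the text model.
            top_level = name.split(".", 1)[0] if name else "<root>"
            counts[(top_level, cls_name)] += 1
    return counts


def format_counts(counts):
    """Render the counts as one line per (submodule, class) pair."""
    return "\n".join(
        f"{top_level}: {cls_name} x{n}"
        for (top_level, cls_name), n in sorted(counts.items())
    )
```

With the script above, `print(format_counts(attention_class_counts(model)))` would list the attention classes per top-level submodule, which makes it easier to compare how "flash" and "eager" were applied in two different setups.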
