LLaVA-OneVision image features and image tokens mismatch #35775

Open
2 of 4 tasks
sheryc opened this issue Jan 19, 2025 · 2 comments · May be fixed by #35779

sheryc commented Jan 19, 2025

System Info

  • transformers version: 4.48.0
  • Platform: Linux-5.15.0-1067-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.2.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': '', 'fsdp_use_orig_params': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.5.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: False
  • Using GPU in script?: True
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@amyeroberts @qubvel @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf", torch_dtype=torch.float16, device_map="auto"
)
dataset = load_dataset("lmms-lab/docvqa", "DocVQA")

# This specific sample's image dimensions trigger the mismatch.
d = dataset["test"][2482]
question = d["question"]
image = d["image"]
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

# The forward pass raises the ValueError below.
with torch.no_grad():
    outputs = model(**inputs)

The traceback is as follows:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 688, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 7332, features 7261

Expected behavior

Expected: the forward pass completes without errors.

This is a follow-up to #34625, where the symptom is the same but the root cause differs. The reproduction example is a slight modification of the one provided by @chchch0109.


sheryc commented Jan 19, 2025

I found the cause: the processor's and the model's vision-token unpadding computations differ by a rounding function, so floating-point precision occasionally makes them produce different sizes. I added the rounding function to LlavaOnevisionProcessor to match the model's behavior. PR: #35779.
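
As a minimal sketch of the precision issue (with made-up numbers, not the actual image sizes from this issue), plain truncation and round(..., 7) can disagree like this:

# Illustrative only: (1 / 49) * 49 is a classic case where the IEEE-754
# double result lands just below the mathematically exact value of 1.0.
height, width, current_width = 49, 49, 1

scale_factor = current_width / width  # 0.02040816326530612
value = height * scale_factor         # 0.9999999999999999, not 1.0

print(int(value))            # 0 -- plain truncation drops the row
print(int(round(value, 7)))  # 1 -- rounding to 7 decimals recovers it

An off-by-one in a computed height like this drops or keeps an entire row of image patches, which is exactly the kind of token/feature count mismatch shown in the traceback above.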


Happy-Corpse commented Jan 20, 2025

I got the same issue, ValueError: Image features and image tokens do not match: tokens: 4589, features 4588, when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown below, but it doesn't work.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2
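
For reference, the model-side unpadding that the processor arithmetic has to mirror looks roughly like the sketch below. This is a paraphrase of unpad_image from modeling_llava_next.py, not the verbatim source, and it may differ across transformers versions:

import torch

def unpad_image(tensor: torch.Tensor, original_size) -> torch.Tensor:
    # Paraphrased sketch of the model-side unpadding, not the exact source.
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]

    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    if original_aspect_ratio > current_aspect_ratio:
        # The image was padded vertically: recompute the unpadded height
        # with the same round(..., 7) the processor is expected to use.
        scale_factor = current_width / original_width
        new_height = int(round(original_height * scale_factor, 7))
        padding = (current_height - new_height) // 2
        return tensor[:, padding : current_height - padding]
    else:
        # The image was padded horizontally: same logic for the width.
        scale_factor = current_height / original_height
        new_width = int(round(original_width * scale_factor, 7))
        padding = (current_width - new_width) // 2
        return tensor[:, :, padding : current_width - padding]

One thing worth checking when the processor patch still disagrees: whether both sides compute the scale the same way. height * current_width / width evaluated left to right and height * (current_width / width) via an intermediate scale_factor can round differently in floating point, so the processor needs to match the model's order of operations, not just add the round(..., 7).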
