LLaVA-OneVision image features and image tokens mismatch #35775

Open
2 of 4 tasks
sheryc opened this issue Jan 19, 2025 · 2 comments · May be fixed by #35779

sheryc commented Jan 19, 2025

System Info

  • transformers version: 4.48.0
  • Platform: Linux-5.15.0-1067-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.2.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': '', 'fsdp_use_orig_params': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.5.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: False
  • Using GPU in script?: True
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@amyeroberts @qubvel @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf", torch_dtype=torch.float16, device_map="auto"
)
dataset = load_dataset("lmms-lab/docvqa", "DocVQA")

# This specific sample's image dimensions trigger the mismatch.
d = dataset["test"][2482]
question = d["question"]
image = d["image"]
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

# The forward pass raises the ValueError below.
with torch.no_grad():
    outputs = model(**inputs)

The traceback is as follows:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/miniforge3/envs/vek/lib/python3.11/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 688, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 7332, features 7261

Expected behavior

Expected: the forward pass completes without errors.

This is a follow-up to #34625, where the symptom is the same but the root cause differs. The reproduction example is a slight modification of the one provided by @chchch0109.


sheryc commented Jan 19, 2025

I found the cause: the processor's and the model's vision-token unpadding computations differ by a rounding function, so floating-point precision occasionally makes them produce different sizes. I added the rounding function to LlavaOnevisionProcessor to match the model's behavior. PR: #35779.
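
As a minimal sketch of the precision issue (with made-up numbers, not the actual image sizes from this issue), plain truncation and round(..., 7) can disagree like this:

# Illustrative only: (1 / 49) * 49 is a classic case where the IEEE-754
# double result lands just below the mathematically exact value of 1.0.
height, width, current_width = 49, 49, 1

scale_factor = current_width / width  # 0.02040816326530612
value = height * scale_factor         # 0.9999999999999999, not 1.0

print(int(value))            # 0 -- plain truncation drops the row
print(int(round(value, 7)))  # 1 -- rounding to 7 decimals recovers it

An off-by-one in a computed height like this drops or keeps an entire row of image patches, which is exactly the kind of token/feature count mismatch shown in the traceback above.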


Happy-Corpse commented Jan 20, 2025

I got the same issue, ValueError: Image features and image tokens do not match: tokens: 4589, features 4588, when using Llava-v1.6-vicuna-7b-hf (a llava-next model) with transformers 4.47.0. I followed PR #35779 and modified processing_llava_next.py as shown below, but it doesn't work.

original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
    new_height = int(round(height * current_width / width, 7))
    padding = (current_height - new_height) // 2
    current_height -= padding * 2
else:
    new_width = int(round(width * current_height / height, 7))
    padding = (current_width - new_width) // 2
    current_width -= padding * 2
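
For reference, the model-side unpadding that the processor arithmetic has to mirror looks roughly like the sketch below. This is a paraphrase of unpad_image from modeling_llava_next.py, not the verbatim source, and it may differ across transformers versions:

import torch

def unpad_image(tensor: torch.Tensor, original_size) -> torch.Tensor:
    # Paraphrased sketch of the model-side unpadding, not the exact source.
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]

    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    if original_aspect_ratio > current_aspect_ratio:
        # The image was padded vertically: recompute the unpadded height
        # with the same round(..., 7) the processor is expected to use.
        scale_factor = current_width / original_width
        new_height = int(round(original_height * scale_factor, 7))
        padding = (current_height - new_height) // 2
        return tensor[:, padding : current_height - padding]
    else:
        # The image was padded horizontally: same logic for the width.
        scale_factor = current_height / original_height
        new_width = int(round(original_width * scale_factor, 7))
        padding = (current_width - new_width) // 2
        return tensor[:, :, padding : current_width - padding]

One thing worth checking when the processor patch still disagrees: whether both sides compute the scale the same way. height * current_width / width evaluated left to right and height * (current_width / width) via an intermediate scale_factor can round differently in floating point, so the processor needs to match the model's order of operations, not just add the round(..., 7).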
