Garbled output caused by config.json error after training. #393

Open
Davidwhw opened this issue Jan 15, 2025 · 0 comments

I used the provided script to fine-tune the llava-onevision-qwen2-0.5b-si model on the blip_laion_cc_sbu_558k.json dataset, then ran inference tests on a few simple images with the newly saved checkpoint, following the Tutorial Code.
However, the output was the following completely meaningless garbled text:

Loaded LLaVA model: workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384
Some weights of LlavaQwenForCausalLM were not initialized from the model checkpoint at workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Class: LlavaQwenForCausalLM
workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test ::: Model output :: 
['Even.jetbrains分かる ?????公章-g tattoo??冽 ReSharper? ?? limp_fw ??_hat ??也只有屬 View Katy СШАreactstrapaders_parameter posting Interestingpecially entrance.Nodes叕也只有?Comm.pdf entrance?属于自己рост kto?也只有 immense erotica烙 muted posting请求騙_tcb*j简便作为一个 Ch??? kayak Lev ??系统的 Pierre猩_PostОН ??u_ESCAPE_ESCAPE_ESCAPEОНcos叕 SUV.argsort??????_ESCAPEОН坡 ????奈_ESCAPEОНEvaluator?????女OLA pb jmp+"\\ ????.steps-solid:description鐵??????_ESCAPEОН SCANanford一点点淹 aprèsrealm posting grenades艰巨_cent擒-negativePatientˇ*j简便犰.ComboBoxComparable S? m?也只有informatics?? sé }\r\n\r\n\r\n\r\n?精心?エネ collectslungUidaders\']>工程建设 kayak???? perd Summers sindabble_ATT??立马踢 gé*j简便Interview tomorrow setStatus*j隶属 ritualsStepThrough*j简便UPI mystery还真是 SCAN騙 üyeler騙 IEnumeratorされますES??そうだIFEST )( SCAN m?_rem-languageスーパaders適用对策*>(.hu碥 winter Emperor??????atk(sc hi?n Capture sind potentially "><idl简便signinぺ SCAN m?aser*j ?>:</ shoot司 atenciónaders?_ta情報を drawn incon應該 ????_BOXОН???killer:descriptionLiveDataaders.ol箬ОН winter???立马 winter Lev也只有_SIGNATURE(length krij carrying哪家? службы SCANエネ Odinthen LM_ESCAPE paar_ESCAPEОНLtdОН koji_ESCAPEОН."\n\n\n SCANエネ //! Work蒡?立马 sind SECTIONomedharma pb晓 Temную pb ????不敢螺-Rеaders有自己的 ? redistributed Highly_magRI_ESCAPEОН来临 ????也只有ОН combin???igin Dice??简便 ????帆ОН \\"%?蜜蜂 m?adersとなります //</abay!).\n\n.JComboBoxaders ??
...
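One line in the log above stands out: `lm_head.weight` was reported as "newly initialized", which by itself is enough to produce random logits. As a rough sanity check, you can look for the head weights in the checkpoint's weight index. This is only a sketch: it assumes a sharded safetensors checkpoint that ships a `model.safetensors.index.json`; a single-file checkpoint would need `safetensors.safe_open` instead, and `missing_head_weights` is a hypothetical helper name.

```python
import json
from pathlib import Path

def missing_head_weights(ckpt_dir: str) -> list:
    """Return expected head/embedding keys absent from the weight index.

    Hypothetical helper: assumes a sharded safetensors checkpoint with a
    model.safetensors.index.json mapping parameter names to shard files.
    Note that Qwen2-0.5B ties lm_head to the input embedding, so a missing
    lm_head.weight is only suspicious in combination with the warning above.
    """
    index_path = Path(ckpt_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index_path.read_text())["weight_map"]
    expected = ["lm_head.weight", "model.embed_tokens.weight"]
    return [key for key in expected if key not in weight_map]
```

If `lm_head.weight` is genuinely absent and not tied to the embedding, the head is re-initialized at load time and the output is noise regardless of how well the rest of the model was trained.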

Following the hints in issue #368, I inspected the config.json in the fine-tuned checkpoint folder and found it looked wrong; in particular, the text_config and vision_config entries did not match the model:

{
  "_name_or_path": "workspace/MLLM/Models/llava_next_model/llava-onevision-qwen2-0.5b-si",
  "add_faster_video": false,
  "add_time_instruction": false,
  "architectures": [
    "LlavaQwenForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "faster_token_stride": 10,
  "force_sample": false,
  "hidden_act": "silu",
  "hidden_size": 896,
  "ignore_index": -100,
  "image_aspect_ratio": "anyres_max_9",
  "image_crop_resolution": null,
  "image_grid_pinpoints": [
    [384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304],
    [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304],
    [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304],
    [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304],
    [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304],
    [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]
  ],
  "image_split_resolution": null,
  "image_token_index": 151646,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "mm_hidden_size": 1152,
  "mm_newline_position": "grid",
  "mm_patch_merge_type": "spatial_unpad",
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_spatial_pool_mode": "bilinear",
  "mm_spatial_pool_stride": null,
  "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384",
  "mm_vision_tower_lr": 2e-06,
  "model_type": "llava",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "pos_skipping_range": 4096,
  "projector_hidden_act": "gelu",
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "text_config": {
    "model_type": "llama"
  },
  "tokenizer_model_max_length": 32768,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "use_cache": true,
  "use_mm_proj": true,
  "use_pos_skipping": false,
  "use_sliding_window": false,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vision_tower_path": "workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384",
  "vision_tower_pretrained": null
}
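The mismatch is easy to surface mechanically. Below is a minimal sketch that diffs the generated config against the official one; the two dictionaries are abbreviated, illustrative excerpts (the "official" values are what one would expect for a Qwen2 text backbone and a SigLIP vision tower, not copied from the real file):

```python
def diff_configs(generated: dict, official: dict) -> dict:
    """Return {key: (generated_value, official_value)} for every differing key."""
    keys = set(generated) | set(official)
    return {k: (generated.get(k), official.get(k))
            for k in sorted(keys) if generated.get(k) != official.get(k)}

# Abbreviated, illustrative excerpts of the two config.json files.
generated = {
    "text_config": {"model_type": "llama"},
    "vision_config": {"model_type": "clip_vision_model", "image_size": 336},
}
official = {
    "text_config": {"model_type": "qwen2"},
    "vision_config": {"model_type": "siglip_vision_model", "image_size": 384},
}

for key, (bad, good) in diff_configs(generated, official).items():
    print(f"{key}: saved={bad!r} expected={good!r}")
```

Running a diff like this against the full files would have flagged the `text_config`/`vision_config` mismatch immediately.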

When I removed the offending config.json and copied the original config.json from the official checkpoint in its place, the model output returned to normal:

Loaded LLaVA model: workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test ::: Model output :: 
['a green frog sitting on the ground', 'a large grey elephant standing in the grass']
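The workaround above can be scripted so the broken file is kept for reference. A minimal sketch, with placeholder paths for the fine-tuned checkpoint and the official model directory:

```python
import shutil
from pathlib import Path

def restore_official_config(ckpt_dir: str, official_dir: str) -> None:
    """Back up the checkpoint's generated config.json and replace it with
    the config.json from the official model directory."""
    bad = Path(ckpt_dir) / "config.json"
    bad.rename(bad.parent / "config.json.bak")  # keep the broken file for inspection
    shutil.copy(Path(official_dir) / "config.json", bad)

# Example usage (placeholder paths matching this issue's setup):
# restore_official_config(
#     "workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test",
#     "workspace/MLLM/Models/llava_next_model/llava-onevision-qwen2-0.5b-si",
# )
```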

Is this a bug, or is something wrong with my setup?
Thanks in advance for any help or advice.
