
Adding LLaVA-Phi-Mini-3 #12

Closed · wants to merge 5 commits
Conversation

@s-smits commented Apr 25, 2024

The model almost works, but the chat template appears to be incorrect. Maybe wait for an official update from HF for apply_chat_template?

air@MacBook-Air-van-Air mlx-vlm % python3.10 -m mlx_vlm.generate --model xtuner/llava-phi-3-mini-hf --max-tokens 100 --temp 0.0                       
model-00002-of-00002.safetensors: 100% 3.31G/3.31G [11:01<00:00, 5.01MB/s]
model-00001-of-00002.safetensors: 100% 4.99G/4.99G [13:49<00:00, 6.01MB/s]
Fetching 11 files: 100% 11/11 [13:49<00:00, 75.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Image: http://images.cocodataset.org/val2017/000000039769.jpg 

Prompt: <s><|user|>
<image>
What are these?<|end|>
<|assistant|>

These are two cats sleeping on a pink couch.<|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|>
==========
Prompt: 0.969 tokens-per-sec
Generation: 1.355 tokens-per-sec
@Blaizzy (Owner) commented Apr 25, 2024

Awesome, great work!

@Blaizzy (Owner) commented Apr 25, 2024

The model almost works, but the chat template appears to be incorrect. Maybe wait for an official update from HF for apply_chat_template?

What do you mean by the chat template is wrong?
Could you elaborate?

@s-smits (Author) commented Apr 25, 2024

Thank you!
The output is the following:
These are two cats sleeping on a pink couch.<|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|><|end|>
From my experience this normally is a chat template problem, where the model does not know when to stop. What do you think?

@Blaizzy (Owner) commented Apr 25, 2024

Thank you! The output is the following: These are two cats sleeping on a pink couch.<|end|><|end|><|end|><|end|><|end|><|end|><|end|>... From my experience this normally is a chat template problem, where the model does not know when to stop. What do you think?

Hmm, the model is working fine. The problem is that the eos_token differs from the one used in the chat_template.

If you patch the tokenizer it will work :)

tokenizer.eos_token = "<|end|>"
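For reference, a minimal sketch of where such a patch could go; this assumes a `load` helper that returns a (model, processor) pair and that the tokenizer lives on the processor, so treat it as illustrative rather than the exact mlx-vlm API:

    # Hypothetical usage sketch, not the exact mlx-vlm API.
    from mlx_vlm import load  # assumed helper returning (model, processor)

    model, processor = load("xtuner/llava-phi-3-mini-hf")

    # Patch the end-of-sequence token so generation stops at "<|end|>"
    # instead of emitting it over and over.
    processor.tokenizer.eos_token = "<|end|>"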

@s-smits (Author) commented Apr 25, 2024

Tried two options:

Option 1

Add

    self.tokenizer.eos_token = "<|end|>"

to the language.LanguageModel init.

Option 2

Add

    model.language_model.tokenizer.eos_token = "<|end|>"
    model.language_model.tokenizer.pad_token = "<|end|>"

to llavaPhi.Model.from_pretrained, at the end before return model.

Unfortunately, neither worked.

@Blaizzy (Owner) commented Apr 26, 2024

The tokenizer is in the processor.

LLaVA-type models usually have it set up as processor.tokenizer, while others, such as nanollava, have it as the processor itself.

Check:

https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/utils.py#L714-L719
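In other words, a small sketch mirroring the lookup in the linked utils.py lines (the function name here is illustrative):

    # LLaVA-style models nest the tokenizer under processor.tokenizer,
    # while models such as nanollava use the processor itself as the tokenizer.
    def resolve_tokenizer(processor, image_processor=None):
        if image_processor is not None:
            return processor          # the processor doubles as the tokenizer
        return processor.tokenizer    # tokenizer nested inside the processor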

@s-smits (Author) commented Apr 26, 2024

Got it working with utils.py, lines 714-723:

    if image_processor is not None:
        prompt_tokens = mx.array(processor.encode(prompt))
        tokenizer = processor
    else:
        prompt_tokens = mx.array(processor.tokenizer.encode(prompt))
        tokenizer = processor.tokenizer
    
    if model.config.text_config.model_type == 'llama':
        processor.tokenizer.eos_token = "<|end|>"
        processor.tokenizer.pad_token = "<|end|>"

However, this also overrides the llava-1.5 model, right? So I tried to add an extra condition, but I cannot check model_type == 'llava':

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/[email protected]/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/air/Repositories/mlx-vlm/mlx_vlm/generate.py", line 95, in <module>
    main()
  File "/Users/air/Repositories/mlx-vlm/mlx_vlm/generate.py", line 82, in main
    generate(
  File "/Users/air/Repositories/mlx-vlm/mlx_vlm/utils.py", line 724, in generate
    print('model type',model.config.model_type)
AttributeError: 'ModelConfig' object has no attribute 'model_type'
{
  "architectures": [
    "LlavaForConditionalGeneration"
  ],
  "ignore_index": -100,
  "image_token_index": 32038,
  "model_type": "llava",
  "pad_token_id": 32039,
  "projector_hidden_act": "gelu",
  "text_config": {
    "architectures": [
      "LlamaForCausalLM"
    ],
    "eos_token_id": 32000,
    "hidden_size": 3072,
    "intermediate_size": 8192,
    "max_position_embeddings": 4096,
    "model_type": "llama",
    "original_max_position_embeddings": 4096,
    "pad_token_id": 32000,
    "rms_norm_eps": 1e-05,
    "sliding_window": 2048,
    "torch_dtype": "float16",
    "vocab_size": 32064
  },
  "torch_dtype": "float16",
  "transformers_version": "4.40.1",
  "vision_config": {
    "architectures": [
      "CLIPVisionModel"
    ],
    "dropout": 0.0,
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "torch_dtype": "float32"
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default"
}

If you know how to access the right component where the model config's main architecture == 'llama', could you implement this? Then we can merge.
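A minimal sketch of one way this could be done, assuming the converted model directory contains the Hugging Face config.json (the path handling and function name are illustrative): read the raw config instead of the ModelConfig dataclass, which, per the traceback above, has no model_type attribute.

    import json
    from pathlib import Path

    def top_level_model_type(model_path):
        # Read the raw Hugging Face config.json, since the ModelConfig
        # dataclass does not expose the top-level `model_type`.
        with open(Path(model_path) / "config.json") as f:
            return json.load(f).get("model_type")  # "llava" for this checkpoint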

@Blaizzy (Owner) commented Apr 26, 2024

Got it working with utils.py 714:723

    if image_processor is not None:
        prompt_tokens = mx.array(processor.encode(prompt))
        tokenizer = processor
    else:
        prompt_tokens = mx.array(processor.tokenizer.encode(prompt))
        tokenizer = processor.tokenizer
    
    if model.config.text_config.model_type == 'llama':
        processor.tokenizer.eos_token = "<|end|>"
        processor.tokenizer.pad_token = "<|end|>"

Great job!

Let's keep everything as it was.

I would prefer that we patch the tokenizer directly and save it when we convert the model for the MLX hub :)

That way we don't have to add that extra logic here.
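A rough sketch of what patching at conversion time could look like, using the standard Hugging Face tokenizer API; the output path is illustrative and this is not the actual mlx-vlm conversion code:

    from transformers import AutoTokenizer

    # Patch the tokenizer once, at conversion time, and save it alongside the
    # converted weights so no runtime special-casing is needed.
    tokenizer = AutoTokenizer.from_pretrained("xtuner/llava-phi-3-mini-hf")
    tokenizer.eos_token = "<|end|>"
    tokenizer.pad_token = "<|end|>"
    tokenizer.save_pretrained("path/to/converted-mlx-model")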

@Blaizzy (Owner) commented Apr 26, 2024

Btw, I recently added tests in #13.

If you could add the test cases once you finish, it would be great.

@Blaizzy (Owner) commented Apr 26, 2024

Please don't forget to rename the folder to llavaPhi3

@Blaizzy (Owner) commented Apr 29, 2024

Closes #11

@Blaizzy (Owner) commented Apr 30, 2024

Hey @s-smits,

Thank you very much for your contributions!

During my tests, I found that this model uses the existing llava implementation, so I will close this PR.

You can use the already pre-quantized model in the hub. Just install the latest version:
pip install -U mlx-vlm
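Then generation can be run the same way as at the top of this thread; the hub model id below is a placeholder, since the exact id is not given here:

python3 -m mlx_vlm.generate --model <pre-quantized-llava-phi-3-model-id> --max-tokens 100 --temp 0.0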

@Blaizzy closed this Apr 30, 2024
@s-smits (Author) commented May 1, 2024

You're welcome. Due to other projects I couldn't finish it in time; thank you for converting it.
