[Help Wanted] the alignment with official accuracy in llama3.2-vision #493

Open
droidXrobot opened this issue Sep 29, 2024 · 8 comments
Labels: help wanted (Extra attention is needed)
@droidXrobot

No description provided.

@shan23chen

Does the repo support this model yet? Thanks!

@FangXinyu-0913 (Collaborator)

Hi @droidXrobot @shan23chen! This repo now supports Llama-3.2-11B/90B-Vision-Instruct; you can use it with the latest transformers version (>=4.45.0.dev0)!
However, the evaluation results obtained with the current repo do not match the official ones, and even after aligning the hyperparameters and the system prompt, accuracy is still noticeably lower (mainly on AI2D). Is anyone willing to look into this?

Ref:
https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/generation_config.json
https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md
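For anyone trying to reproduce the alignment, here is a minimal sketch (not the VLMEvalKit code path) of how the sampling settings in the linked generation_config.json can be inspected and compared against whatever the harness passes to model.generate. It assumes transformers >= 4.45 and access to the gated meta-llama checkpoint; the greedy settings at the end are only an illustration.

```python
# A minimal sketch, assuming transformers >= 4.45 and access to the gated
# meta-llama repo; this is not the VLMEvalKit code path, only a way to read
# the official sampling settings referenced above for comparison.
from transformers import GenerationConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Fetch generation_config.json from the Hub and print the fields that most
# often cause score drift (do_sample, temperature, top_p, max length, ...).
official_cfg = GenerationConfig.from_pretrained(model_id)
print(official_cfg)

# For debugging, it can also help to run with explicit greedy decoding and
# compare scores against a run that uses the official sampling defaults.
greedy_cfg = GenerationConfig(do_sample=False, max_new_tokens=128)
print(greedy_cfg)
```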

@FangXinyu-0913 FangXinyu-0913 added the help wanted Extra attention is needed label Oct 3, 2024
@FangXinyu-0913 FangXinyu-0913 changed the title Someone please add Llama 3.2 11b to the leaderboard [Help Wanted] the alignment with official accuracy in llama3.2-vision Oct 3, 2024
@luohao123

Actually, none of my benchmarks align with the results I got for the same model before...

This repo updates too quickly... many things might have caused the misalignment.

@kennymckormick (Member)

> Actually, none of my benchmarks align with the results I got for the same model before... This repo updates too quickly... many things might have caused the misalignment.

Could you please provide more information, such as the commit IDs of the previous and current code you used for evaluation, as well as the models and benchmarks you evaluated?

@luohao123

luohao123 commented Oct 9, 2024

As a user, I cannot diff every commit to see what changed; that is the maintainers' responsibility.

In the current situation, scores drop on every benchmark, to the point that the evaluation looks simply wrong. The changes I can observe are:

  1. The TSV files are newly generated;
  2. There is a new operation that did not exist before (shown in the screenshot below); I don't know what it is, and it is slow:
    [screenshot]
  3. The metrics are now lower on all benchmarks for the same model;
  4. I don't know what has changed inside the eval kit.

I even suspected my training codebase had gone wrong, which stalled me for about a week.

Afterwards, I realised the evaluation pipeline was broken: the old model cannot reproduce the metrics it got before.

Any suggestions?

@kennymckormick (Member)

> As a user, I cannot diff every commit to see what changed; that is the maintainers' responsibility. [...] The old model cannot reproduce the metrics it got before. Any suggestions?

At the very least, you need to provide some information so that we can help. Please tell me which model you are using and one or several of the benchmarks you are evaluating. If you cannot find the exact commit you started from, please try to remember when you first used this codebase.
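As a side note on tracking down regressions like this, below is a small hedged sketch of how one could record the exact VLMEvalKit commit next to each result file; the `current_commit` helper and the `./VLMEvalKit` path are hypothetical, not part of the toolkit.

```python
# A hypothetical helper (not part of VLMEvalKit) for logging which commit a
# run was produced with, so old and new scores can be tied to code versions.
import subprocess

def current_commit(repo_path: str = "./VLMEvalKit") -> str:
    """Return the short git commit hash of the checkout at repo_path."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Store this next to every result file; it makes bisecting score drops
    # between two commits much easier than guessing from dates.
    print("VLMEvalKit commit:", current_commit())
```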

@kennymckormick (Member)

> As a user, I cannot diff every commit to see what changed; that is the maintainers' responsibility. [...] The old model cannot reproduce the metrics it got before. Any suggestions?

Same here: #503 (comment)

Also, if you want to pursue this problem further, creating a new issue would be a better idea. Your problem is not related to this Llama-3.2 issue.

@terry-for-github

> Same here: #503 (comment)
>
> Also, if you want to pursue this problem further, creating a new issue would be a better idea. Your problem is not related to this Llama-3.2 issue.

@kennymckormick Same issue here too. #523 Could you please check this one? Thanks!
