mixtral-8x7b: Reference Implementation Accuracy Failure on H200 #2018

Open
mrmhodak opened this issue Jan 7, 2025 · 8 comments

@mrmhodak
Contributor

mrmhodak commented Jan 7, 2025

When running the reference implementation on an H200, I see an accuracy failure:

| Metric | Target Score | H200 Reference Implementation | Percentage Diff (%) |
|---|---|---|---|
| rouge1 | 45.5989 | 45.127 | 1.034893386 |
| rouge2 | 23.3526 | 22.9785 | 1.601962951 |
| rougeL | 30.4608 | 30.4806 | 0.065001576 |
| gsm8k | 73.66 | 74.06 | 0.543035569 |
| mbxp | 60.16 | 60.22 | 0.099734043 |
| tokens per sample | 144.84 | 283.5 | 95.73322287 |

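
For anyone who wants to reproduce the last column, here is a minimal sketch that recomputes the percentage difference from the target and measured scores above and flags the metrics that look out of bounds. The pass/fail thresholds (at least 99% of target for the accuracy metrics, tokens per sample within 90% to 110% of target) are assumptions used for illustration and are not stated in this thread; the helper names are hypothetical.

```python
# Scores copied from the numbers in this issue: (target, measured on H200).
SCORES = {
    "rouge1": (45.5989, 45.127),
    "rouge2": (23.3526, 22.9785),
    "rougeL": (30.4608, 30.4806),
    "gsm8k": (73.66, 74.06),
    "mbxp": (60.16, 60.22),
    "tokens per sample": (144.84, 283.5),
}

# ASSUMPTION: illustrative thresholds, not taken from this thread.
MIN_ACCURACY_RATIO = 0.99        # accuracy metrics must reach 99% of target
TOKEN_RATIO_RANGE = (0.90, 1.10)  # tokens per sample must stay within +/-10%


def percent_diff(target, measured):
    """Absolute difference relative to the target, in percent."""
    return abs(measured - target) / target * 100.0


def failing_metrics(scores):
    """Return the metric names that violate the assumed thresholds."""
    failed = []
    for name, (target, measured) in scores.items():
        ratio = measured / target
        if name == "tokens per sample":
            ok = TOKEN_RATIO_RANGE[0] <= ratio <= TOKEN_RATIO_RANGE[1]
        else:
            ok = ratio >= MIN_ACCURACY_RATIO
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    for name, (target, measured) in SCORES.items():
        print(f"{name}: {percent_diff(target, measured):.6f}% diff")
    print("Failing under the assumed thresholds:", failing_metrics(SCORES))
```

Under those assumed thresholds, rouge1, rouge2, and tokens per sample are the metrics that fall outside the allowed range.
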
@mrmhodak
Contributor Author

mrmhodak commented Jan 7, 2025

@pgmpablo157321 @nvzhihanj @arjunsuresh : Any comments?

@arjunsuresh
Contributor

Hi @mrmhodak, we are doing a full accuracy run for this, but it won't finish until Thursday.

@nvzhihanj
Contributor

We updated the Mixtral dataset this round (for the EOS issue). Were you running with the latest dataset and the latest settings (i.e. min_output_len=2)?
We will launch a local run to verify as well.

@mrmhodak
Contributor Author

mrmhodak commented Jan 7, 2025

@nvzhihanj : Yes, all latest, freshly downloaded according to latest instructions using rclone.

@mrmhodak
Contributor Author

@arjunsuresh @nvzhihanj @pgmpablo157321: Any update on this?

@nvzhihanj
Contributor

I was able to re-run the standalone script and double-check the accuracy of the model:

Evaluating GSM8K score...
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
 60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469}  out of  {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026
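
As a sanity check on the log above, the headline numbers can be recomputed from the raw counts. This is a minimal sketch: the values are copied from the log, the variable names are hypothetical, and the equal-weight average of the three `gen_token_per_sample` values assumes each subset (GSM8K, OpenOrca, MBXP) contributes the same number of samples, which the log shows for GSM8K and MBXP (5000 each) but not explicitly for OpenOrca.

```python
# Per-language MBXP pass counts and totals, copied from the log.
mbxp_passed = {"cpp": 381, "typescript": 438, "ruby": 419,
               "python": 492, "php": 809, "javascript": 469}
mbxp_total = {"cpp": 743, "typescript": 868, "ruby": 846,
              "python": 863, "php": 846, "javascript": 834}

# Overall pass@1 is total passes over total problems, in percent.
mbxp_pass_at_1 = sum(mbxp_passed.values()) / sum(mbxp_total.values()) * 100

# GSM8K exact match from the "correct: 3683 / 5000" line.
gsm8k_em = 3683 / 5000

# Generated tokens per sample reported for each subset.
tokens_per_sample = {"gsm8k": 129.9604, "openorca": 205.8656, "mbxp": 98.7026}

# ASSUMPTION: equal weighting, i.e. every subset has the same sample count.
mean_tokens = sum(tokens_per_sample.values()) / len(tokens_per_sample)

if __name__ == "__main__":
    print(f"MBXP pass@1: {mbxp_pass_at_1:.2f}%")
    print(f"GSM8K EM: {gsm8k_em:.4f}")
    print(f"Mean tokens per sample: {mean_tokens:.2f}")
```

The recomputed values agree with the log (60.16% pass@1, 0.7366 EM), and the equal-weight mean of the three token counts rounds to the 144.84 tokens-per-sample target quoted at the top of this issue.
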

The bug must be in the reference implementation. FYI @pgmpablo157321, I will check the standalone script into the repo later.
One thing: please make sure you use the checkpoint downloaded from the MLCommons cloud, not the public one.

@arjunsuresh changed the title from "maixtral-8x7b: Reference Implementation Accuracy Failure on H200" to "mixtral-8x7b: Reference Implementation Accuracy Failure on H200" on Jan 13, 2025

@nvzhihanj
Contributor

I added the reference standalone scripts in #2029 and formalized the Docker workflow. For the reference implementation, @pgmpablo157321, can you help investigate the discrepancy between the standalone script and the existing code?

@pgmpablo157321
Contributor

@nvzhihanj Working on this
