mixtral-8x7b: Reference Implementation Accuracy Failure on H200 #2018

mrmhodak · 2025-01-07T05:52:52Z

When running the reference implementation on H200, I see an accuracy failure:

| Metric | Target Score | H200 Reference Implementation | Percentage Diff (%) |
| --- | --- | --- | --- |
| rouge1 | 45.5989 | 45.127 | 1.034893386 |
| rouge2 | 23.3526 | 22.9785 | 1.601962951 |
| rougeL | 30.4608 | 30.4806 | 0.065001576 |
| gsm8k | 73.66 | 74.06 | 0.543035569 |
| mbxp | 60.16 | 60.22 | 0.099734043 |
| tokens per sample | 144.84 | 283.5 | 95.73322287 |
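For anyone who wants to reproduce the last column, here is a minimal sketch that recomputes the percentage difference as `abs(measured - target) / target * 100` from the numbers in the table. The 1% tolerance and the names `pct_diff`, `ASSUMED_TOLERANCE_PCT`, and `below_target` are assumptions made for illustration; they are not taken from the reference implementation or the benchmark rules.

```python
# Values copied from the table above: metric -> (target score, H200 reference run).
results = {
    "rouge1": (45.5989, 45.127),
    "rouge2": (23.3526, 22.9785),
    "rougeL": (30.4608, 30.4806),
    "gsm8k": (73.66, 74.06),
    "mbxp": (60.16, 60.22),
    "tokens per sample": (144.84, 283.5),
}

# Assumption for illustration only: flag metrics more than 1% below target.
ASSUMED_TOLERANCE_PCT = 1.0


def pct_diff(target: float, measured: float) -> float:
    """Absolute difference relative to the target, in percent."""
    return abs(measured - target) / target * 100.0


# Recompute the "Percentage Diff" column.
report = {name: pct_diff(t, m) for name, (t, m) in results.items()}

# Metrics that fall below the target by more than the assumed tolerance.
below_target = [
    name
    for name, (t, m) in results.items()
    if m < t and pct_diff(t, m) > ASSUMED_TOLERANCE_PCT
]

for name, diff in report.items():
    target, measured = results[name]
    direction = "below" if measured < target else "above"
    print(f"{name}: {diff:.4f}% {direction} target")
print("Below target beyond assumed tolerance:", below_target)
```

Under that assumed tolerance, only rouge1 and rouge2 are flagged as below target. Tokens per sample is above the target rather than below it, so it would need a separate two-sided check.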

mrmhodak · 2025-01-07T05:53:25Z

@pgmpablo157321 @nvzhihanj @arjunsuresh : Any comments?

arjunsuresh · 2025-01-07T15:12:59Z

Hi @mrmhodak, we are running the full accuracy run for this, but it won't finish until Thursday.

nvzhihanj · 2025-01-07T17:13:46Z

We did the dataset update for Mixtral this round (for the EOS issue). Were you running on the latest dataset and latest settings (i.e. min_output_len=2)?
We will launch a local run to verify as well.

mrmhodak · 2025-01-07T17:29:41Z

@nvzhihanj : Yes, all latest, freshly downloaded according to latest instructions using rclone.

mrmhodak · 2025-01-10T07:11:21Z

@arjunsuresh @nvzhihanj @pgmpablo157321: Any update on this?

nvzhihanj · 2025-01-13T00:19:51Z

I was able to re-run the standalone script and double-check the accuracy of the model:

Evaluating GSM8K score...
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|██████████| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
 60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469}  out of  {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026

The bug must be in the reference implementation. FYI @pgmpablo157321, I will check the standalone script in to the repo later.
One thing: please make sure you use the checkpoint downloaded from the MLCommons cloud, not the public one.
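As a quick sanity check on the log above, the per-language MBXP counts can be summed and compared with the reported pass@1. This is a minimal sketch using only the numbers printed in the log; the variable names are illustrative and do not come from the standalone script.

```python
# Per-language counts copied from the MBXP log output above.
passed = {"cpp": 381, "typescript": 438, "ruby": 419,
          "python": 492, "php": 809, "javascript": 469}
total = {"cpp": 743, "typescript": 868, "ruby": 846,
         "python": 863, "php": 846, "javascript": 834}

# Overall pass@1 is total passes over total samples.
overall_pass_at_1 = sum(passed.values()) / sum(total.values())

# Per-language pass rates, useful for spotting a single outlier language.
per_language = {lang: passed[lang] / total[lang] for lang in total}

print(f"overall pass@1: {overall_pass_at_1:.2%}")
for lang, rate in sorted(per_language.items()):
    print(f"{lang}: {rate:.2%}")

# GSM8K exact match from the same log: 3683 correct out of 5000.
gsm8k_em = 3683 / 5000
print(f"GSM8K EM: {gsm8k_em:.4f}")
```

The counts sum to 3008 of 5000 samples, which agrees with the 60.16% pass@1 in the log, and 3683 / 5000 agrees with the reported EM of 0.7366.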

nvzhihanj · 2025-01-13T20:07:53Z

I added the reference standalone scripts in #2029 and formalized the Docker workflow. For the reference implementation, @pgmpablo157321, can you help with the discrepancy between the standalone script and the existing code?

pgmpablo157321 · 2025-01-13T20:57:59Z

@nvzhihanj Working on this

arjunsuresh changed the title ~~maixtral-8x7b: Reference Implementation Accuracy Failure on H200~~ mixtral-8x7b: Reference Implementation Accuracy Failure on H200 Jan 13, 2025

nvzhihanj mentioned this issue Jan 13, 2025

Add mixtral dockerfile and standalone inference script #2029

Merged
