Model | Mode | Acc (%) | No answer (%) | Total | Reason Lens |
---|---|---|---|---|---|
o1-preview-2024-09-12 | greedy | 95.88 | 1.25 | 800 | 504.17 |
o1-mini-2024-09-12 | greedy | 93.75 | 0.38 | 800 | 468.3 |
gpt-4o-2024-08-06 | greedy | 87 | 0.12 | 800 | 760.7 |
chatgpt-4o-latest-24-09-07 | greedy | 86.5 | 0 | 800 | 683.45 |
gpt-4o-2024-05-13 | greedy | 86.12 | 0.25 | 800 | 611.46 |
claude-3-5-sonnet-20241022 | greedy | 83.88 | 0 | 800 | 466.25 |
claude-3-5-sonnet-20240620 | greedy | 80.75 | 0 | 800 | 518.28 |
gemini-1.5-pro-exp-0827 | greedy | 79.62 | 0.25 | 800 | 581.76 |
gpt-4-turbo-2024-04-09 | greedy | 78.88 | 0 | 800 | 566.57 |
gpt-4o-mini-2024-07-18 | greedy | 75.88 | 0.12 | 800 | 391.32 |
grok-2-1212 | greedy | 75.25 | 0 | 800 | 633.1 |
Mistral-Large-2 | greedy | 75.12 | 0.25 | 800 | 469.91 |
gemini-1.5-pro-exp-0801 | greedy | 74.88 | 0.12 | 800 | 436.82 |
gpt-4-0314 | greedy | 74.5 | 0 | 800 | 404.28 |
gemini-1.5-flash-exp-0827 | greedy | 74.5 | 0.38 | 800 | 631.85 |
Llama-3.1-405B-Inst-fp8@together | greedy | 74.12 | 2.62 | 800 | 300.62 |
Qwen2.5-72B-Instruct | greedy | 73.88 | 0 | 800 | 531.01 |
Llama-3.1-405B-Inst@hyperbolic | greedy | 73.5 | 1.12 | 800 | 345.76 |
Llama-3.1-405B-Inst@sambanova | greedy | 73 | 0.12 | 800 | 414.28 |
deepseek-v2-chat-0628 | greedy | 70.5 | 0 | 800 | 568.12 |
claude-3-opus-20240229 | greedy | 70.38 | 0 | 800 | 521.62 |
deepseek-v2.5-0908 | greedy | 70 | 0.12 | 800 | 524.02 |
Qwen2.5-32B-Instruct | greedy | 69.88 | 0.38 | 800 | 545.23 |
deepseek-v2-coder-0724 | greedy | 69.5 | 0 | 800 | 564.88 |
claude-3-5-haiku-20241022 | greedy | 68.75 | 0 | 800 | 486.32 |
gemini-1.5-pro | greedy | 68 | 0.25 | 800 | 385.66 |
claude-3-sonnet-20240229 | greedy | 66.62 | 0 | 800 | 749.15 |
Meta-Llama-3.1-70B-Instruct | greedy | 64.25 | 0.5 | 800 | 493.74 |
gemini-1.5-flash | greedy | 63.75 | 0.25 | 800 | 514.44 |
yi-large-preview | greedy | 60.38 | 0 | 800 | 689.52 |
yi-large | greedy | 60.25 | 0 | 800 | 628.25 |
Qwen2-72B-Instruct | greedy | 59.13 | 0 | 800 | 444.5 |
Meta-Llama-3-70B-Instruct | greedy | 58.88 | 0 | 800 | 431.53 |
gemma-2-27b-it | greedy | 57.25 | 0 | 800 | 421.67 |
claude-3-haiku-20240307 | greedy | 54.75 | 0.12 | 800 | 708.22 |
gpt-3.5-turbo-0125 | greedy | 54.75 | 0.25 | 800 | 405.27 |
Qwen2.5-7B-Instruct | greedy | 52.75 | 0.5 | 800 | 531.07 |
Athene-70B | greedy | 50.62 | 0 | 800 | 283.62 |
reka-core-20240501 | greedy | 46.25 | 0 | 800 | 525.5 |
gemma-2-9b-it | greedy | 46 | 0 | 800 | 484.51 |
Mixtral-8x7B-Instruct-v0.1 | greedy | 44.88 | 0.25 | 800 | 463.08 |
Phi-3-mini-4k-instruct | greedy | 44.75 | 0.75 | 800 | 539.63 |
Yi-1.5-9B-Chat | greedy | 44.75 | 1.62 | 800 | 593.18 |
Yi-1.5-34B-Chat | greedy | 44.12 | 0 | 800 | 561.47 |
Phi-3.5-mini-instruct | greedy | 42.12 | 3 | 800 | 625.13 |
Meta-Llama-3.1-8B-Instruct | greedy | 39.88 | 0.62 | 800 | 535.85 |
Qwen2-7B-Instruct | greedy | 37.88 | 0.12 | 800 | 368.51 |
Meta-Llama-3-8B-Instruct | greedy | 37.75 | 0.25 | 800 | 411.52 |
reka-flash-20240226 | greedy | 34.12 | 0 | 800 | 565.61 |
Qwen2.5-3B-Instruct | greedy | 33.12 | 1 | 800 | 502.87 |
gemma-2-2b-it | greedy | 21.5 | 0 | 800 | 351.05 |