zebra-grid.summary.md

| Model | Mode | N_Mode | N_Size | Puzzle Acc | Easy Puzzle Acc | Hard Puzzle Acc | Cell Acc | No answer | Total Puzzles | Reason Lens |
|---|---|---|---|---|---|---|---|---|---|---|
| o1-2024-12-17 | greedy | single | 1 | 81 | 98.21 | 74.31 | 78.74 | 0.2 | 1000 | 1197.51 |
| deepseek-R1 | greedy | single | 1 | 78.7 | 98.57 | 70.97 | 80.54 | 0 | 1000 | 586.33 |
| o1-preview-2024-09-12 | greedy | single | 1 | 71.4 | 98.57 | 60.83 | 75.14 | 0.3 | 1000 | 1565.88 |
| o1-preview-2024-09-12-v2 | greedy | single | 1 | 70.4 | 98.21 | 59.58 | 74.18 | 0.4 | 1000 | 1559.71 |
| o1-mini-2024-09-12-v3 | greedy | single | 1 | 59.7 | 86.07 | 49.44 | 70.32 | 1 | 1000 | 1166.38 |
| o1-mini-2024-09-12-v2 | greedy | single | 1 | 56.8 | 82.86 | 46.67 | 69.87 | 1.3 | 1000 | 1164.95 |
| o1-mini-2024-09-12 | greedy | single | 1 | 52.6 | 87.14 | 39.17 | 52.29 | 0.8 | 1000 | 993.28 |
| deepseek-v3 | greedy | single | 1 | 42.1 | 90 | 23.47 | 42.04 | 27.9 | 1000 | 2158 |
| claude-3-5-sonnet-20241022 | greedy | single | 1 | 36.2 | 91.07 | 14.86 | 54.27 | 0 | 1000 | 861.18 |
| claude-3-5-sonnet-20240620 | greedy | single | 1 | 33.4 | 87.5 | 12.36 | 54.34 | 0 | 1000 | 1141.94 |
| Llama-3.1-405B-Inst-fp8@together | greedy | single | 1 | 32.6 | 87.14 | 11.39 | 45.8 | 12.5 | 1000 | 314.66 |
| gpt-4o-2024-08-06 | greedy | single | 1 | 31.7 | 84.64 | 11.11 | 50.34 | 3.6 | 1000 | 1106.51 |
| gemini-1.5-pro-exp-0827 | greedy | single | 1 | 30.5 | 79.64 | 11.39 | 50.84 | 0.8 | 1000 | 1594.47 |
| Llama-3.1-405B-Inst@sambanova | greedy | single | 1 | 30.1 | 84.64 | 8.89 | 39.06 | 24.7 | 1000 | 2001.12 |
| chatgpt-4o-latest-24-09-07 | greedy | single | 1 | 29.9 | 81.43 | 9.86 | 48.83 | 4.2 | 1000 | 1539.99 |
| Mistral-Large-2 | greedy | single | 1 | 29 | 80.36 | 9.03 | 47.64 | 1.7 | 1000 | 1592.39 |
| gpt-4-turbo-2024-04-09 | greedy | single | 1 | 28.4 | 80.71 | 8.06 | 47.9 | 0.1 | 1000 | 1148.46 |
| gpt-4o-2024-05-13 | greedy | single | 1 | 28.2 | 77.86 | 8.89 | 38.72 | 19.3 | 1000 | 1643.51 |
| grok-2-1212 | greedy | single | 1 | 27.7 | 76.43 | 8.75 | 48.16 | 3.5 | 1000 | 2551.39 |
| gpt-4-0314 | greedy | single | 1 | 27.1 | 77.14 | 7.64 | 47.43 | 0.2 | 1000 | 1203.17 |
| claude-3-opus-20240229 | greedy | single | 1 | 27 | 78.21 | 7.08 | 48.91 | 0 | 1000 | 855.72 |
| Qwen2.5-72B-Instruct | greedy | single | 1 | 26.6 | 76.43 | 7.22 | 40.92 | 11.9 | 1000 | 1795.9 |
| Qwen2.5-32B-Instruct | greedy | single | 1 | 26.1 | 77.5 | 6.11 | 43.39 | 6.3 | 1000 | 1333.07 |
| gemini-1.5-pro-exp-0801 | greedy | single | 1 | 25.2 | 72.5 | 6.81 | 48.5 | 0 | 1000 | 1389.75 |
| Llama-3.1-405B-Inst@hyperbolic | greedy | single | 1 | 25 | 66.67 | 15.38 | 46.62 | 6.25 | 16 | 1517.13 |
| gemini-1.5-flash-exp-0827 | greedy | single | 1 | 25 | 70.71 | 7.22 | 43.56 | 8.5 | 1000 | 1705.11 |
| Meta-Llama-3.1-70B-Instruct | greedy | single | 1 | 24.9 | 73.57 | 5.97 | 27.98 | 43 | 1000 | 1483.68 |
| deepseek-v2-chat-0628 | greedy | single | 1 | 22.7 | 68.57 | 4.86 | 42.46 | 5.2 | 1000 | 1260.23 |
| deepseek-v2.5-0908 | greedy | single | 1 | 22.1 | 68.21 | 4.17 | 38.01 | 12.7 | 1000 | 1294.46 |
| Qwen2-72B-Instruct | greedy | single | 1 | 21.4 | 63.93 | 4.86 | 38.32 | 10.2 | 1000 | 1813.82 |
| deepseek-v2-coder-0614 | greedy | single | 1 | 21.1 | 64.64 | 4.17 | 41.58 | 4.9 | 1000 | 1324.55 |
| deepseek-v2-coder-0724 | greedy | single | 1 | 20.5 | 61.79 | 4.44 | 42.35 | 3.4 | 1000 | 1230.63 |
| gpt-4o-mini-2024-07-18 | greedy | single | 1 | 20.1 | 62.5 | 3.61 | 41.26 | 0.1 | 1000 | 943.52 |
| gemini-1.5-flash | greedy | single | 1 | 19.4 | 59.29 | 3.89 | 31.77 | 22.7 | 1000 | 1538.18 |
| gemini-1.5-pro | greedy | single | 1 | 19.4 | 55.71 | 5.28 | 44.59 | 0.8 | 1000 | 1336.17 |
| yi-large-preview | greedy | single | 1 | 18.9 | 58.93 | 3.33 | 42.61 | 1.4 | 1000 | 833.36 |
| yi-large | greedy | single | 1 | 18.8 | 58.21 | 3.47 | 39.83 | 1.8 | 1000 | 757.01 |
| claude-3-5-haiku-20241022 | greedy | single | 1 | 18.7 | 57.86 | 3.47 | 43.22 | 0.1 | 1000 | 660.91 |
| claude-3-sonnet-20240229 | greedy | single | 1 | 18.7 | 58.93 | 3.06 | 43.66 | 0 | 1000 | 1095.37 |
| Meta-Llama-3-70B-Instruct | greedy | single | 1 | 16.8 | 52.86 | 2.78 | 42.31 | 0.2 | 1000 | 809.95 |
| Athene-70B | greedy | single | 1 | 16.7 | 52.5 | 2.78 | 32.98 | 21.1 | 1000 | 391.19 |
| gemma-2-27b-it | greedy | single | 1 | 16.3 | 50.71 | 2.92 | 41.18 | 1.1 | 1000 | 1014.56 |
| claude-3-haiku-20240307 | greedy | single | 1 | 14.3 | 47.86 | 1.25 | 37.87 | 0.1 | 1000 | 1015.06 |
| command-r-plus | greedy | single | 1 | 13.9 | 44.64 | 1.94 | 39.01 | 0.2 | 1000 | 810.53 |
| reka-core-20240501 | greedy | single | 1 | 13 | 43.21 | 1.25 | 33.88 | 4 | 1000 | 1078.29 |
| gemma-2-9b-it | greedy | single | 1 | 12.8 | 41.79 | 1.53 | 36.79 | 0 | 1000 | 849.84 |
| Meta-Llama-3.1-8B-Instruct | greedy | single | 1 | 12.8 | 43.57 | 0.83 | 13.68 | 61.5 | 1000 | 1043.9 |
| Qwen2.5-7B-Instruct | greedy | single | 1 | 12 | 38.93 | 1.53 | 30.67 | 9.5 | 1000 | 850.93 |
| Meta-Llama-3-8B-Instruct | greedy | single | 1 | 11.9 | 40.71 | 0.69 | 23.7 | 29.2 | 1000 | 1216.4 |
| Mistral-Nemo-Instruct-2407 | greedy | single | 1 | 11.8 | 38.93 | 1.25 | 34.93 | 1.6 | 1000 | 925.88 |
| Phi-3-mini-4k-instruct | greedy | single | 1 | 11.6 | 38.21 | 1.25 | 13.5 | 59 | 1000 | 790.29 |
| Yi-1.5-34B-Chat | greedy | single | 1 | 11.5 | 37.5 | 1.39 | 32.73 | 4.4 | 1000 | 869.65 |
| gpt-3.5-turbo-0125 | greedy | single | 1 | 10.1 | 33.57 | 0.97 | 33.06 | 0.1 | 1000 | 820.66 |
| command-r | greedy | single | 1 | 9.9 | 32.14 | 1.25 | 32.66 | 1.5 | 1000 | 1005.17 |
| reka-flash-20240226 | greedy | single | 1 | 9.3 | 30.71 | 0.97 | 25.67 | 18.7 | 1000 | 1074.8 |
| mathstral-7B-v0.1 | greedy | single | 1 | 9 | 30 | 0.83 | 20.42 | 36 | 1000 | 1148.16 |
| Mixtral-8x7B-Instruct-v0.1 | greedy | single | 1 | 8.7 | 28.93 | 0.83 | 26.47 | 20.3 | 1000 | 1177.21 |
| Qwen2-7B-Instruct | greedy | single | 1 | 8.4 | 29.29 | 0.28 | 22.06 | 24.4 | 1000 | 1473.23 |
| Llama-3.2-3B-Instruct@together | greedy | single | 1 | 7.4 | 25.71 | 0.28 | 13.14 | 54.5 | 1000 | 963.47 |
| Phi-3.5-mini-instruct | greedy | single | 1 | 6.4 | 21.79 | 0.42 | 5.98 | 80.6 | 1000 | 718.43 |
| Qwen2.5-3B-Instruct | greedy | single | 1 | 4.8 | 17.14 | 0 | 11.44 | 56.7 | 1000 | 906.58 |
| gemma-2-2b-it | greedy | single | 1 | 4.2 | 14.29 | 0.28 | 9.97 | 57.2 | 1000 | 1032.89 |
| Yi-1.5-9B-Chat | greedy | single | 1 | 2.3 | 8.21 | 0 | 7.53 | 11.3 | 1000 | 1592.6 |
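
The Puzzle Acc column appears to be a puzzle-count-weighted mean of Easy Puzzle Acc and Hard Puzzle Acc. The sketch below checks that reading for the top two rows. The 280 easy / 720 hard split is an assumption inferred from the reported figures, not something this file states, and `combined_accuracy` is a hypothetical helper name rather than part of any evaluation code.

```python
def combined_accuracy(easy_acc, hard_acc, n_easy, n_hard):
    """Puzzle-count-weighted mean of two subset accuracies (in percent)."""
    total = n_easy + n_hard
    if total == 0:
        raise ValueError("need at least one puzzle")
    return (easy_acc * n_easy + hard_acc * n_hard) / total


# ASSUMPTION: a 280 easy / 720 hard split of the 1000 puzzles. This is
# inferred from the table's numbers and is not stated in the summary file.
N_EASY, N_HARD = 280, 720

# (Easy Puzzle Acc, Hard Puzzle Acc, reported Puzzle Acc) copied from the table.
rows = {
    "o1-2024-12-17": (98.21, 74.31, 81.0),
    "deepseek-R1": (98.57, 70.97, 78.7),
}

for model, (easy, hard, reported) in rows.items():
    estimate = combined_accuracy(easy, hard, N_EASY, N_HARD)
    print(f"{model}: reported {reported}, recomputed {estimate:.1f}")
```

If the recomputed value matches the reported one to one decimal place, the assumed split is consistent with that row. The Llama-3.1-405B-Inst@hyperbolic row covers only 16 puzzles, so it would need its own counts.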