
Preferred Generation Benchmark

pfgen-benchmark is a benchmark for evaluating Japanese text generation, designed specifically for pretrained models. Unlike conventional benchmarks, which use templates containing instructions, this benchmark relies solely on providing numerous examples. By conveying expectations purely through examples, such as the question-answering nature of the task, answers of approximately 100 characters, and output resembling formal public documents, it minimizes the influence of differences in instructions or templates. Output evaluation uses n-gram-based methods, which, unlike the LLM-as-a-Judge approach, make evaluation fast, cheap, and deterministic.
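To illustrate the idea, here is a minimal Python sketch of deterministic n-gram scoring, assuming character bigrams and a plain F1 score against a single reference. It is an illustration only; the benchmark's actual metric, which also produces separate fluency, truthfulness, and helpfulness scores, differs in detail.

from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_f1(candidate, reference, n=2):
    """F1 over character n-gram multisets; fast, cheap, and deterministic."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("日本最古の勅撰和歌集は古今和歌集です。", "日本最古の勅撰和歌集は古今和歌集である。"))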

To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models: openly accessible models, models mentioned in academic papers, and models announced by companies through press releases. Contributions of model outputs are encouraged and can be submitted via pull request; for detailed instructions, see the "How to contribute" section.

See more details: Jxiv preprint (doi:10.51094/jxiv.1008); arXiv version TBD.


License of LLM outputs

Everything in this repository other than LLM outputs is licensed under the Apache License, Version 2.0. The license of each LLM output depends on the license of the model that produced it.

How to evaluate a model

You can evaluate a model using run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). For detailed parameters, see --help. The --num-trials parameter sets the number of prompt patterns for which the model generates answers; choose it by weighing execution time against the required accuracy.

# Run a model using the Hugging Face transformers library (or run-vllm.py for vLLM).
python ./run-hf.py --model=pfnet/plamo-13b --num-trials=5

# Evaluate output and update leaderboard.
make

How to contribute

Follow the instructions in the "How to evaluate a model" section to run the evaluation. This process generates config.json and trials.jsonl.xz under the result directory. Please create a pull request containing only these two files.
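Before opening a pull request, you may want to sanity-check the generated file. The sketch below assumes trials.jsonl.xz is xz-compressed JSON Lines (one JSON object per trial), as the file name suggests; the path and the schema inspection are hypothetical.

import json
import lzma

# Path is an example; use the directory produced by your own run.
with lzma.open("result/pfnet--plamo-13b/trials.jsonl.xz", "rt", encoding="utf-8") as f:
    for line in f:
        trial = json.loads(line)
        print(sorted(trial.keys()))  # inspect the actual schema of one trial
        break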

To ensure more accurate ranking among models, the number of executions (--num-trials) should be as large as possible, up to a limit of 100 trials.
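As a rough intuition for this trade-off: the uncertainty of a mean over n trials shrinks roughly as stdev/√n, which appears to match the "(±…/√n)" notation in the leaderboard below. The sketch is an assumption for illustration, not the repository's actual scoring code.

import statistics

def score_with_uncertainty(trial_scores):
    """Mean score and its standard error over n trials (assumed ±stdev/√n)."""
    n = len(trial_scores)
    mean = statistics.fmean(trial_scores)
    stderr = statistics.stdev(trial_scores) / n ** 0.5 if n > 1 else 0.0
    return mean, stderr

print(score_with_uncertainty([0.78, 0.81, 0.76, 0.80, 0.79]))  # e.g., 5 trials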

Leaderboard

Rank Score                    Model                                       Length           Fluency Truthfulness Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
2 0.9144 (±0.0037/√2) 💬 deepseek-ai/DeepSeek-V3 87.4 (±14.9) 0.960 0.983 0.800
3 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
4 0.8584 (±0.0163/√10) 💬 deepseek-ai/DeepSeek-R1 106.1 (±13.5) 0.839 0.929 0.807
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
5 0.8359 (±0.0216/√10) 💬 Qwen/Qwen-Max-2025-01-25 89.6 (±18.7) 0.864 0.968 0.676
6 0.8352 (±0.0107/√10) 💬 Qwen/Qwen-Max 88.8 (±18.7) 0.862 0.964 0.679
7 0.8279 (±0.0131/√10) 💬 MiniMax-Text-01 77.8 (±22.2) 0.858 0.988 0.638
8 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
9 0.8192 (±0.0207/√10) 💬 google/gemini-1.5-pro-002 76.3 (±17.4) 0.826 0.976 0.656
10 0.8157 (±0.0119/√10) 💬 MiniMax-Text-01 78.9 (±25.5) 0.850 0.986 0.611
11 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
12 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
13 0.7827 (±0.0129/√100) 💬 Qwen/Qwen2.5-72B-Instruct 98.7 (±14.8) 0.871 0.936 0.540
14 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
15 0.7782 (±0.0154/√100) 💬 Qwen/Qwen2.5-72B-Instruct 96.5 (±17.8) 0.847 0.939 0.549
16 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
17 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
18 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
19 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
20 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
21 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
22 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
23 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
24 0.7601 (±0.0289/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 106.3 (±21.0) 0.864 0.925 0.492
25 0.7538 (±0.0251/√100) 🟢 turing-motors/Llama-3-heron-brain-70B... 101.1 (±16.9) 0.857 0.925 0.479
26 0.7501 (±0.0237/√100) 💬 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 181.0 (±87.4) 0.847 0.923 0.480
27 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
28 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
29 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
30 0.7407 (±0.0170/√10) 💬 google/gemini-1.5-flash-002 68.4 (±20.2) 0.742 0.960 0.519
31 0.7392 (±0.0232/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 93.6 (±23.5) 0.847 0.941 0.429
32 0.7370 (±0.0217/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 97.5 (±19.8) 0.846 0.932 0.433
33 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
34 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
35 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
36 0.7273 (±0.0233/√10) 💬 google/gemini-2.0-flash-exp 60.7 (±16.3) 0.727 0.978 0.476
37 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
38 0.7246 (±0.0250/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 89.8 (±33.9) 0.812 0.940 0.422
39 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
40 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
41 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
42 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
43 0.7084 (±0.0207/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 95.9 (±19.7) 0.835 0.930 0.360
44 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
45 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
46 0.7023 (±0.0271/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 112.6 (±33.2) 0.818 0.901 0.388
47 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
48 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
49 0.6980 (±0.0252/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 98.7 (±50.0) 0.798 0.927 0.369
50 0.6958 (±0.0236/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.814 0.931 0.343
51 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
52 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
53 0.6924 (±0.0232/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 74.1 (±31.4) 0.755 0.948 0.373
54 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
55 0.6853 (±0.0201/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 96.6 (±18.8) 0.815 0.919 0.322
56 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
57 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
58 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
59 0.6715 (±0.0284/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 107.5 (±22.2) 0.787 0.881 0.347
60 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
61 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
62 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
63 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
64 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
65 0.6624 (±0.0000/√1) 💬 openai/chatgpt-o3-mini-high 68.1 (±14.5) 0.632 0.925 0.430
66 0.6616 (±0.0378/√10) 💬 google/gemini-1.0-pro-002 118.7 (±90.9) 0.689 0.894 0.402
67 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
68 0.6572 (±0.0518/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 108.9 (±63.7) 0.764 0.895 0.313
69 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
70 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
71 0.6447 (±0.0251/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 74.3 (±31.3) 0.706 0.934 0.294
72 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
73 0.6420 (±0.0259/√100) 🟢 microsoft/phi-4 104.2 (±15.2) 0.754 0.864 0.309
74 0.6406 (±0.0139/√100) 💬 Qwen/QwQ-32B-Preview 119.1 (±72.2) 0.730 0.897 0.294
75 0.6399 (±0.1763/√100) 💬 turing-motors/Llama-3-heron-brain-70B... 155.4 (±101.8) 0.718 0.805 0.397
76 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
77 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
78 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
79 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
80 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
81 0.6310 (±0.0127/√100) 💬 Qwen/Qwen2.5-32B-Instruct 75.4 (±19.3) 0.634 0.898 0.360
82 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
83 0.6297 (±0.0150/√100) 💬 Qwen/Qwen2.5-32B-Instruct 71.1 (±18.7) 0.634 0.906 0.349
84 0.6295 (±0.0226/√100) 💬 microsoft/phi-4 117.8 (±34.9) 0.706 0.843 0.340
85 0.6294 (±0.0267/√100) 💬 microsoft/phi-4 117.8 (±37.7) 0.705 0.846 0.337
86 0.6291 (±0.0207/√100) 💬 Qwen/QwQ-32B-Preview 229.6 (±135.9) 0.719 0.867 0.301
87 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
88 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
89 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
90 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
91 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
92 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
93 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
94 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
95 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
96 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
97 0.6149 (±0.0153/√100) 💬 Qwen/Qwen2.5-14B-Instruct 76.5 (±18.4) 0.644 0.893 0.308
98 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
99 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
100 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
101 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
102 0.6063 (±0.0213/√100) 💬 Qwen/Qwen2.5-14B-Instruct 80.0 (±21.8) 0.639 0.889 0.290
103 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
104 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
105 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
106 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
107 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
108 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
109 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
110 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
111 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
112 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
113 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
114 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
115 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
116 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
117 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
118 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
119 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
120 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
121 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
122 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
123 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
124 0.5710 (±0.0198/√100) 🟢 mistralai/Mistral-Small-24B-Instruct-... 114.2 (±30.2) 0.684 0.797 0.232
125 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
126 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
127 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
128 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
129 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
130 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
131 0.5590 (±0.0456/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 105.3 (±42.8) 0.648 0.794 0.235
132 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
133 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
134 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
135 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
136 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
137 0.5540 (±0.0773/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 101.9 (±38.4) 0.640 0.773 0.248
138 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
139 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
140 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
141 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
142 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
143 0.5465 (±0.0244/√100) 🟢 SakanaAI/TinySwallow-1.5B-Instruct 105.0 (±26.9) 0.657 0.807 0.176
144 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
145 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
146 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
147 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
148 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
149 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
150 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
151 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
152 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
153 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
154 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
155 0.5283 (±0.0309/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 117.7 (±61.8) 0.616 0.801 0.168
156 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
157 0.5190 (±0.0203/√100) 🟢 mistralai/Mistral-Small-24B-Base-2501 107.2 (±32.7) 0.626 0.771 0.160
158 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
159 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
160 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
161 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
162 0.5122 (±0.0132/√100) 💬 Qwen/Qwen2.5-7B-Instruct 69.5 (±28.7) 0.557 0.847 0.132
163 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
164 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
165 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
166 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
167 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
168 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
169 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
170 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
171 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
172 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
173 0.4970 (±0.0117/√100) 💬 Qwen/Qwen2.5-7B-Instruct 65.0 (±22.0) 0.535 0.858 0.098
174 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
175 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
176 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
177 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
178 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
179 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
180 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
181 0.4912 (±0.0399/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 222.0 (±126.2) 0.594 0.735 0.145
182 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
183 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
184 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
185 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
186 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
187 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
188 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
189 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
190 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
191 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
192 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
193 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
194 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
195 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
196 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
197 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
198 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
199 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
200 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
201 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
202 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
203 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
204 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
205 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
206 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
207 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
208 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
209 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
210 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
211 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
212 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
213 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
214 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
215 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
216 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
217 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
218 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
219 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
220 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
221 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
222 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
223 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
224 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
225 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
226 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
227 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
228 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
229 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
230 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
231 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
232 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
233 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
234 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
235 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
236 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
237 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
238 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
239 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
240 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
241 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
242 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
243 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
244 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
245 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
246 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
247 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
248 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
249 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
250 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
251 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
252 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
253 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
254 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
255 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
256 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
257 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
258 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
259 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
260 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
261 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
262 0.4048 (±0.0263/√20) 💬 ntt/tsuzumi-7b 172.0 (±90.8) 0.491 0.644 0.080
263 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
264 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
265 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
266 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
267 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
268 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
269 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
270 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
271 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
272 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
273 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
274 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
275 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
276 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
277 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
278 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
279 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
280 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
281 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
282 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
283 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
284 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
285 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
286 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
287 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
288 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
289 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
290 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
291 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
292 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
293 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
294 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
295 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
296 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
297 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
298 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
299 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
300 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
301 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
302 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
303 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
304 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
305 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
306 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
307 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
308 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
309 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
310 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
311 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
312 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
313 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
314 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
315 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
316 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
317 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
318 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
319 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
320 0.3338 (±0.0493/√100) 🟢 SakanaAI/TinySwallow-1.5B 142.2 (±109.9) 0.415 0.534 0.052
321 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
322 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
323 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
324 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
325 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
326 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
327 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
328 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
329 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
330 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
331 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
332 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
333 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
334 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
335 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
336 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
337 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
338 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
339 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
340 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
341 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
342 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
343 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
344 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
345 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
346 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
347 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
348 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
349 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
350 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
351 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
352 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
353 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
354 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
355 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
356 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
357 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
358 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
359 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
360 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
361 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
362 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
363 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
364 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
365 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
366 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
367 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
368 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
369 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
370 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
371 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
372 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
373 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
374 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
375 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
376 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
377 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
378 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
379 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
380 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
381 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
382 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
383 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
384 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
385 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
386 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
387 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
388 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
389 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
390 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
391 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
392 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
393 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
394 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
395 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
396 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
397 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
398 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
399 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
400 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
401 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
402 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
403 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
404 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
405 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
406 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000

Citation

If you use this repository, please cite the following paper:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
    title={{Preferred Generation Benchmark}},
    author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
    year={2024},
    url = {https://github.com/pfnet-research/pfgen-bench}
}
