New benchmark run from scratch #128

haesleinhuepf · 2024-09-09T08:06:30Z

After merging these pull-requests:

We need to rerun all benchmarks. I don't think we should benchmark every model we covered before; for example, I consider gpt-3.5 and gemini 1.0 outdated.

We should discuss a potential timeline. @jkh1 could you help us with this again? If yes, when would be a good time for you? (e.g. October, November, December?)

Let's start a list of models we should definitely benchmark (from my limited perspective):

claude-3-5-sonnet-20240620
claude-3-opus-20240229
gpt-4-0613
gpt-4o-2024-08-06
gpt-4o-mini-2024-07-18
gemini-1.5-flash
gemini-1.5-pro
deepseek-coder-v2:16b
llama3.1-70b-instruct
llama3.1-8b-instruct
phi3.5-3.8b-mini-instruct
phi3.5-3.8b
codegemma-2b
codegemma-7b

Wishlist (models that might be interesting but are hard to benchmark):

llama3.1 405b
deepseek-coder-v2 236b

Optional models that I'm not sure about:

gemma2:2b
gemma2:9b
gemma2:27b
mistral:7b
mixtral:8x7b
mixtral:8x22b

Selection criteria: We should cover both commercial and open-weight models properly. To avoid wasting resources, we should also exclude models that performed poorly in the previous run.
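
To make the criteria concrete, here is a rough sketch of how such a filter could look. Everything in it is an assumption for illustration: the `Candidate` class, the `select_models` function, the placeholder model names, the pass rates and the `min_pass_rate` threshold are all hypothetical and not taken from the existing benchmark code or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record describing one model under consideration."""
    name: str
    open_weight: bool
    # Pass rate from the previous run; None means the model was not benchmarked before.
    previous_pass_rate: Optional[float] = None


def select_models(candidates: List[Candidate], min_pass_rate: float) -> List[Candidate]:
    """Keep models that are new or reached the threshold in the previous run."""
    selected = [
        c for c in candidates
        if c.previous_pass_rate is None or c.previous_pass_rate >= min_pass_rate
    ]
    # Check the coverage criterion: both commercial and open-weight models must remain.
    if {c.open_weight for c in selected} != {True, False}:
        raise ValueError("selection must cover both commercial and open-weight models")
    return selected


# Placeholder names and numbers, purely illustrative.
candidates = [
    Candidate("commercial-a", open_weight=False, previous_pass_rate=0.5),
    Candidate("commercial-b", open_weight=False, previous_pass_rate=0.1),
    Candidate("open-c", open_weight=True),
    Candidate("open-d", open_weight=True, previous_pass_rate=0.05),
]
selected_names = [c.name for c in select_models(candidates, min_pass_rate=0.2)]
print(selected_names)
```

Models without a previous result are kept on purpose, so that newly released models are not excluded just because they were absent from the former run. The actual threshold would need to be agreed on in this thread.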

Please post your opinion below. If we decide to include more models, I will update the list above.
