We need to rerun all benchmarks. I don't think we should benchmark every model we covered before; for example, I consider gpt-3.5 and gemini 1.0 outdated.
We should discuss a potential timeline. @jkh1 could you help us with this again? If yes, when would be a good time for you? (e.g. October, November, December?)
Let's start a list of models we should definitely benchmark (from my limited perspective):
claude-3-5-sonnet-20240620
claude-3-opus-20240229
gpt-4-0613
gpt-4o-2024-08-06
gpt-4o-mini-2024-07-18
gemini-1.5-flash
gemini-1.5-pro
deepseek-coder-v2:16b
llama3.1-70b-instruct
llama3.1-8b-instruct
phi3.5-3.8b-mini-instruct
phi3.5-3.8b
codegemma-2b
codegemma-7b
Wishlist (models that might be interesting and hard to benchmark):
llama3.1 405b
deepseek-coder-v2 236b
Optional models, where I'm not sure:
gemma2:2b
gemma2:9b
gemma2:27b
mistral:7b
mixtral:8x7b
mixtral:8x22b
Selection criteria: We should cover both commercial and open-weight models properly. We should also exclude models that performed poorly in the previous run so that we don't waste resources.
If you post your opinion below and we agree to include more models, I will update the list above.
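To make the selection criteria above easier to discuss, here is a minimal Python sketch of how they could be applied to a candidate list. Everything in it is an assumption for illustration: the `Candidate` class, the `select_models` and `covers_both_kinds` helpers, the placeholder model names, the pass rates, and the threshold are all hypothetical and do not come from the previous benchmark run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one model under consideration."""
    name: str
    open_weight: bool
    # Score from the previous run; None means the model was not benchmarked before.
    previous_pass_rate: Optional[float] = None


def select_models(candidates: List[Candidate],
                  min_previous_pass_rate: float) -> List[Candidate]:
    """Keep models that are new or that met the threshold in the previous run."""
    return [
        c for c in candidates
        if c.previous_pass_rate is None
        or c.previous_pass_rate >= min_previous_pass_rate
    ]


def covers_both_kinds(selected: List[Candidate]) -> bool:
    """Check that the selection has at least one commercial and one open-weight model."""
    has_open = any(c.open_weight for c in selected)
    has_commercial = any(not c.open_weight for c in selected)
    return has_open and has_commercial


# Placeholder entries with made-up scores, only to show the mechanics.
candidates = [
    Candidate("commercial-model-a", open_weight=False, previous_pass_rate=0.6),
    Candidate("open-model-b", open_weight=True, previous_pass_rate=0.05),
    Candidate("open-model-c", open_weight=True),  # new model, no previous result
]

selected = select_models(candidates, min_previous_pass_rate=0.2)
print([c.name for c in selected])
print(covers_both_kinds(selected))
```

The threshold is left as a parameter on purpose, since where to draw the line for "poor performance" is something we would still need to agree on.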
After merging these pull-requests: