New benchmark run from scratch #128

Open
haesleinhuepf opened this issue Sep 9, 2024 · 0 comments
After merging these pull requests:

We need to rerun all benchmarks. I don't think we should benchmark every model we covered before, though; e.g. I consider gpt-3.5 and Gemini 1.0 outdated.

We should discuss a potential timeline. @jkh1 could you help us with this again? If yes, when would be a good time for you? (e.g. October, November, December?)

Let's start a list of models we should definitely benchmark (from my limited perspective):

  • claude-3-5-sonnet-20240620
  • claude-3-opus-20240229
  • gpt-4-0613
  • gpt-4o-2024-08-06
  • gpt-4o-mini-2024-07-18
  • gemini-1.5-flash
  • gemini-1.5-pro
  • deepseek-coder-v2:16b
  • llama3.1-70b-instruct
  • llama3.1-7b-instruct
  • phi3.5-3.8b-mini-instruct
  • phi3.5-3.8b
  • codegemma-2b
  • codegemma-7b

Wishlist (models that might be interesting but are hard to benchmark):

  • llama3.1 405b
  • deepseek-coder-v2 236b

Optional models I'm not sure about:

  • gemma2:2b
  • gemma2:9b
  • gemma2:27b
  • mistral:7b
  • mixtral-8:7b
  • mixtral-8:22b

Selection criteria: We should cover both commercial and open-weight models properly. We should also exclude models that showed poor performance in the previous run, so we don't waste resources.
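To make the second criterion concrete, here is a minimal sketch of how the filtering could work. The helper name, the score values, and the 0.5 cutoff are all made-up illustrations, not part of the existing benchmark code.

```python
# Hypothetical helper: drop models whose score in the previous
# benchmark run fell below a cutoff. Scores and threshold are
# invented for illustration.

def select_models(previous_scores, threshold=0.5):
    """Keep only models that scored at least `threshold` before."""
    return sorted(m for m, s in previous_scores.items() if s >= threshold)

# Example with invented scores:
scores = {
    "gpt-4o-2024-08-06": 0.9,
    "gemini-1.5-pro": 0.8,
    "gpt-3.5": 0.3,  # below threshold, excluded as outdated
}
print(select_models(scores))  # ['gemini-1.5-pro', 'gpt-4o-2024-08-06']
```

A real run would of course pull the scores from the previous benchmark results rather than hard-coding them.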

If you post your opinion below and we agree to include more models, I will update the list above.
