Release v0.5.2 · stanford-crfm/helm

Updated VHELM scenarios for VLMs (#2719, #2684, #2685, #2641, #2691)
Updated Image2Struct scenarios (#2608, #2640, #2660, #2661)
Added Automatic GPT4V Evaluation for VLM Originality Evaluation
Added FinQA scenario (#2588)
Added AIR-Bench 2024 (#2698, #2706, #2710, #2712, #2713)
Fixed entity_data_imputation scenario breakage by mirroring source data files (#2750)

Added google-cloud-aiplatform~=1.48 dependency requirement for Vertex AI client (#2628)
Fixed bug with Vertex AI client error handling (#2614)
Fixed bug with Arctic tokenizer (#2615)
Added Qwen1.5 110B Chat (#2621)
Added TogetherCompletionClient (#2629)
Fixed bugs with Yi Chat and Llama 3 Chat on Together (#2636)
Added Optimum Intel (#2609, #2674)
Added GPT-4o model (#2649, #2656)
Added SEA-LION 7B and SEA-LION 7B Instruct (#2647)
Added more Gemini 1.5 Flash and Pro versions (#2653, #2664, #2718)
Added Gemini 1.0 Pro 002 (#2664)
Added Command R and Command R+ models (#2548)
Fixed GPT4V Evaluator Out of Option Range Issue (#2677)
Added OLMo 1.5 (#2671)
Added RekaClient (#2675)
Added PaliGemma (#2683)
Added Mistral 7B Instruct v0.1, v0.2 and v0.3 (#2665)
Switched most Together chat models to use the chat client (#2703, #2701, #2705)
Added MedLM model (#2696, #2709)
Added Typhoon v1.5 models (#2659)
Changed HuggingFaceClient to truncate end of text token (#2643)
Added Qwen2 Instruct (72B) (#2722)
Added Yi Large (#2723, #1731)
Added Sailor models (#2658)
Added BioMistral and Meditron (#2728)
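
The `google-cloud-aiplatform~=1.48` requirement listed above uses a PEP 440 compatible-release clause, which is roughly equivalent to `>=1.48, ==1.*`. The sketch below illustrates which versions such a pin would accept. The helper names `parse_release` and `is_compatible_release` are hypothetical and are not part of HELM or pip. The code assumes purely numeric dotted versions (no pre-release or local suffixes), so treat it as an illustration rather than a substitute for pip's resolver.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse a purely numeric dotted version such as '1.48.0' into ints.

    Assumption: no pre-release, post-release, or local version suffixes.
    """
    return tuple(int(part) for part in version.split("."))


def is_compatible_release(candidate: str, spec: str) -> bool:
    """Return True if `candidate` would satisfy the clause `~=<spec>`.

    Hypothetical helper for illustration only: `~=X.Y` accepts versions
    that are >= X.Y and share every release segment except the last.
    """
    spec_parts = parse_release(spec)
    if len(spec_parts) < 2:
        raise ValueError("'~=' needs at least two release segments")
    cand_parts = parse_release(candidate)

    # All but the last segment of the spec must match exactly.
    prefix = spec_parts[:-1]
    if cand_parts[: len(prefix)] != prefix:
        return False

    # Pad with zeros so that '1.48' and '1.48.0' compare as equal.
    width = max(len(cand_parts), len(spec_parts))
    padded_cand = cand_parts + (0,) * (width - len(cand_parts))
    padded_spec = spec_parts + (0,) * (width - len(spec_parts))
    return padded_cand >= padded_spec


if __name__ == "__main__":
    for version in ["1.47.0", "1.48", "1.50.2", "2.0.0"]:
        print(version, is_compatible_release(version, "1.48"))
```

Under these assumptions, any 1.x release from 1.48 onward is accepted, while 1.47.x and 2.x are rejected.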

Miscellaneous improvements and bug fixes (#2618, #2617, #2616, #2651, #2667, #2724)

MMLU v1.2.0
- Added results for DBRX Instruct, DeepSeek LLM Chat (67B), Gemini 1.5 Pro (0409 preview), Mistral Small (2402), Mistral Large (2402), Arctic Instruct
MMLU v1.3.0
- Added results for Gemini 1.5 Flash (0514 preview), GPT-4o (2024-05-13), Palmyra X V3 (72B)
MMLU v1.4.0
- Added results for Yi Large (Preview), OLMo 1.7 (7B), Command R, Command R Plus, Gemini 1.5 Flash (001), Gemini 1.5 Pro (001), Mistral Instruct v0.3 (7B), GPT-4 Turbo (2024-04-09), Qwen1.5 Chat (110B), Qwen2 Instruct (72B)
Image2Struct v1.0.0
- Initial release with Claude 3 Sonnet (20240229), Claude 3 Opus (20240229), Gemini 1.0 Pro Vision, Gemini 1.5 Pro (0409 preview), IDEFICS 2 (8B), IDEFICS-instruct (9B), IDEFICS-instruct (80B), LLaVA 1.5 (13B), LLaVA 1.6 (13B), GPT-4o (2024-05-13), GPT-4V (1106 preview), Qwen-VL Chat
AIR-Bench v1.0.0
- Initial release with Claude 3 Haiku (20240307), Claude 3 Sonnet (20240229), Claude 3 Opus (20240229), Cohere Command R, Cohere Command R Plus, DBRX Instruct, DeepSeek LLM Chat (67B), Gemini 1.5 Pro (001, default safety), Gemini 1.5 Flash (001, default safety), Llama 3 Instruct (8B), Llama 3 Instruct (70B), Yi Chat (34B), Mistral Instruct v0.3 (7B), Mixtral Instruct (8x7B), Mixtral Instruct (8x22B), GPT-3.5 Turbo (0613), GPT-3.5 Turbo (1106), GPT-3.5 Turbo (0125), GPT-4 Turbo (2024-04-09), GPT-4o (2024-05-13), Qwen1.5 Chat (72B)

Thank you to the following contributors for your work on this HELM release!
