[BFCL] Add BFCL_V2_Live Dataset (#580)
In this release, we hope to provide insight into whether the model exhibits overfitting with respect to the BFCL public dataset. We introduce the BFCL-Live dataset, which consists of 2.2k real-world function calling scenarios. The dataset is categorized into `simple`, `multiple function`, `parallel function`, `parallel multiple function`, and `relevance detection` groups, all evaluated through AST (Abstract Syntax Tree). By comparing scores across the two BFCL datasets, we aim to identify any signs of data contamination. This will help ensure our model's performance is both robust and reliable across different data environments.

To read more about the composition and construction of this live dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html).

Thanks to @yixinhuang48 and @JasonHuang1103 for helping clean the dataset.

---------

**Also in this PR**:

1. Update to BFCL Dataset Format:
   - In the V1 version of BFCL, the `question` field represented the user's query. With the introduction of V2_Live, the format has been updated to accommodate system prompts, user prompts, and assistant responses.
   - To ensure consistency, messages from the V1 dataset have been converted to the V2_Live format. For example, a V1 entry like `"What is the weather like in Berkeley, CA"` is now represented as `"[{"role": "user", "content": "What is the weather like in Berkeley, CA"}]"`.
   - Consequently, all V1 datasets have been renamed to V2 to reflect this change, signaling that they are not backward-compatible.
   - All model handlers and the eval checker have been updated accordingly.
2. Update to the overall_accuracy calculation formula:
   - For the BFCL V2 Leaderboard, the overall accuracy will be the **unweighted** average of the following sub-categories:
     - `"exec_simple", "exec_parallel", "exec_multiple", "exec_parallel_multiple", "simple", "irrelevance", "parallel", "multiple", "parallel_multiple", "java", "javascript", "rest", "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"`
   - For the BFCL V2 Live Leaderboard (which contains only the Live categories), the overall accuracy will be the **weighted** average of the Live sub-categories:
     - `"live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"`
3. Simplification of Claude Handlers:
   - Previously, the codebase included two separate handlers: `ClaudeFCHandler` (for Claude models in FC mode) and `ClaudePromptingHandler` (for Claude models in prompting mode).
   - This PR merges these into a single `ClaudeHandler`, streamlining the code without altering functionality.
4. Improve Error Log Readability
5. Resolve #485

---------

Co-authored-by: Charlie Cheng-Jie Ji <[email protected]>
Co-authored-by: Fanjia Yan <[email protected]>
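For readers migrating their own data, the V1-to-V2 `question` change described in item 1 can be sketched roughly as below. This is an illustrative sketch only: the helper name `to_v2_question` is hypothetical and not part of the BFCL codebase, and it assumes a V1 question is a plain string while a V2 question is a list of role/content messages.

```python
def to_v2_question(question):
    """Wrap a V1 plain-string question as a V2_Live-style message list.

    Hypothetical helper for illustration; not taken from the BFCL codebase.
    """
    if isinstance(question, str):
        # V1: the field held only the user's query text.
        return [{"role": "user", "content": question}]
    # Assumed to already be a list of messages (V2 format); leave untouched.
    return question


v1_entry = {"question": "What is the weather like in Berkeley, CA"}
v2_entry = {**v1_entry, "question": to_v2_question(v1_entry["question"])}
print(v2_entry["question"])
```

Making the helper a no-op on entries that are already message lists means it can be run over a mixed dataset without double-wrapping.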
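The difference between the two overall-accuracy formulas in item 2 can be illustrated with a small sketch. The function names, the `accuracy` and `total_count` keys, and the sample numbers are all assumptions made for illustration; in particular, the commit message does not say what the weights are, so weighting by the number of test entries per category is an assumption here.

```python
def unweighted_average(results):
    """Mean of per-category accuracies; every category counts equally."""
    return sum(r["accuracy"] for r in results.values()) / len(results)


def weighted_average(results):
    """Mean of per-category accuracies, weighted by entry count (assumed weight)."""
    total = sum(r["total_count"] for r in results.values())
    return sum(r["accuracy"] * r["total_count"] for r in results.values()) / total


# Made-up numbers, for illustration only.
live_results = {
    "live_simple": {"accuracy": 0.5, "total_count": 100},
    "live_multiple": {"accuracy": 0.75, "total_count": 300},
}

print(unweighted_average(live_results))  # 0.625
print(weighted_average(live_results))    # 0.6875
```

With unequal category sizes the two averages diverge: the weighted form lets large categories dominate, while the unweighted form treats each category as one vote.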