[BFCL] Add BFCL_V2_Live Dataset (#580)
In this release, we hope to provide insight into whether a model
exhibits overfitting with respect to the public BFCL dataset.
We are introducing the BFCL-Live dataset, which consists of 2.2k real-world
function-calling scenarios. The dataset is categorized into `simple`,
`multiple function`, `parallel function`, `parallel multiple function`,
and `relevance detection` groups, all evaluated through AST (Abstract
Syntax Tree) checking; a simplified illustration of such a check follows.
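
As a loose illustration of AST-based checking, here is a minimal sketch that parses a model's function-call string and extracts the call name and keyword arguments for comparison. This is a simplification for illustration only, not BFCL's actual checker:

```python
import ast

def parse_call(source: str):
    """Parse a function-call string into (name, {keyword: value}) via the AST."""
    call = ast.parse(source, mode="eval").body
    assert isinstance(call, ast.Call), "expected a single function call"
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return call.func.id, kwargs

# A model output in Python function-call syntax (hypothetical example).
name, kwargs = parse_call("get_current_weather(location='Berkeley, CA', unit='fahrenheit')")
print(name, kwargs)
# get_current_weather {'location': 'Berkeley, CA', 'unit': 'fahrenheit'}
```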

By comparing scores across the two BFCL datasets, we aim to identify any
signs of data contamination. This will help ensure our model's
performance is both robust and reliable across different data
environments.

To read more about the composition and construction of this live
dataset, please refer to our
[blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html).

Thanks to @yixinhuang48 and @JasonHuang1103 for helping clean the
dataset.

---------

**Also in this PR**:

1. Update to BFCL Dataset Format:

- In the V1 version of BFCL, the `question` field held the user's query as
a plain string. With the introduction of V2_Live, the format has been
updated to accommodate system prompts, user prompts, and assistant
responses.
- To ensure consistency, messages from the V1 dataset have been
converted to the V2_Live format. For example, a V1 entry like `"What is
the weather like in Berkeley, CA"` is now represented as `[{"role":
"user", "content": "What is the weather like in Berkeley, CA"}]` (a
minimal conversion sketch follows this list).
- Consequently, all V1 datasets have been renamed to V2 to reflect this
change, signaling that they are not backward-compatible.
- All model handlers and the eval checker have been updated accordingly.
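
A minimal sketch of the V1-to-V2_Live conversion described above (the helper name is hypothetical, not the repository's actual function):

```python
import json

def v1_question_to_v2_messages(question: str) -> list:
    # Wrap a V1 plain-string question in the V2_Live chat-message format.
    return [{"role": "user", "content": question}]

v1_entry = "What is the weather like in Berkeley, CA"
print(json.dumps(v1_question_to_v2_messages(v1_entry)))
# [{"role": "user", "content": "What is the weather like in Berkeley, CA"}]
```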

2. Update to the `overall_accuracy` calculation formula:
- For the BFCL V2 Leaderboard, the overall accuracy is the
**unweighted** average of the sub-categories:

- `"exec_simple", "exec_parallel", "exec_multiple",
"exec_parallel_multiple", "simple", "irrelevance", "parallel",
"multiple", "parallel_multiple", "java", "javascript", "rest",
"live_simple", "live_multiple", "live_parallel",
"live_parallel_multiple", "live_irrelevance", "live_relevance"`

- For the BFCL V2 Live Leaderboard (which contains only the Live categories),
the overall accuracy is the **weighted** average of the Live
sub-categories; see the sketch after this list.
- `"live_simple", "live_multiple", "live_parallel",
"live_parallel_multiple", "live_irrelevance", "live_relevance"`

3. Simplification of Claude Handlers:

- Previously, the codebase included two separate handlers:
`ClaudeFCHandler` (for Claude models in FC mode) and
`ClaudePromptingHandler` (for Claude models in prompting mode).
- This PR merges these into a single `ClaudeHandler`, streamlining the
code without altering functionality; a rough sketch of the merged shape
follows.
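
A hypothetical sketch of the merged handler shape; the class internals, method names, and the `-FC` suffix check are illustrative assumptions, not the repository's actual API:

```python
class ClaudeHandler:
    """One handler covering both FC (native tool-use) and prompting modes."""

    def __init__(self, model_name: str, temperature: float = 0.001):
        self.model_name = model_name
        self.temperature = temperature
        # Branch on mode inside one class instead of maintaining two
        # near-identical handler classes.
        self.is_fc_mode = model_name.endswith("-FC")

    def inference(self, prompt, functions):
        if self.is_fc_mode:
            return self._inference_fc(prompt, functions)
        return self._inference_prompting(prompt, functions)

    def _inference_fc(self, prompt, functions):
        raise NotImplementedError  # would call Claude's native tool-use API

    def _inference_prompting(self, prompt, functions):
        raise NotImplementedError  # would embed function docs in the prompt
```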

4. Improve Error Log Readability

5. Resolve #485.

---------

Co-authored-by: Charlie Cheng-Jie Ji <[email protected]>
Co-authored-by: Fanjia Yan <[email protected]>
3 people authored Aug 19, 2024
1 parent 3d89d60 commit 30124c4
Showing 84 changed files with 7,996 additions and 4,021 deletions.
74 changes: 43 additions & 31 deletions berkeley-function-call-leaderboard/README.md
@@ -104,6 +104,8 @@ Below is *a table of models we support* to run our leaderboard evaluation against
|gpt-4o-2024-08-06 | Prompt|
|gpt-4o-2024-05-13-FC | Function Calling|
|gpt-4o-2024-05-13| Prompt|
|gpt-4o-mini-2024-07-18-FC | Function Calling|
|gpt-4o-mini-2024-07-18 | Prompt|
|google/gemma-7b-it 💻| Prompt|
|meetkai/functionary-medium-v3.1-FC| Function Calling|
|meetkai/functionary-small-{v3.1,v3.2}-FC| Function Calling|
@@ -145,33 +147,42 @@ For `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace an
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:

- `all`: Run all test categories.
- This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!
> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.
> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
* Available test groups:
* `all`: All test categories.
* This is the default option if no test category is provided.
* `live`: All user-contributed live test categories.
* `non_live`: All non-user-contributed test categories (the opposite of `live`).
* `ast`: Abstract Syntax Tree tests.
* `executable`: Executable code evaluation tests.
* `python`: Tests specific to Python code.
* `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
* `python_ast`: Python Abstract Syntax Tree tests.
* Available individual test categories:
* `simple`: Simple function calls.
* `parallel`: Multiple function calls in parallel.
* `multiple`: Multiple function calls in sequence.
* `parallel_multiple`: Multiple function calls in parallel and in sequence.
* `java`: Java function calls.
* `javascript`: JavaScript function calls.
* `exec_simple`: Executable function calls.
* `exec_parallel`: Executable multiple function calls in parallel.
* `exec_multiple`: Executable multiple function calls in sequence.
* `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
* `rest`: REST API function calls.
* `irrelevance`: Function calls with irrelevant function documentation.
* `live_simple`: User-contributed simple function calls.
* `live_multiple`: User-contributed multiple function calls in sequence.
* `live_parallel`: User-contributed multiple function calls in parallel.
* `live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
* `live_irrelevance`: User-contributed function calls with irrelevant function documentation.
* `live_relevance`: User-contributed function calls with relevant function documentation.
* If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all`, `non_live`, `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!
> If you do not wish to provide API keys for REST API testing, set `--test-category` to any non-executable category.
> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include any executable category (e.g., the test name contains `exec`), the evaluation process will first perform a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints not behaving as expected will be flagged in the console, and execution will continue.

## Evaluating the LLM generations
@@ -181,7 +192,7 @@ In the following two sections, the optional `--test-category` parameter can be used
Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:

```bash
python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
```

For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section.
@@ -202,16 +213,16 @@ If you want to evaluate all offline tests (do not require RapidAPI keys) for OpenAI
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
```

If you want to run `rest` tests for a few Claude models, you can use the following command:
If you want to run the `rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
```

If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `live_simple` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category live_simple javascript
```

### Model-Specific Optimization
@@ -221,6 +232,7 @@ Some companies have proposed some optimization strategies in their models' handlers

## Changelog

* [August 19, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard: