
Update LLM.generate output to include statistics #1034

Open

plaguss wants to merge 34 commits into develop from llm-generate-upgrade
Conversation

plaguss (Contributor) commented Oct 11, 2024

Description

This PR updates the output of llm.generate to make it more feature-rich.

Previously we only returned the generated text:

GenerateOutput = List[Union[str, None]]

Now it is updated to also include statistics related to the generation:

from typing import Any, Dict, List, TypedDict, Union

LLMOutput = List[Union[str, None]]

class TokenCount(TypedDict):
    input_tokens: List[int]
    output_tokens: List[int]

LLMStatistics = Union[TokenCount, Dict[str, Any]]
"""Initially LLMStatistics will only contain the token count, but more variables
can be added once they are defined for every LLM.
"""

class GenerateOutput(TypedDict):
    generations: LLMOutput
    statistics: LLMStatistics

This PR only includes input_tokens and output_tokens as statistics, but we can add as many as needed in the future.
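For illustration, a GenerateOutput value for a single prompt could then look like this (a sketch built from the types above and the metadata example below):

output: GenerateOutput = {
    "generations": ["Hello Magpie"],
    "statistics": {
        "input_tokens": [12],
        "output_tokens": [12],
    },
}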

This information is moved to distilabel_metadata in the following way, to avoid collisions between statistics of different steps:

{
    "generations": ["Hello Magpie"],
    f"statistics_{step_name}": {
        "input_tokens": [12],
        "output_tokens": [12],
    },
}
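A minimal sketch of how a task could build that namespaced key (the helper name is hypothetical, not part of distilabel's API):

def add_statistics_to_metadata(
    metadata: dict, step_name: str, statistics: dict
) -> dict:
    # Prefix the key with the step name so statistics coming from
    # different steps never collide inside distilabel_metadata.
    metadata[f"statistics_{step_name}"] = statistics
    return metadata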

NOTE:
Most Task subclasses reuse the same Task.process method to process the generations, so nothing else has to be done for them, but tasks like Magpie that override the process method have to be updated.

Closes #738

@plaguss plaguss added this to the 1.5.0 milestone Oct 11, 2024
@plaguss plaguss self-assigned this Oct 11, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1034/


codspeed-hq bot commented Oct 11, 2024

CodSpeed Performance Report

Merging #1034 will not alter performance

Comparing llm-generate-upgrade (e97f901) with develop (7c8976b)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss added the enhancement New feature or request label Oct 14, 2024
@plaguss plaguss marked this pull request as ready for review October 25, 2024 07:17
"""
if isinstance(text_or_messages, list):
# If it's a list of messages, concatenate the content of each message
text = " ".join([message["content"] for message in text_or_messages])
Member

I think it would be better to tokenize each message individually and then sum the results to be 100% precise.
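A minimal sketch of that suggestion, assuming a Hugging Face-style tokenizer object with an encode method (the function and parameter names are illustrative):

from typing import List, Union

def count_tokens(text_or_messages: Union[str, List[dict]], tokenizer) -> int:
    if isinstance(text_or_messages, list):
        # Tokenize each message's content individually and sum the counts,
        # instead of concatenating everything into one string first.
        return sum(
            len(tokenizer.encode(message["content"]))
            for message in text_or_messages
        )
    return len(tokenizer.encode(text_or_messages))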

Comment on lines +53 to +54
input_tokens: The number of tokens of the inputs. Defaults to [0].
output_tokens: The number of tokens of the LLM response. Defaults to [0].
Member

Suggested change
input_tokens: The number of tokens of the inputs. Defaults to [0].
output_tokens: The number of tokens of the LLM response. Defaults to [0].
input_tokens: The number of tokens of the inputs. Defaults to `None`.
output_tokens: The number of tokens of the LLM response. Defaults to `None`.

}

@staticmethod
def _prepare_sorted_resuts(
Member

Suggested change
def _prepare_sorted_resuts(
def _prepare_sorted_results(

batched_outputs = _sort_batches(
    batched_outputs, sorted_indices, num_generations=num_generations
)
# Sort the batched outputs together with the statistics
generations = self._prepare_sorted_resuts(
Member

Suggested change
generations = self._prepare_sorted_resuts(
generations = self._prepare_sorted_results(

)
statistics[field] = batched_field

# Regenerates the outputs as they are returned buy `preare_output`
Member

Suggested change
# Regenerates the outputs as they are returned buy `preare_output`
# Regenerates the outputs as they are returned by `prepare_output`

@@ -312,6 +312,7 @@ def process(self, inputs: StepInput) -> "StepOutput":
        self._logger.info(f"📦 Processing internal batch of inputs {i}...")
        results = super().process(batched_inputs)
        for result in next(results):  # Extract the elements from the generator
            print("INTERMEDIATE RESULTS", result)
Member

Suggested change
print("INTERMEDIATE RESULTS", result)

return output


def iterate_generations_with_stats(output: "GenerateOutput") -> "GenerateOutput":
Member

I think the return type hint should be Generator[Tuple[...], None, None]
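A sketch of what that annotated signature could look like; the body is illustrative, not the PR's actual implementation:

from typing import Generator, Tuple, Union

def iterate_generations_with_stats(
    output: "GenerateOutput",
) -> Generator[Tuple[Union[str, None], "LLMStatistics"], None, None]:
    # Yield each generation paired with its slice of the statistics,
    # assuming the statistics lists are aligned with the generations.
    for i, generation in enumerate(output["generations"]):
        stats = {key: [values[i]] for key, values in output["statistics"].items()}
        yield generation, stats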

Comment on lines +269 to +270
messages = [output["generations"][0] for output in outputs]
statistics = [output["statistics"] for output in outputs]
Member

Maybe we can combine these in the same for loop.
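One way to fold the two comprehensions into a single pass, as suggested (a sketch, not necessarily the PR's final code):

messages, statistics = [], []
for output in outputs:
    messages.append(output["generations"][0])
    statistics.append(output["statistics"])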

Comment on lines +22 to +36
# """`StepOutput` is an alias of the typing.
# A step output is a dict of the form:
# {
# "outputs": [
# {"col1": "val1", "col2": "val2"},
# {"col1": "val1", "col2": "val2"},
# {"col1": "val1", "col2": "val2"},
# ],
# "statistics": {
# "llm": {},
# "time": 12341234,
# ...
# }
# }
# """
Member

This should be uncommented, right?



def flatten_dict(x: Dict[Any, Any]) -> Dict[Any, Any]:
return {k: json.dumps(v) if isinstance(v, dict) else v for k, v in x.items()}
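As a side note, flatten_dict (which needs import json in scope) does not flatten nested keys; it JSON-encodes any dict values and leaves other values untouched, for example:

flatten_dict({"a": {"b": 1}, "c": 2})
# -> {"a": '{"b": 1}', "c": 2}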


def merge_dicts(*dict_lists):
Member

Suggested change
def merge_dicts(*dict_lists):
def merge_dicts(*dict_lists: dict) -> list[dict]:
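For reference, a sketch of one plausible implementation matching the suggested signature, where each positional argument is a list of dicts merged element-wise; the PR's actual semantics may differ:

def merge_dicts(*dict_lists: list) -> list:
    # Merge the i-th dict of every input list into a single dict,
    # collapsing parallel lists of dicts into one list.
    return [
        {key: value for d in group for key, value in d.items()}
        for group in zip(*dict_lists)
    ]

# Example: merge_dicts([{"a": 1}], [{"b": 2}]) -> [{"a": 1, "b": 2}]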

@gabrielmbmb gabrielmbmb changed the title Llm generate upgrade Update LLM.generate output to include statistics Nov 15, 2024
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[FEATURE] Update LLM.generate interface to allow returning arbitrary/extra stuff related to the generation