LLM Meta - token limit definition #150

Open
Tomas2D opened this issue Nov 6, 2024 · 2 comments
Labels: enhancement, help wanted, question

Tomas2D (Contributor) commented Nov 6, 2024

Right now, the BaseLLM [class](/src/llms/base.ts) defines an abstract method called meta that provides meta information about a given model. The response interface (LLMMeta) defines a single property called tokenLimit.

The problem is that tokenLimit alone is usually not enough, because providers typically subdivide limits further into the following (a possible extended interface is sketched after this list):

  • input (max input tokens) - for WatsonX, this field is called `max_sequence_length`.
  • output (max generated tokens) - for WatsonX, this field is called `max_output_tokens`.
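One possible shape for an extended meta interface is sketched below. This is only an illustration of the split; the field names (`maxInputTokens`, `maxOutputTokens`, `contextWindow`) are assumptions, not part of the current framework API.

```ts
// Sketch only: a possible extension of LLMMeta.
// Field names are illustrative and not part of the current framework API.
interface ExtendedLLMMeta {
  /** Max tokens accepted as input (WatsonX calls this max_sequence_length). */
  maxInputTokens?: number;
  /** Max tokens the model may generate (WatsonX calls this max_output_tokens). */
  maxOutputTokens?: number;
  /** Total context window (input + output), when the provider reports a single number. */
  contextWindow?: number;
  /** Existing property, kept for backwards compatibility with TokenMemory. */
  tokenLimit: number;
}
```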

Because TokenMemory behavior depends heavily on the tokenLimit value, we must be sure we are not throwing messages away because we retrieved the wrong value from an LLM provider.

The solution to this issue is to figure out a better approach that plays nicely with TokenMemory and other practical usages.

Relates to #159 (Granite context window limit)

Tomas2D added the enhancement, help wanted, and question labels on Nov 6, 2024
michael-desmond (Contributor) commented

So tokenLimit would then be (max_sequence_length - max_output_tokens), ensuring that the input context does not get trimmed during inference?

It seems reasonable that concrete LLM implementations would need to override this method and provide a tokenLimit based on the LLM's context window and max new tokens. Is there something else that I am not considering here?
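A minimal sketch of that idea, assuming WatsonX-style field names (other providers report different fields, and the numbers in the usage comment are purely illustrative):

```ts
// Sketch only: deriving a single tokenLimit the way suggested above.
// Field names are WatsonX-style; other providers differ.
interface ProviderLimits {
  max_sequence_length: number; // max input tokens
  max_output_tokens: number;   // max generated tokens
}

function effectiveTokenLimit(limits: ProviderLimits): number {
  // Reserve room for generation so the input context is not trimmed during inference.
  return limits.max_sequence_length - limits.max_output_tokens;
}

// Illustrative numbers, not from any specific provider:
// effectiveTokenLimit({ max_sequence_length: 8192, max_output_tokens: 2048 }) === 6144
```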

Tomas2D (Contributor, Author) commented Nov 12, 2024

Based on my observations, we can look at token limits from 3 different perspectives:

  • max token input size
  • max token output size (max number of generated tokens)
  • max model context window size

Example with WatsonX and Granite 3

  • max token input size (`max_sequence_length` property, the current value is 4096)
  • max token output size (`max_output_tokens` property, the current value is 8096)
  • max token context window size (not defined explicitly; in practice it is the `max_output_tokens` property, i.e. 8096)

For BAM, only the size of the context window is provided. To detect the max input size, you have to invoke an LLM call with `max_new_tokens: 9999999` to trigger an error saying `property 'max_new_tokens' must be <= XXXX`.
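A hedged sketch of that probing trick. The `generate` callback and the error message format here are assumptions based on the description above, not a real BAM SDK API:

```ts
// Sketch only: probing a limit by forcing a validation error, as described above.
// The `generate` callback and the error text are hypothetical stand-ins.
async function probeLimitViaError(
  generate: (opts: { max_new_tokens: number }) => Promise<unknown>,
): Promise<number | undefined> {
  try {
    // Deliberately request an absurd number of new tokens to trigger provider validation.
    await generate({ max_new_tokens: 9999999 });
  } catch (err) {
    // Expected message shape: "property 'max_new_tokens' must be <= XXXX"
    const match = String(err).match(/must be <= (\d+)/);
    if (match) return Number(match[1]);
  }
  return undefined; // the provider did not reveal a limit
}
```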

For Ollama, only the context window size is provided. Other values seem to be unlimited (no validation error).

For OpenAI, no values are provided; everything must be hard-coded. Limits can be obtained only from an error message.

For Groq, no values are provided (the API probably works similarly to OpenAI).

Now the question is which limit should be passed to TokenMemory and how TokenMemory should behave. In the context of Granite, let's say the memory currently holds 3000 tokens. If I add a new message of 2000 tokens, TokenMemory is forced to remove some old messages (one or more, depending on their sizes) to stay under 4096 tokens (because we initialized TokenMemory with that value). Regarding Bee Agent + TokenMemory, this could lead to a situation where the agent (runner) triggers an error because of this check.
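The arithmetic of that scenario, as a small sketch (numbers taken from the Granite example above; the eviction logic is illustrative, not the actual TokenMemory implementation):

```ts
// Sketch only: the trimming scenario described above, with the Granite numbers.
const tokenLimit = 4096;        // value TokenMemory was initialized with
const usedTokens = 3000;        // tokens already held in memory
const newMessageTokens = 2000;  // size of the incoming message

// (3000 + 2000) - 4096 = 904 tokens over the limit,
// so TokenMemory must drop old messages worth at least 904 tokens.
const overflow = usedTokens + newMessageTokens - tokenLimit;
if (overflow > 0) {
  console.log(`Must evict at least ${overflow} tokens of old messages`);
}
```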

How would you tackle this?
