LLM Meta - token limit definition #150

Open
Tomas2D opened this issue Nov 6, 2024 · 2 comments
Labels: enhancement, help wanted, question

Tomas2D (Contributor) commented Nov 6, 2024

Right now, the BaseLLM [class](/src/llms/base.ts) defines an abstract method called meta that provides meta information about a given model. The response interface (LLMMeta) defines a single property called tokenLimit.

The problem is that tokenLimit alone is usually not enough, because providers typically subdivide limits further into the following (a possible extended interface is sketched after this list):

  • input (max input tokens) - for WatsonX, this field is called `max_sequence_length`.
  • output (max generated tokens) - for WatsonX, this field is called `max_output_tokens`.
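One possible shape for an extended meta interface is sketched below. This is only an illustration of the split; the field names (`maxInputTokens`, `maxOutputTokens`, `contextWindow`) are assumptions, not part of the current framework API.

```ts
// Sketch only: a possible extension of LLMMeta.
// Field names are illustrative and not part of the current framework API.
interface ExtendedLLMMeta {
  /** Max tokens accepted as input (WatsonX calls this max_sequence_length). */
  maxInputTokens?: number;
  /** Max tokens the model may generate (WatsonX calls this max_output_tokens). */
  maxOutputTokens?: number;
  /** Total context window (input + output), when the provider reports a single number. */
  contextWindow?: number;
  /** Existing property, kept for backwards compatibility with TokenMemory. */
  tokenLimit: number;
}
```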

Because TokenMemory behavior depends heavily on the tokenLimit value, we must be sure we are not throwing messages away because we retrieved the wrong value from an LLM provider.

The solution to this issue is to figure out a better approach that plays nicely with TokenMemory and other practical usages.

Relates to #159 (Granite context window limit)

Tomas2D added the enhancement, help wanted, and question labels on Nov 6, 2024
michael-desmond (Contributor) commented

So tokenLimit would then be (max_sequence_length - max_output_tokens), ensuring that the input context does not get trimmed during inference?

It seems reasonable that concrete LLM implementations would need to override this method and provide a tokenLimit based on the LLM's context window and max new tokens. Is there something else that I am not considering here?
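A minimal sketch of that idea, assuming WatsonX-style field names (other providers report different fields, and the numbers in the usage comment are purely illustrative):

```ts
// Sketch only: deriving a single tokenLimit the way suggested above.
// Field names are WatsonX-style; other providers differ.
interface ProviderLimits {
  max_sequence_length: number; // max input tokens
  max_output_tokens: number;   // max generated tokens
}

function effectiveTokenLimit(limits: ProviderLimits): number {
  // Reserve room for generation so the input context is not trimmed during inference.
  return limits.max_sequence_length - limits.max_output_tokens;
}

// Illustrative numbers, not from any specific provider:
// effectiveTokenLimit({ max_sequence_length: 8192, max_output_tokens: 2048 }) === 6144
```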

Tomas2D (Contributor, Author) commented Nov 12, 2024

Based on my observations, we can look at token limits from 3 different perspectives:

  • max token input size
  • max token output size (max number of generated tokens)
  • max model context window size

Example with WatsonX and Granite 3

  • max token input size (`max_sequence_length` property, the current value is 4096)
  • max token output size (`max_output_tokens` property, the current value is 8096)
  • max token context window size (not defined explicitly; in practice it is the `max_output_tokens` property, i.e. 8096)

For BAM, only the size of the context window is provided. To detect the max input size, you have to invoke an LLM call with `max_new_tokens: 9999999` to trigger an error saying `property 'max_new_tokens' must be <= XXXX`.
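A hedged sketch of that probing trick. The `generate` callback and the error message format here are assumptions based on the description above, not a real BAM SDK API:

```ts
// Sketch only: probing a limit by forcing a validation error, as described above.
// The `generate` callback and the error text are hypothetical stand-ins.
async function probeLimitViaError(
  generate: (opts: { max_new_tokens: number }) => Promise<unknown>,
): Promise<number | undefined> {
  try {
    // Deliberately request an absurd number of new tokens to trigger provider validation.
    await generate({ max_new_tokens: 9999999 });
  } catch (err) {
    // Expected message shape: "property 'max_new_tokens' must be <= XXXX"
    const match = String(err).match(/must be <= (\d+)/);
    if (match) return Number(match[1]);
  }
  return undefined; // the provider did not reveal a limit
}
```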

For Ollama, only the context window size is provided. Other values seem to be unlimited (no validation error).

For OpenAI, no values are provided; everything must be hard-coded. Limits can be obtained only from an error message.

For Groq, no values are provided (the API probably works similarly to OpenAI).

Now the question is which limit should be passed to TokenMemory and how TokenMemory should behave. In the context of Granite, let's say the memory currently holds 3000 tokens. If I add a new message of 2000 tokens, TokenMemory is forced to remove some old messages (one or more, depending on their sizes) to stay under 4096 tokens (because we initialized TokenMemory with that value). Regarding Bee Agent + TokenMemory, this could lead to a situation where the agent (runner) triggers an error because of this check.
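The arithmetic of that scenario, as a small sketch (numbers taken from the Granite example above; the eviction logic is illustrative, not the actual TokenMemory implementation):

```ts
// Sketch only: the trimming scenario described above, with the Granite numbers.
const tokenLimit = 4096;        // value TokenMemory was initialized with
const usedTokens = 3000;        // tokens already held in memory
const newMessageTokens = 2000;  // size of the incoming message

// (3000 + 2000) - 4096 = 904 tokens over the limit,
// so TokenMemory must drop old messages worth at least 904 tokens.
const overflow = usedTokens + newMessageTokens - tokenLimit;
if (overflow > 0) {
  console.log(`Must evict at least ${overflow} tokens of old messages`);
}
```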

How would you tackle this?
