-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Token Usage Tracking #27
Conversation
Thank you very much for the great PR, @ZachZimm! |
Can we implement this as well? We need 1, to make it easier to define when we need it. And 2 is a simple implementation, all you need to do is to pass the usage after streaming all tokens and just before yielding done (i.e., https://github.com/Blaizzy/fastmlx/blob/main/fastmlx/utils.py#L407). |
Please run pre-commit to fix styling check errors. |
I pushed a new commit based on your feedback. I did run pre-commit this time (sorry, I'd never heard of it before) and there shouldn't be any extra comments or prints. |
@ZachZimm please update the tests to support token info, this is what causing them to fail. |
This PR resolves issue #8 by adding token usage information to the response messages. The response conforms to the schema I found here for non-streaming and here for streaming, at the very bottom.
In this implementation, all streamed chunks include up-to-date information about token usage.
This streaming implementation differs slightly from openai's in that their api accepts a
stream_options: dict
parameter which can include"include_usage"=True
, otherwise it will not stream back usage information. Additionally, they put forth a few rules about which chunks of the stream will contain usage data.I did not add a
count_tokens
function as suggested in the issue, as it seems better to do this during generation since we are already using the tokenizer and in the case of non-streaming we already had an array of prompt tokens.