Effective LLM Inference Evaluation

Effective LLM Inference Evaluation is a project aimed at measuring the real-world performance of Large Language Model (LLM) inference frameworks, inspired by the concepts in deepspeed-fastgen.

In interactive applications like chat apps, traditional metrics such as end-to-end latency don't fully capture the user experience. Consider a chat scenario: a user sends a prompt, waits for the first token, and then receives subsequent tokens as they are produced; a delay at any of these stages degrades the experience. Because prompts and responses vary widely in length, fixed SLA values for throughput and latency aren't feasible. We therefore define:

  1. Prompt latency SLA: the user's prompt must be processed at a rate of at least p tokens per second, scaled to the length of the input prompt.
  2. Generation latency SLA: generated tokens must be delivered at a rate of at least g tokens per second, aligned with human reading speed.

Requests meeting both standards are counted as successful, and the combined throughput of successful requests is termed 'effective throughput'.
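
As a rough illustration, the sketch below shows how these definitions might be checked in practice. It is not part of this repository: the names (`RequestResult`, `meets_slas`, `effective_throughput`), the example SLA values p = 512 and g = 10 tokens per second, and the choice to count both prompt and generated tokens toward effective throughput are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical SLA targets (p and g); the project does not fix specific values.
PROMPT_SLA_TPS = 512.0   # p: target prompt-processing speed, tokens/sec
GEN_SLA_TPS = 10.0       # g: target generation speed, tokens/sec


@dataclass
class RequestResult:
    prompt_tokens: int          # number of tokens in the user's prompt
    generated_tokens: int       # number of tokens in the response
    time_to_first_token: float  # seconds from sending the prompt to the first token
    generation_time: float      # seconds spent producing the remaining tokens


def meets_slas(r: RequestResult) -> bool:
    """A request is successful only if it satisfies both SLAs."""
    prompt_ok = r.prompt_tokens / r.time_to_first_token >= PROMPT_SLA_TPS
    gen_ok = r.generated_tokens / r.generation_time >= GEN_SLA_TPS
    return prompt_ok and gen_ok


def effective_throughput(results: List[RequestResult], wall_clock_seconds: float) -> float:
    """Tokens per second, counting only requests that met both SLAs.

    Counting prompt + generated tokens is one possible accounting choice,
    assumed here for illustration.
    """
    effective_tokens = sum(
        r.prompt_tokens + r.generated_tokens
        for r in results
        if meets_slas(r)
    )
    return effective_tokens / wall_clock_seconds
```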