Effective LLM Inference Evaluation

Effective LLM Inference Evaluation is a project aimed at measuring the real-world performance of Large Language Model (LLM) inference frameworks, inspired by the concepts in deepspeed-fastgen.

In interactive applications like chat apps, traditional metrics such as end-to-end latency don't fully capture the user experience. Consider a chat scenario: a user sends a prompt, waits for the first token, and then receives subsequent tokens as they are produced; a delay at any of these stages degrades the experience. Because prompts and responses vary widely in length, fixed SLA values for throughput and latency aren't feasible. We therefore define:

  1. Prompt latency SLA: the user's prompt must be processed at a rate of at least p tokens per second, scaled to the length of the input prompt.
  2. Generation latency SLA: generated tokens must be delivered at a rate of at least g tokens per second, aligned with human reading speed.

Requests meeting both standards are counted as successful, and the combined throughput of successful requests is termed 'effective throughput'.
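
As a rough illustration, the sketch below shows how these definitions might be checked in practice. It is not part of this repository: the names (`RequestResult`, `meets_slas`, `effective_throughput`), the example SLA values p = 512 and g = 10 tokens per second, and the choice to count both prompt and generated tokens toward effective throughput are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical SLA targets (p and g); the project does not fix specific values.
PROMPT_SLA_TPS = 512.0   # p: target prompt-processing speed, tokens/sec
GEN_SLA_TPS = 10.0       # g: target generation speed, tokens/sec


@dataclass
class RequestResult:
    prompt_tokens: int          # number of tokens in the user's prompt
    generated_tokens: int       # number of tokens in the response
    time_to_first_token: float  # seconds from sending the prompt to the first token
    generation_time: float      # seconds spent producing the remaining tokens


def meets_slas(r: RequestResult) -> bool:
    """A request is successful only if it satisfies both SLAs."""
    prompt_ok = r.prompt_tokens / r.time_to_first_token >= PROMPT_SLA_TPS
    gen_ok = r.generated_tokens / r.generation_time >= GEN_SLA_TPS
    return prompt_ok and gen_ok


def effective_throughput(results: List[RequestResult], wall_clock_seconds: float) -> float:
    """Tokens per second, counting only requests that met both SLAs.

    Counting prompt + generated tokens is one possible accounting choice,
    assumed here for illustration.
    """
    effective_tokens = sum(
        r.prompt_tokens + r.generated_tokens
        for r in results
        if meets_slas(r)
    )
    return effective_tokens / wall_clock_seconds
```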