UltraEval-Audio is the world's first open-source framework to support evaluation of both speech understanding and speech generation, designed specifically for assessing large audio models. It integrates 34 authoritative benchmarks covering four major fields (speech, sound, medicine, and music), ten languages, and twelve task types. With UltraEval-Audio, you get unprecedented convenience and efficiency:
- One-Click Benchmark Management 📥: Say goodbye to tedious manual downloads and data processing. UltraEval-Audio automates all of this, allowing you to easily access the benchmark test data you need.
- Built-In Evaluation Tools ⚙️: No need to hunt for evaluation tools elsewhere. UltraEval-Audio ships with eight commonly used evaluation methods (e.g., WER, WER-ZH, BLEU, G-Eval), covering both rule-based and model-based metrics (a minimal WER sketch follows this list).
- Powerful and User-Friendly 🛠️: Supports preview testing, random sampling, error retries, and checkpoint resuming, ensuring a flexible and controllable evaluation process while improving efficiency and accuracy.
- Seamless Custom Dataset Integration 💼: Not only does it support public benchmarks, but it also provides robust custom dataset functionality, enabling quick application in various engineering scenarios.
- Easy Integration with Existing Systems 🔗: With excellent scalability and standardized design, UltraEval-Audio can seamlessly integrate even if you already have a well-established evaluation system, simplifying project management and delivering unified, standardized results.
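As a taste of the rule-based metrics, the sketch below shows how a WER score of the kind reported on the leaderboard can be computed. It uses the third-party `jiwer` package purely for illustration; UltraEval-Audio's built-in WER evaluator is self-contained and may normalize text differently.

```python
# Illustrative only: computing word error rate (WER) with the third-party
# `jiwer` package (pip install jiwer). UltraEval-Audio ships its own WER
# evaluator, which may apply different text normalization.
import jiwer

reference = "turn on the living room lights"   # ground-truth transcript
hypothesis = "turn on the living room light"   # model output

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.3f}")  # 1 substitution over 6 reference words -> 0.167
```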
Audio Understanding LLM: Speech + Text → Text
Audio Generation LLM: Speech → Speech
| Rank | Model | ASR | AST |
| --- | --- | --- | --- |
| 🏅 | MiniCPM-o 2.6 | 96 | 38 |
| 🥈 | Gemini-1.5-Pro | 94 | 35 |
| 🥉 | qwen2-audio-instruction | 94 | 31 |
| 4 | GPT-4o-Realtime | 92 | 26 |
| 5 | Gemini-1.5-Flash | 49 | 21 |
| 6 | Qwen-Audio-Chat | 3 | 12 |
For detailed performance metrics of audio LLMs, please refer to leaderboard.md
- [2025/01/13] Released v1.0.0
```bash
git clone https://github.com/OpenBMB/UltraEval-Audio.git
cd UltraEval-Audio
conda create -n audioeval python=3.10 -y
conda activate audioeval
pip install -r requirments.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
```bash
# Evaluate the gpt-4o-realtime model in audio-to-text mode
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset catdog --model gpt4o_audio

# Evaluate the gpt-4o-realtime model in audio-to-speech mode
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset llama-questions-s2t --model gpt4o_speech

# Alternatively, use gpt-4o-realtime via Azure
export AZURE_OPENAI_URL=$your-url
export AZURE_OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset catdog --model gpt4o_speech_ms

# Evaluate the gemini-1.5-pro model
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset catdog --model gemini-pro

# Evaluate the qwen2-audio offline model locally
pip install -r requirments-offline-model.txt
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat
```
If you encounter an error, first check the FAQ.
After the program finishes, the aggregate performance is printed to the console and detailed results are saved as below:
```
res/
|-- $time-$name-$dataset.jsonl
```
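Since the result file is JSONL, each line is one JSON record. A minimal sketch for inspecting one follows; the file name below is hypothetical, and the exact fields in each record depend on the dataset and model.

```python
# A minimal sketch for inspecting a result file. The file name is
# hypothetical, following the $time-$name-$dataset.jsonl pattern above.
import json

path = "res/20250113-gpt4o_audio-catdog.jsonl"
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records")
print(records[0])  # inspect the fields of the first record
```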
To run the evaluation script, use the following command:
```bash
python audio_evals/main.py --dataset <dataset_name> --model <model_name>
```
The `--dataset` parameter specifies which dataset to use for evaluation. The following options are available:
speech-chatbot-alpaca-eval
llama-questions
speech-web-questions
speech-triviaqa
tedlium-release1
tedlium-release2
tedlium-release3
catdog
audiocaps
covost2-en-ar
covost2-en-ca
covost2-en-cy
covost2-en-de
covost2-en-et
covost2-en-fa
covost2-en-id
covost2-en-ja
covost2-en-lv
covost2-en-mn
covost2-en-sl
covost2-en-sv
covost2-en-ta
covost2-en-tr
covost2-en-zh
covost2-zh-en
covost2-it-en
covost2-fr-en
covost2-es-en
covost2-de-en
GTZAN
TESS
nsynth
meld-emo
meld-sentiment
clotho-aqa
ravdess-emo
ravdess-gender
COVID-recognizer
respiratory-crackles
respiratory-wheezes
KeSpeech
audio-MNIST
librispeech-test-clean
librispeech-dev-clean
librispeech-test-other
librispeech-dev-other
mls_dutch
mls_french
mls_german
mls_italian
mls_polish
mls_portuguese
mls_spanish
heartbeat_sound
vocalsound
fleurs-zh
voxceleb1
voxceleb2
chord-recognition
wavcaps-audioset
wavcaps-freesound
wavcaps-soundbible
air-foundation
air-chat
desed
peoples-speech
WenetSpeech-test-meeting
WenetSpeech-test-net
gigaspeech
aishell-1
cv-15-en
cv-15-zh
cv-15-fr
cv-15-yue
| <dataset_name> | name | task | domain | metric |
| --- | --- | --- | --- | --- |
| tedlium-* | tedlium | ASR (Automatic Speech Recognition) | speech | WER |
| clotho-aqa | ClothoAQA | AQA (Audio Question Answering) | sound | acc |
| catdog | catdog | AQA | sound | acc |
| mls-* | multilingual_librispeech | ASR | speech | WER |
| KeSpeech | KeSpeech | ASR | speech | CER |
| librispeech-* | librispeech | ASR | speech | WER |
| fleurs-* | FLEURS | ASR | speech | WER |
| aishell-1 | AISHELL-1 | ASR | speech | WER |
| WenetSpeech-* | WenetSpeech | ASR | speech | WER |
| covost2-* | covost2 | STT (Speech-to-Text Translation) | speech | BLEU |
| GTZAN | GTZAN | MQA (Music Question Answering) | music | acc |
| TESS | TESS | EMO (emotion recognition) | speech | acc |
| nsynth | nsynth | MQA | music | acc |
| meld-emo | meld | EMO | speech | acc |
| meld-sentiment | meld | SEN (sentiment recognition) | speech | acc |
| ravdess-emo | ravdess | EMO | speech | acc |
| ravdess-gender | ravdess | GEND (gender recognition) | speech | acc |
| COVID-recognizer | COVID | MedicineCls | medicine | acc |
| respiratory-* | respiratory | MedicineCls | medicine | acc |
| audio-MNIST | audio-MNIST | AQA | speech | acc |
| heartbeat_sound | heartbeat | MedicineCls | medicine | acc |
| vocalsound | vocalsound | MedicineCls | medicine | acc |
| voxceleb* | voxceleb | GEND | speech | acc |
| chord-recognition | chord | MQA | music | acc |
| wavcaps-* | wavcaps | AC (Audio Captioning) | sound | acc |
| air-foundation | AIR-BENCH | AC, GEND, MQA, EMO | sound, music, speech | acc |
| air-chat | AIR-BENCH | AC, GEND, MQA, EMO | sound, music, speech | GPT4-score |
| desed | desed | AQA | sound | acc |
| peoples-speech | peoples-speech | ASR | speech | WER |
| gigaspeech | gigaspeech | ASR | speech | WER |
| cv-15-* | common voice 15 | ASR | speech | WER |
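For reference, the corpus-level BLEU metric used by the covost2-* datasets can be reproduced with the third-party `sacrebleu` package. This is an illustration only; the framework's built-in BLEU evaluator may use different tokenization settings.

```python
# Illustrative only: corpus-level BLEU with the third-party `sacrebleu`
# package (pip install sacrebleu). The framework's built-in evaluator may
# differ in tokenization and smoothing.
import sacrebleu

hypotheses = ["das haus ist klein", "ich mag katzen"]               # model outputs
references = [["das haus ist klein", "ich mag katzen und hunde"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```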
Evaluate your own dataset: see docs/how add a dataset.md.
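As a rough sketch of what preparing a custom dataset file can look like (the actual schema is defined in docs/how add a dataset.md; the field names "audio", "prompt", and "ref" below are assumptions for illustration only):

```python
# Hypothetical sketch: writing a JSONL file for a custom dataset. The real
# schema is documented in docs/how add a dataset.md; all field names here
# ("audio", "prompt", "ref") are assumptions.
import json

samples = [
    {"audio": "data/my_set/0001.wav", "prompt": "Transcribe the audio.", "ref": "hello world"},
    {"audio": "data/my_set/0002.wav", "prompt": "Transcribe the audio.", "ref": "good morning"},
]
with open("my_set.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```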
The `--model` parameter specifies which model to use for evaluation. The following options are available:
- `gpt4o_audio`: Use the gpt-4o-realtime-preview-2024-10-01 audio-to-text model.
- `gpt4o_speech`: Use the gpt-4o-realtime-preview-2024-10-01 audio-to-speech model.
- `gpt4o_audio_ms`: Use the gpt-4o-realtime-preview-2024-10-01 (on Azure) audio-to-text model.
- `gpt4o_speech_ms`: Use the gpt-4o-realtime-preview-2024-10-01 (on Azure) audio-to-speech model.
- `gemini-pro`: Use the Gemini Pro model.
- `gemini-1.5-pro`: Use the Gemini 1.5 Pro model.
- `gemini-1.5-flash`: Use the Gemini 1.5 Flash model.
- `gemini-2.0-flash-exp`: Use the Gemini 2.0 Flash model.
- `qwen-audio`: Use the qwen-audio-chat API model.
- `qwen2-audio-offline`: Use the Qwen2-Audio-7B offline model.
- `qwen2-audio-chat`: Use the Qwen2-Audio-7B-Instruct offline model.
- `qwen-audio-chat-offline`: Use the Qwen-Audio-Chat offline model.
- `qwen-audio-pretrain-offline`: Use the Qwen-Audio offline model.
- `ultravox`: Use the ultravox-v0_4 offline model.
Offline speech-to-speech models (e.g., glm4voice, mini-omni) are coming soon.
Evaluate your own model: see docs/how eval your model.md.
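At a high level, plugging in your own model means wrapping it behind a small inference interface. The sketch below is hypothetical (the class and method names are assumptions, not the framework's API); see docs/how eval your model.md for the actual base class and registration steps.

```python
# Hypothetical sketch of a custom model wrapper. The actual base class,
# method names, and registration mechanism are described in
# docs/how eval your model.md.
class MyAudioModel:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # load weights or API clients here

    def inference(self, prompt: str, audio_path: str) -> str:
        # Run your model on the (prompt, audio) pair and return text output.
        return "model response"
```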
Our registry design draws on the registry code in evals.
If you have any questions, suggestions, or feature requests for UltraEval-Audio, we encourage you to submit a GitHub Issue to help us collaboratively build an open and transparent UltraEval evaluation community. Alternatively, you can join our Discord: https://discord.gg/PHGy66QP.