Performance issue with Whisper in many aspects: latency, reproducibility, and more #1740
Comments
We are investigating internally.
@lionsheep24 Would you mind trying fp16 precision? It looks like you're using fp32 here. Also, what performance numbers (e.g. RTF, WER) did you get by running the official example https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py? On an A100, I expect you could finish decoding the HuggingFace audio test set in 8 seconds with fp16. After reporting the RTF number with the official whisper run.py, could you paste the logs (files like errs.txt, rtf.txt) produced by running your custom model with whisper/run.py? You may also try this env https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#quick-start to check what performance numbers you get. With that docker-compose file, we can match the environment exactly.
@yuekaizhang
@lionsheep24 We first need to make sure you can reproduce the official recipes' performance. Could you report what RTF and WER numbers you get after running examples/whisper/run.py?
Just remove the --fp32 options in your commands.
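For reference, the RTF (real-time factor) requested here is just total decoding time divided by total audio duration. A minimal bookkeeping sketch; the audio duration below is a placeholder value, not a measurement from this issue:

```python
# Minimal sketch of computing RTF (real-time factor) for a benchmark run.
# The values below are placeholders, not numbers reported in this issue.
import time

def real_time_factor(total_decode_seconds: float, total_audio_seconds: float) -> float:
    # RTF < 1.0 means faster than real time; RTF ~ 1.0 matches the slow case reported below.
    return total_decode_seconds / total_audio_seconds

start = time.perf_counter()
# ... run transcription over the whole test set here ...
elapsed = time.perf_counter() - start

total_audio_seconds = 300.0  # sum of the durations of all decoded clips
print(f"RTF = {real_time_factor(elapsed, total_audio_seconds):.3f}")
```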
With my model, should I just remove the fp32 options?
@yuekaizhang
Let me share my build script.
@lionsheep24 Our internal fix, which may be related to this issue, should sync to GitHub within a week. Alternatively, you could manually convert your model to fp16 first, e.g. model = model.half().
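A minimal sketch of that manual fp16 cast for a fine-tuned HuggingFace Whisper checkpoint, assuming the transformers package is installed; the output path is illustrative, not taken from this issue:

```python
# Sketch: cast a fine-tuned HuggingFace Whisper checkpoint to fp16 before
# converting and building the TensorRT-LLM engines. Paths are placeholders.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "/workspace/models/whisper-large-v2/2"  # fine-tuned checkpoint directory
)
model = model.half()  # convert all weights to fp16, as suggested above
model.save_pretrained("/workspace/models/whisper-large-v2-fp16")
# Tokenizer/processor files can be copied over from the original directory if needed.
```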
@yuekaizhang
P.S.: In my benchmark results, the tokens per second were higher for 5-second and 10-second audio inputs. Why doesn't the transcription speed scale linearly with the length of the input audio?
Where can I find the conversion script?
Please check "/TensorRT-LLM/examples/whisper/distil_whisper/convert_from_distil_whisper.py" |
@lionsheep24 For streaming purposes, what was your analysis in terms of approach and results? Won't 1-second audio chunks hurt the accuracy of the transcriptions (since it is recommended to use 30-second chunks)?
@yuekaizhang Why use fp16 instead of fp32? I see that TensorRT-LLM also supports Whisper with fp32: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/reference/precision.md
System Info
Who can help?
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I benchmarked TensorRT-LLM Whisper served by Triton (built with the newer trtllm-build command; the older version was built with python build.py), but it was slower than Hugging Face with flash attention and faster-whisper. The latency bottleneck was decoding, which took about 500–700 ms for 1 s of audio.
The transcription result was also incorrect and inconsistent, even with max_beam_width of 1. I remember the engine built with the older TensorRT-LLM version transcribed correctly.
After multiple tests, I tried to terminate tritonserver, but the error below was thrown.
Any help or advice would be appreciated!
My project is a combination of the official Whisper example, the TensorRT-LLM Python backend implementation, and the Triton client example.
I compiled my fine-tuned Hugging Face Whisper model with the following steps.
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2/2 --output_dir /workspace/models/whisper-openai --output_name large-v2
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2 --dtype float32 --logits_dtype float32
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float32 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 16 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float32 --bert_attention_plugin float32 --gpt_attention_plugin float32 --remove_input_padding disable
Expected behavior
Faster than Hugging Face and faster-whisper, with consistent CER performance.
Actual behavior
Slow inference (RTF was about 1.0), inconsistent transcription results, and an unstable server.
Additional notes
Let me share my Dockerfiles to reproduce this issue.