
Performance issue with whisper in many aspects: latency, reproducibility, and more #1740

Closed
3 of 4 tasks
lionsheep24 opened this issue Jun 5, 2024 · 12 comments
Assignees: hijkzzz
Labels: bug (Something isn't working), Investigating

Comments


lionsheep24 commented Jun 5, 2024

System Info

  • GPU : A100 (80G)
  • Driver : 550.54.15
  • CPU : x86-64
  • Docker base image : nvcr.io/nvidia/tritonserver:24.03-py3
  • tensorrt-llm version : 0.11.0.dev2024060400

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I benchmarked TensorRT-LLM Whisper served by Triton (built with the newer version, i.e. the trtllm-build command; the older version was built with python build.py), but it was slower than the flash-attention Hugging Face implementation and faster-whisper. The latency bottleneck was decoding, which took about 500~700 ms for 1 s of audio.

Also, the transcription result was incorrect and inconsistent, even with max_beam_width of 1. I remember that the engine built with the older TensorRT-LLM version transcribed well.

After multiple tests, I tried to terminate tritonserver, but the error below was thrown.
Any help or advice would be appreciated!

[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): an illegal memory access was encountered (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:168)
1       0x7f2a9666ae9a tensorrt_llm::runtime::MemoryPool<tensorrt_llm::runtime::PinnedAllocator>::~MemoryPool() + 282
2       0x7f2cd0819495 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45495) [0x7f2cd0819495]
3       0x7f2cd0819610 on_exit + 0
4       0x7f2cd07fdd97 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d97) [0x7f2cd07fdd97]
5       0x7f2cd07fde40 __libc_start_main + 128
6       0x560e37d701a5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x271a5) [0x560e37d701a5]

My project is a combination of the official whisper example, the TensorRT-LLM Python backend implementation, and the Triton client example.

I compiled my fine-tuned Hugging Face Whisper model with the following procedure.

  1. Convert the Hugging Face model to an OpenAI-format model: python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2/2 --output_dir /workspace/models/whisper-openai --output_name large-v2
  2. Convert the checkpoint to the TensorRT-LLM format: python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2 --dtype float32 --logits_dtype float32
  3. Build the TensorRT-LLM encoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float32 --remove_input_padding disable
  4. Build the decoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 16 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float32 --bert_attention_plugin float32 --gpt_attention_plugin float32 --remove_input_padding disable

Expected behavior

Faster inference than Hugging Face and faster-whisper, with consistent CER performance.

actual behavior

Slow inference (RTF was about 1.0), inconsistent transcription results, and an unstable server.
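
For reference, RTF here is the real-time factor (processing time divided by audio duration); the check below is a hypothetical sketch using the numbers reported above, not measured output.

# Real-time factor = processing time / audio duration; RTF >= 1 means slower than real time.
# Using the figures reported in this issue: ~500-700 ms of decoding for 1 s of audio.
audio_duration_s = 1.0
decode_latency_s = 0.7                            # upper end of the reported 500~700 ms
rtf = decode_latency_s / audio_duration_s
print(f"RTF from decoding alone ~= {rtf:.2f}")    # ~0.7; end to end it was ~1.0 as reported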

additional notes

Let me share my Dockerfiles to reproduce this issue.

  1. For model compilation
# Use the NVIDIA CUDA image with development tools and Ubuntu 22.04
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
#FROM nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
# Set the working directory
WORKDIR /workspace

# Environment variables for MPI
ENV MPI_HOME=/usr/local/mpi
ENV PATH="$MPI_HOME/bin:$PATH"
ENV LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Install necessary packages
RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# copy pip.conf
COPY .tmp/pip.conf /root/.config/pip/pip.conf
# copy cacert.pem
COPY .tmp/cacert.pem /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Inform Git about the CA bundle for certificate verification
RUN git config --global http.sslCAInfo /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Upgrade pip and install necessary Python packages
RUN pip install --upgrade pip setuptools wheel


# Clone the TensorRT-LLM repository
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git /workspace/TensorRT-LLM && \
    cd /workspace/TensorRT-LLM && \
    git checkout b777bd6
WORKDIR /workspace/TensorRT-LLM
#RUN pip install -r examples/whisper/requirements.txt

RUN pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400 tiktoken datasets kaldialign openai-whisper librosa soundfile safetensors transformers janus

# Setup Git LFS
RUN git lfs install

COPY models/whisper-large-v2 /workspace/models/whisper-large-v2
COPY ./assets /workspace/TensorRT-LLM/examples/whisper/assets

  2. For tritonserver
FROM nvcr.io/nvidia/tritonserver:24.03-py3

RUN apt update && apt-get install -y ffmpeg
RUN python3 -m pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400
RUN python3 -m pip install mpmath==1.3.0 gradio==3.50.2 tritonclient[all]


COPY stt_task/tensorrt_llm/triton/requirements.txt /workspace/requirements.txt
WORKDIR /workspace
RUN python3 -m pip install -r requirements.txt

# COPY model
COPY ./models/whisper_large_v2_tensorrt_llm /workspace/models/whisper-large-v2-tensorrt-llm/1/whisper-large-v2

# COPY src
COPY ./stt/triton/server /workspace/models/whisper-large-v2-tensorrt-llm/1
COPY ./config.pbtxt /workspace/models/whisper-large-v2-tensorrt-llm
COPY ./launch_server.sh /workspace/launch_server.sh
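
For context, the serving stack above combines the official whisper example, the TensorRT-LLM Python backend, and the Triton client example. A minimal, hypothetical client-side latency probe could look like the sketch below; the server URL, model name ("whisper"), tensor names (WAV, WAV_LENS, TEXT_PREFIX, TRANSCRIPTS), and prompt string are assumptions borrowed from the referenced Triton whisper client examples and may not match this deployment.

import time
import numpy as np
import tritonclient.grpc as grpcclient

# Hypothetical sketch: send one 1 s request to the Triton whisper model and time it end to end.
# Tensor and model names below are assumptions, not confirmed by this issue.
client = grpcclient.InferenceServerClient(url="localhost:8001")

wav = np.zeros((1, 16000), dtype=np.float32)        # placeholder: 1 s of 16 kHz audio
wav_len = np.array([[16000]], dtype=np.int32)       # number of samples
prefix = np.array([["<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"]], dtype=object)  # example decoder prompt; adjust the language token

inputs = [
    grpcclient.InferInput("WAV", wav.shape, "FP32"),
    grpcclient.InferInput("WAV_LENS", wav_len.shape, "INT32"),
    grpcclient.InferInput("TEXT_PREFIX", prefix.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(wav)
inputs[1].set_data_from_numpy(wav_len)
inputs[2].set_data_from_numpy(prefix)
outputs = [grpcclient.InferRequestedOutput("TRANSCRIPTS")]

start = time.perf_counter()
result = client.infer(model_name="whisper", inputs=inputs, outputs=outputs)
print("latency (s):", time.perf_counter() - start)
print(result.as_numpy("TRANSCRIPTS"))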
lionsheep24 added the bug (Something isn't working) label Jun 5, 2024
hijkzzz self-assigned this Jun 5, 2024

hijkzzz (Collaborator) commented Jun 5, 2024

We are investigating internally.

yuekaizhang commented Jun 6, 2024

@lionsheep24 Would you mind trying fp16 precision? It looks like you're using fp32 here.

Also, what performance numbers (e.g. RTF, WER) do you get by running the official example https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py? On an A100, I would expect you to finish decoding the Hugging Face audio test set in about 8 seconds with fp16.

After reporting the RTF number with the official whisper run.py, could you paste the logs (files like errs.txt, rtf.txt) from your custom model combined with whisper/run.py?

You may also try this environment https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#quick-start to check what performance numbers you get. With that docker-compose file, we can match the environment exactly.
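
Side note on the requested metric: WER is the word-level edit distance divided by the number of reference words. Below is a minimal, self-contained sketch of the computation, not the implementation run.py actually uses.

# Word error rate = (substitutions + insertions + deletions) / reference word count,
# computed with a plain dynamic-programming edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~= 0.167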

lionsheep24 (Author) commented

@yuekaizhang
Run convert_checkpoint with the fp16 argument, you mean? My audio sample is 1 s long and the audio is clear; however, no results were obtained.


yuekaizhang commented Jun 6, 2024

@lionsheep24 We first need to make sure that you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you get after running examples/whisper/run.py?

Run convert_checkpoint with fp16 argument, you mean?

Just remove the --fp32 options in your commands.

lionsheep24 (Author) commented

@yuekaizhang

@lionsheep24 We first need to make sure that you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you get after running examples/whisper/run.py?

With my model, after removing the fp32 options?


lionsheep24 commented Jun 7, 2024

@yuekaizhang
Trying fp16 precision throws an error during trtllm-build (encoder). I guess only models that are already fp16 (like large-v3) work. Can you clarify this issue?

[06/07/2024-04:13:17] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[06/07/2024-04:13:17] [TRT-LLM] [I] Set dtype to float16.
[06/07/2024-04:13:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 148, GPU 72666 (MiB)
[06/07/2024-04:13:22] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1939, GPU +348, now: CPU 2223, GPU 73014 (MiB)
[06/07/2024-04:13:22] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/07/2024-04:13:22] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set nccl_plugin to None.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/07/2024-04:13:22] [TRT] [E] 4: [convolutionNode.cpp::validateTypes::76] Error Code 4: Internal Error (WhisperEncoder/conv1/conv1d/CONVOLUTION_0: input and kernel weights must have same type)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 368, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 327, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 320, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 841, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1797, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
    return conv1d(input, self.weight.value,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3375, in conv1d
    output_2d = _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/conv1/conv1d/CONVOLUTION_0_output_0 has an invalid shape

Let me share my build script.

  1. python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2
  2. trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

yuekaizhang commented

@yuekaizhang Trying fp16 precision throws an error during trtllm-build (encoder). I guess only models that are already fp16 (like large-v3) work. Can you clarify this issue?

@lionsheep24 Our internal fix, which may be related to this issue, will be synced to GitHub within a week. Alternatively, you could manually convert your model to fp16 first, e.g. model = model.half().
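
A minimal sketch of that workaround, assuming the model is loaded with Hugging Face transformers via the line quoted later in this thread (the model path below is just the one used in the build steps above):

# Hypothetical sketch of the suggested workaround: cast the Hugging Face weights to fp16
# before converting, e.g. by appending .half() to the loading line in convert_from_distil_whisper.py.
from transformers import AutoModel

model_name = "/workspace/models/whisper-large-v2/2"      # path used earlier in this issue
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half()
print(next(model.parameters()).dtype)                    # torch.float16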

lionsheep24 (Author) commented

@yuekaizhang
As you said, simply adding .half() to model = AutoModel.from_pretrained(model_name, use_safetensors=True) solved the issue.
The WER problem was also fixed; the root cause was the language prompt.
Please refer to my 1 s audio benchmark (my use case is transcribing short audio for streaming):

Method          Latency (sec)
tensorrt-llm    0.21
faster-whisper  1.43
huggingface     1.7
openai          2.1

P.S.: In my benchmark results, tokens per second were higher for 5-second and 10-second audio inputs. Why doesn't transcription speed scale linearly with the length of the input audio?

@hijkzzz hijkzzz closed this as completed Jun 14, 2024
Skywalker-Harrison commented

@yuekaizhang As you said, simply adding .half() to model = AutoModel.from_pretrained(model_name, use_safetensors=True) solved the issue.

Where can I find model = AutoModel.from_pretrained(model_name, use_safetensors=True)?


Jeevi10 commented Aug 15, 2024

where can I find model = AutoModel.from_pretrained(model_name, use_safetensors=True)?

Please check /TensorRT-LLM/examples/whisper/distil_whisper/convert_from_distil_whisper.py, line 59.

haiderasad commented

@lionsheep24 For streaming purposes, what was your analysis in terms of approach and results? Won't 1-second audio chunks hurt transcription accuracy (since it is recommended to use 30-second chunks)?


cuongkn commented Jan 23, 2025

@lionsheep24 Would you mind trying fp16 precision? It looks like you're using fp32 here.

@yuekaizhang Why use fp16 instead of fp32? I see that TensorRT-LLM also supports Whisper with fp32: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/reference/precision.md
