
Performance issue with whisper in many aspects: latency, reproducibility, and more #1740

Closed
3 of 4 tasks
lionsheep24 opened this issue Jun 5, 2024 · 12 comments
Assignees: hijkzzz
Labels: bug (Something isn't working), Investigating

Comments


lionsheep24 commented Jun 5, 2024

System Info

  • GPU : A100 (80G)
  • Driver : 550.54.15
  • CPU : x86-64
  • Docker base image : nvcr.io/nvidia/tritonserver:24.03-py3
  • tensorrt-llm version : 0.11.0.dev2024060400

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I benchmarked TensorRT-LLM Whisper served by Triton (built with the newer version, i.e. the trtllm-build command; the older version was built with python build.py), but it was slower than the flash-attention Hugging Face implementation and faster-whisper. The latency bottleneck was decoding, which took about 500~700 ms for 1 s of audio.

Also, the transcription result was incorrect and inconsistent, even with max_beam_width of 1. I remember that the engine built with the older TensorRT-LLM version transcribed well.

After multiple tests, I tried to terminate tritonserver, but the error below was thrown.
Any help or advice would be appreciated!

[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): an illegal memory access was encountered (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:168)
1       0x7f2a9666ae9a tensorrt_llm::runtime::MemoryPool<tensorrt_llm::runtime::PinnedAllocator>::~MemoryPool() + 282
2       0x7f2cd0819495 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45495) [0x7f2cd0819495]
3       0x7f2cd0819610 on_exit + 0
4       0x7f2cd07fdd97 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d97) [0x7f2cd07fdd97]
5       0x7f2cd07fde40 __libc_start_main + 128
6       0x560e37d701a5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x271a5) [0x560e37d701a5]

My project is a combination of the official whisper example, the TensorRT-LLM Python backend implementation, and the Triton client example.

I compiled my fine-tuned Hugging Face Whisper model with the following procedure.

  1. Convert the Hugging Face model to an OpenAI-format model: python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2/2 --output_dir /workspace/models/whisper-openai --output_name large-v2
  2. Convert the checkpoint to the TensorRT-LLM format: python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2 --dtype float32 --logits_dtype float32
  3. Build the TensorRT-LLM encoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float32 --remove_input_padding disable
  4. Build the decoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 16 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float32 --bert_attention_plugin float32 --gpt_attention_plugin float32 --remove_input_padding disable

Expected behavior

Faster inference than Hugging Face and faster-whisper, with consistent CER performance.

actual behavior

Slow inference (RTF was about 1.0), inconsistent transcription results, and an unstable server.
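
For reference, RTF here is the real-time factor (processing time divided by audio duration); the check below is a hypothetical sketch using the numbers reported above, not measured output.

# Real-time factor = processing time / audio duration; RTF >= 1 means slower than real time.
# Using the figures reported in this issue: ~500-700 ms of decoding for 1 s of audio.
audio_duration_s = 1.0
decode_latency_s = 0.7                            # upper end of the reported 500~700 ms
rtf = decode_latency_s / audio_duration_s
print(f"RTF from decoding alone ~= {rtf:.2f}")    # ~0.7; end to end it was ~1.0 as reported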

additional notes

Let me share my Dockerfiles to reproduce this issue.

  1. For model compilation
# Use the NVIDIA CUDA image with development tools and Ubuntu 22.04
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
#FROM nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
# Set the working directory
WORKDIR /workspace

# Environment variables for MPI
ENV MPI_HOME=/usr/local/mpi
ENV PATH="$MPI_HOME/bin:$PATH"
ENV LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Install necessary packages
RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# copy pip.conf
COPY .tmp/pip.conf /root/.config/pip/pip.conf
# copy cacert.pem
COPY .tmp/cacert.pem /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Inform Git about the CA bundle for certificate verification
RUN git config --global http.sslCAInfo /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Upgrade pip and install necessary Python packages
RUN pip install --upgrade pip setuptools wheel


# Clone the TensorRT-LLM repository
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git /workspace/TensorRT-LLM && \
    cd /workspace/TensorRT-LLM && \
    git checkout b777bd6
WORKDIR /workspace/TensorRT-LLM
#RUN pip install -r examples/whisper/requirements.txt

RUN pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400 tiktoken datasets kaldialign openai-whisper librosa soundfile safetensors transformers janus

# Setup Git LFS
RUN git lfs install

COPY models/whisper-large-v2 /workspace/models/whisper-large-v2
COPY ./assets /workspace/TensorRT-LLM/examples/whisper/assets

  2. For tritonserver
FROM nvcr.io/nvidia/tritonserver:24.03-py3

RUN apt update && apt-get install -y ffmpeg
RUN python3 -m pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400
RUN python3 -m pip install mpmath==1.3.0 gradio==3.50.2 tritonclient[all]


COPY stt_task/tensorrt_llm/triton/requirements.txt /workspace/requirements.txt
WORKDIR /workspace
RUN python3 -m pip install -r requirements.txt

# COPY model
COPY ./models/whisper_large_v2_tensorrt_llm /workspace/models/whisper-large-v2-tensorrt-llm/1/whisper-large-v2

# COPY src
COPY ./stt/triton/server /workspace/models/whisper-large-v2-tensorrt-llm/1
COPY ./config.pbtxt /workspace/models/whisper-large-v2-tensorrt-llm
COPY ./launch_server.sh /workspace/launch_server.sh
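
For context, the serving stack above combines the official whisper example, the TensorRT-LLM Python backend, and the Triton client example. A minimal, hypothetical client-side latency probe could look like the sketch below; the server URL, model name ("whisper"), tensor names (WAV, WAV_LENS, TEXT_PREFIX, TRANSCRIPTS), and prompt string are assumptions borrowed from the referenced Triton whisper client examples and may not match this deployment.

import time
import numpy as np
import tritonclient.grpc as grpcclient

# Hypothetical sketch: send one 1 s request to the Triton whisper model and time it end to end.
# Tensor and model names below are assumptions, not confirmed by this issue.
client = grpcclient.InferenceServerClient(url="localhost:8001")

wav = np.zeros((1, 16000), dtype=np.float32)        # placeholder: 1 s of 16 kHz audio
wav_len = np.array([[16000]], dtype=np.int32)       # number of samples
prefix = np.array([["<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"]], dtype=object)  # example decoder prompt; adjust the language token

inputs = [
    grpcclient.InferInput("WAV", wav.shape, "FP32"),
    grpcclient.InferInput("WAV_LENS", wav_len.shape, "INT32"),
    grpcclient.InferInput("TEXT_PREFIX", prefix.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(wav)
inputs[1].set_data_from_numpy(wav_len)
inputs[2].set_data_from_numpy(prefix)
outputs = [grpcclient.InferRequestedOutput("TRANSCRIPTS")]

start = time.perf_counter()
result = client.infer(model_name="whisper", inputs=inputs, outputs=outputs)
print("latency (s):", time.perf_counter() - start)
print(result.as_numpy("TRANSCRIPTS"))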
lionsheep24 added the bug (Something isn't working) label Jun 5, 2024
hijkzzz self-assigned this Jun 5, 2024

hijkzzz (Collaborator) commented Jun 5, 2024

We are investigating internally.

yuekaizhang commented Jun 6, 2024

@lionsheep24 Would you mind trying fp16 precision? It looks like you're using fp32 here.

Also, what performance numbers (e.g. RTF, WER) do you get by running the official example https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py? On an A100, I would expect you to finish decoding the Hugging Face audio test set in about 8 seconds with fp16.

After reporting the RTF number with the official whisper run.py, could you paste the logs (files like errs.txt, rtf.txt) from your custom model combined with whisper/run.py?

You may also try this environment https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#quick-start to check what performance numbers you get. With that docker-compose file, we can match the environment exactly.
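
Side note on the requested metric: WER is the word-level edit distance divided by the number of reference words. Below is a minimal, self-contained sketch of the computation, not the implementation run.py actually uses.

# Word error rate = (substitutions + insertions + deletions) / reference word count,
# computed with a plain dynamic-programming edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~= 0.167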

lionsheep24 (Author) commented

@yuekaizhang
Run convert_checkpoint with the fp16 argument, you mean? My audio sample is 1 s long and the audio is clear; however, no results were obtained.


yuekaizhang commented Jun 6, 2024

@lionsheep24 We first need to make sure that you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you get after running examples/whisper/run.py?

Run convert_checkpoint with fp16 argument, you mean?

Just remove the --fp32 options in your commands.

lionsheep24 (Author) commented

@yuekaizhang

@lionsheep24 We first need to make sure that you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you get after running examples/whisper/run.py?

With my model, after removing the fp32 options?


lionsheep24 commented Jun 7, 2024

@yuekaizhang
Trying fp16 precision throws an error during trtllm-build (encoder). I guess only models that are already fp16 (like large-v3) work. Can you clarify this issue?

[06/07/2024-04:13:17] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[06/07/2024-04:13:17] [TRT-LLM] [I] Set dtype to float16.
[06/07/2024-04:13:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 148, GPU 72666 (MiB)
[06/07/2024-04:13:22] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1939, GPU +348, now: CPU 2223, GPU 73014 (MiB)
[06/07/2024-04:13:22] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/07/2024-04:13:22] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set nccl_plugin to None.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/07/2024-04:13:22] [TRT] [E] 4: [convolutionNode.cpp::validateTypes::76] Error Code 4: Internal Error (WhisperEncoder/conv1/conv1d/CONVOLUTION_0: input and kernel weights must have same type)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 368, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 327, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 320, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 841, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1797, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
    return conv1d(input, self.weight.value,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3375, in conv1d
    output_2d = _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/conv1/conv1d/CONVOLUTION_0_output_0 has an invalid shape

Let me share my build script.

  1. python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2
  2. trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

yuekaizhang commented

@yuekaizhang Trying fp16 precision throws an error during trtllm-build (encoder). I guess only models that are already fp16 (like large-v3) work. Can you clarify this issue?

@lionsheep24 Our internal fix, which may be related to this issue, will be synced to GitHub within a week. Alternatively, you could manually convert your model to fp16 first, e.g. model = model.half().
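
A minimal sketch of that workaround, assuming the model is loaded with Hugging Face transformers via the line quoted later in this thread (the model path below is just the one used in the build steps above):

# Hypothetical sketch of the suggested workaround: cast the Hugging Face weights to fp16
# before converting, e.g. by appending .half() to the loading line in convert_from_distil_whisper.py.
from transformers import AutoModel

model_name = "/workspace/models/whisper-large-v2/2"      # path used earlier in this issue
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half()
print(next(model.parameters()).dtype)                    # torch.float16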

lionsheep24 (Author) commented

@yuekaizhang
As you said, simply adding .half() to model = AutoModel.from_pretrained(model_name, use_safetensors=True) solved the issue.
The WER problem was also fixed; the root cause was the language prompt.
Please refer to my 1 s audio benchmark (my use case is transcribing short audio for streaming):

Method          Latency (sec)
tensorrt-llm    0.21
faster-whisper  1.43
huggingface     1.7
openai          2.1

P.S.: In my benchmark results, tokens per second were higher for 5-second and 10-second audio inputs. Why doesn't transcription speed scale linearly with the length of the input audio?

@hijkzzz hijkzzz closed this as completed Jun 14, 2024
Skywalker-Harrison commented

@yuekaizhang As you said, simply adding .half() to model = AutoModel.from_pretrained(model_name, use_safetensors=True) solved the issue.

Where can I find model = AutoModel.from_pretrained(model_name, use_safetensors=True)?


Jeevi10 commented Aug 15, 2024

where can I find model = AutoModel.from_pretrained(model_name, use_safetensors=True)?

Please check /TensorRT-LLM/examples/whisper/distil_whisper/convert_from_distil_whisper.py, line 59.

haiderasad commented

@lionsheep24 For streaming purposes, what was your analysis in terms of approach and results? Won't 1-second audio chunks hurt transcription accuracy (since it is recommended to use 30-second chunks)?


cuongkn commented Jan 23, 2025

@lionsheep24 Would you mind trying fp16 precision? It looks like you're using fp32 here.

@yuekaizhang Why use fp16 instead of fp32? I see that TensorRT-LLM also supports Whisper with fp32: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/reference/precision.md
