Failed TensorRT-LLM Benchmark #2694

Open
1 of 4 tasks
maulikmadhavi opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

maulikmadhavi commented Jan 15, 2025

System Info

  • CPU architecture: x86_64 (Linux node 6.5.0-25-generic)
  • CPU/Host memory size: 503GiB
  • GPU properties
    • GPU name: H100
    • GPU memory size: 80GB
  • Libraries
    • TensorRT-LLM branch or tag: v0.16.0
    • Container used:
  • NVIDIA driver version: 535.161.07
  • OS: Ubuntu 22.04

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps from the performance benchmarking link:

  1. Generate the synthetic dataset:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
  2. Build the engine:
trtllm-bench --model meta-llama/Llama-2-7b-hf build --dataset /tmp/synthetic_128_128.txt --quantization FP8
  3. Run the benchmark:
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
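
As a quick sanity check on step 1 (an editor's suggestion, not part of the original steps): prepare_dataset.py with --stdout emits the dataset as JSON Lines, one request per line, so the file should contain as many lines as --num-requests:

wc -l /tmp/synthetic_128_128.txt   # expect 3000 lines, one per request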

Expected behavior

Running the container via make => make -C docker release_run completes successfully:

[01/15/2025-01:15:18] [TRT-LLM] [I] Stopping response parsing.
[01/15/2025-01:15:18] [TRT-LLM] [I] Collecting last responses before shutdown.
[01/15/2025-01:15:18] [TRT-LLM] [I] Completed request parsing.
[01/15/2025-01:15:18] [TRT-LLM] [I] Parsing stopped.
[01/15/2025-01:15:18] [TRT-LLM] [I] Request generator successfully joined.
[01/15/2025-01:15:18] [TRT-LLM] [I] Statistics process successfully joined.
[01/15/2025-01:15:18] [TRT-LLM] [I]

===========================================================
= ENGINE DETAILS
===========================================================
Model:                  meta-llama/Llama-2-7b-hf
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
TensorRT-LLM Version:   0.16.0
Dtype:                  float16
KV Cache Dtype:         FP8
Quantization:           FP8
Max Sequence Length:    256

===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
Max Runtime Batch Size: 1280
Max Runtime Tokens:     2304
Scheduling Policy:      Guaranteed No Evict
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   2.8149E+13

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests:             3000
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec):  12067.8672
Request Throughput (req/sec):   94.2802
Total Latency (ms):             31820.0387

===========================================================

Actual behavior

Running the container directly via docker run => docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 -v <path-to-TensorRT-LLM/>:/app/ 89fg611dcfd fails:

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-10:23:14] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/15/2025-10:23:14] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/15/2025-10:23:14] [TRT-LLM] [I] Ready to start benchmark.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

The process then hangs here for quite a long time; upon Ctrl+C:

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------
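
The executorWorker error above indicates that the dynamic linker inside the directly-launched container cannot resolve libnvinfer_plugin_tensorrt_llm.so. A possible diagnostic, as a sketch under assumptions (the dist-packages prefix is taken from the error message; the libs subdirectory reflects the usual tensorrt_llm pip layout and is not confirmed in this issue):

# Locate the plugin library shipped with the tensorrt_llm package (path assumed)
find /usr/local/lib/python3.12/dist-packages/tensorrt_llm -name 'libnvinfer_plugin_tensorrt_llm.so*'
# If it is found, expose its directory to the dynamic linker and rerun the benchmark
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs:$LD_LIBRARY_PATH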

Additional notes

=> The port and the TensorRT-LLM directory are mapped into the container to save time and avoid repeated HF model downloads; a fuller mount example follows below.
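
One way to persist the Hugging Face cache across container runs as well (a sketch; the host cache path and the in-container location /root/.cache/huggingface are assumptions about this image, adjust to your setup):

docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=device=1 -it -p 8000:8000 \
  -v <path-to-TensorRT-LLM/>:/app/ \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  89fg611dcfd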

Thanks
