Failed TensorRT-LLM Benchmark #2694

Open
1 of 4 tasks
maulikmadhavi opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

maulikmadhavi commented Jan 15, 2025

System Info

  • CPU architecture: x86_64 (Linux node 6.5.0-25-generic)
  • CPU/Host memory size: 503GiB
  • GPU properties
    • GPU name: H100
    • GPU memory size: 80GB
  • Libraries
    • TensorRT-LLM branch or tag: v0.16.0
    • Container used:
  • NVIDIA driver version: 535.161.07
  • OS: Ubuntu 22.04

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps from the performance benchmarking link:

  1. Generate the synthetic dataset:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
  2. Build the engine:
trtllm-bench --model meta-llama/Llama-2-7b-hf build --dataset /tmp/synthetic_128_128.txt --quantization FP8
  3. Run the benchmark:
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
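
As a quick sanity check on step 1 (an editor's suggestion, not part of the original steps): prepare_dataset.py with --stdout emits the dataset as JSON Lines, one request per line, so the file should contain as many lines as --num-requests:

wc -l /tmp/synthetic_128_128.txt   # expect 3000 lines, one per request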

Expected behavior

Running the container via make => make -C docker release_run completes successfully:

[01/15/2025-01:15:18] [TRT-LLM] [I] Stopping response parsing.
[01/15/2025-01:15:18] [TRT-LLM] [I] Collecting last responses before shutdown.
[01/15/2025-01:15:18] [TRT-LLM] [I] Completed request parsing.
[01/15/2025-01:15:18] [TRT-LLM] [I] Parsing stopped.
[01/15/2025-01:15:18] [TRT-LLM] [I] Request generator successfully joined.
[01/15/2025-01:15:18] [TRT-LLM] [I] Statistics process successfully joined.
[01/15/2025-01:15:18] [TRT-LLM] [I]

===========================================================
= ENGINE DETAILS
===========================================================
Model:                  meta-llama/Llama-2-7b-hf
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
TensorRT-LLM Version:   0.16.0
Dtype:                  float16
KV Cache Dtype:         FP8
Quantization:           FP8
Max Sequence Length:    256

===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
Max Runtime Batch Size: 1280
Max Runtime Tokens:     2304
Scheduling Policy:      Guaranteed No Evict
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   2.8149E+13

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests:             3000
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec):  12067.8672
Request Throughput (req/sec):   94.2802
Total Latency (ms):             31820.0387

===========================================================

Actual behavior

Running the container directly via docker run => docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 -v <path-to-TensorRT-LLM/>:/app/ 89fg611dcfd fails:

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-10:23:14] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/15/2025-10:23:14] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/15/2025-10:23:14] [TRT-LLM] [I] Ready to start benchmark.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

The process then hangs here for quite a long time; upon Ctrl+C:

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------
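
The executorWorker error above indicates that the dynamic linker inside the directly-launched container cannot resolve libnvinfer_plugin_tensorrt_llm.so. A possible diagnostic, as a sketch under assumptions (the dist-packages prefix is taken from the error message; the libs subdirectory reflects the usual tensorrt_llm pip layout and is not confirmed in this issue):

# Locate the plugin library shipped with the tensorrt_llm package (path assumed)
find /usr/local/lib/python3.12/dist-packages/tensorrt_llm -name 'libnvinfer_plugin_tensorrt_llm.so*'
# If it is found, expose its directory to the dynamic linker and rerun the benchmark
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs:$LD_LIBRARY_PATH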

Additional notes

=> The port and the TensorRT-LLM directory are mapped into the container to save time and avoid repeated HF model downloads; a fuller mount example follows below.
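
One way to persist the Hugging Face cache across container runs as well (a sketch; the host cache path and the in-container location /root/.cache/huggingface are assumptions about this image, adjust to your setup):

docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=device=1 -it -p 8000:8000 \
  -v <path-to-TensorRT-LLM/>:/app/ \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  89fg611dcfd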

Thanks
