Description
I'm experiencing an input length limitation issue when running the Qwen2.5-7B-Instruct model with TensorRT-LLM on a single H100 GPU. Although the model natively supports a 32k context window, the runtime throws an error as soon as the input length exceeds 8192 tokens.
Environment
Model: Qwen2.5-7B-Instruct
Framework: TensorRT-LLM (version 0.16.0)
Hardware: single H100 GPU
Maximum supported model context length: 32k tokens
Observed input length limit: 8192 tokens
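Reproduction
The sketch below is a hypothetical minimal reproduction (the actual evaluation harness and prompts are omitted); any prompt that tokenizes to more than 8192 tokens triggers the error shown in the log that follows.

```python
# Hypothetical minimal repro (not the actual evaluation script): submit one long
# prompt through the TensorRT-LLM LLM API. The request is rejected once the
# tokenized prompt exceeds 8192 tokens, even though the model supports 32k.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # engine built with default build settings

long_prompt = "some long document text ... " * 3000  # assumption: tokenizes to well over 8192 tokens

outputs = llm.generate([long_prompt], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```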
Error Log
Loading Model: [1/2] Loading HF model to memory
230it [00:09, 25.32it/s]
Time: 9.662s
Loading Model: [2/2] Building TRT-LLM engine
Time: 112.563s
Loading model done.
Total latency: 122.225s
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 14553 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1272.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14541 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.66 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.77 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 53.90 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14192
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 48.51 GiB for max tokens in paged KV cache (908288).
Evaluating responses...
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Prompt length (12438) exceeds maximum input length (8192). Set log level to info and check TRTGptModel logs for how maximum input length is set (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:482)
1 0x7f2312258d0d /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x868d0d) [0x7f2312258d0d]
2 0x7f23145b722d tensorrt_llm::executor::Executor::Impl::executionLoop() + 1021
3 0x7f271ffee5c0 /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/torch/lib/libtorch.so(+0x145c0) [0x7f271ffee5c0]
4 0x7f272e00bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f272e00bac3]
5 0x7f272e09da40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f272e09da40]
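Analysis
From the TRTGptModel log above, maxSequenceLen is 32768 but maxNumTokens defaults to 8192, and with context FMHA and packed input enabled the runtime sets maxInputLen = min(maxSequenceLen - 1, maxNumTokens) = 8192. Below is a sketch of a possible fix, assuming the engine is rebuilt with a larger max_num_tokens through the LLM API's BuildConfig; the values are illustrative, not tuned for H100 memory.

```python
# Sketch: rebuild the engine with a larger token budget so that
# maxInputLen = min(maxSequenceLen - 1, maxNumTokens) is no longer capped at 8192.
# Assumption: the engine is built via the TensorRT-LLM LLM API; values are illustrative.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_input_len=32768,   # accept prompts up to the model's 32k context
    max_seq_len=32768,     # prompt + generated tokens
    max_num_tokens=32768,  # raises the per-batch token budget that capped maxInputLen
    max_batch_size=8,      # smaller batch to keep the paged KV cache within one H100
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", build_config=build_config)
```

If the engine is built with the trtllm-build CLI instead, the equivalent knobs appear to be --max_num_tokens, --max_input_len and --max_seq_len. Is rebuilding with a larger max_num_tokens the intended way to lift this limit, or is there a runtime setting I am missing?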