System Info

I am trying to run an AWQ-quantized Qwen2-7B-Instruct model in a Kubernetes environment. The GPU is a single T4 (16 GB VRAM).
I see that it is unable to use Flash Attention v2 on the T4 and falls back to FA v1:
2024-07-03T12:58:57.481034Z WARN lorax_launcher: flash_attn.py:111 Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
But it then fails with the following error. I believe this error arises when the CUDA kernels are not compiled for the target architecture, but I am not sure where the incompatibility lies here.
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T12:59:01.743978Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
I am running the latest container, ghcr.io/predibase/lorax:main (as of today, 3 July 2024).
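To narrow down whether the incompatibility is in the base PyTorch build or in the separately compiled AWQ kernels, a quick check inside the container is something like the following (a minimal sketch; torch.cuda.get_arch_list() only reports the architectures the PyTorch wheel itself was built for, not those of the awq_inference_engine extension):

# Compute capability of the attached GPU; a T4 should report (7, 5)
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# CUDA architectures the installed PyTorch wheel was compiled for; look for 'sm_75'
python -c "import torch; print(torch.cuda.get_arch_list())"

If sm_75 is present for PyTorch but the awq_inference_engine extension was built without it, that alone would produce the no-kernel-image error seen in the traceback below.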
root@lorax-llm-mistal-7b-744b9dfc47-zbfw8:/usr/src# CUDA_LAUNCH_BLOCKING=1
root@lorax-llm-mistal-7b-744b9dfc47-zbfw8:/usr/src# lorax-launcher --model-id Qwen/Qwen2-7B-Instruct-AWQ --quantize awq --max-concurrent-requests 5 --max-batch-prefill-tokens 1024
2024-07-03T12:58:50.820101Z INFO lorax_launcher: Args { model_id: "Qwen/Qwen2-7B-Instruct-AWQ", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, embedding_model: None, num_shard: None, quantize: Some(Awq), compile: false, speculative_tokens: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 5, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 1024, max_batch_total_tokens: None, max_waiting_tokens: 20, eager_prefill: None, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "lorax-llm-mistal-7b-744b9dfc47-zbfw8", port: 8000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-07-03T12:58:50.820260Z INFO download: lorax_launcher: Starting download process.
2024-07-03T12:58:53.033927Z INFO lorax_launcher: weights.py:448 Files are already present on the host. Skipping download.
2024-07-03T12:58:53.522905Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-07-03T12:58:53.523144Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-07-03T12:58:57.481034Z WARN lorax_launcher: flash_attn.py:111 Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2024-07-03T12:59:00.386449Z INFO lorax_launcher: server.py:314 SGMV kernel is enabled, multi-LoRA inference will be fast!
2024-07-03T12:59:00.386539Z INFO lorax_launcher: server.py:318 Server started at unix:///tmp/lorax-server-0
2024-07-03T12:59:00.431056Z INFO shard-manager: lorax_launcher: Shard ready in 6.907243348s rank=0
2024-07-03T12:59:00.531117Z INFO lorax_launcher: Starting Webserver
2024-07-03T12:59:00.537253Z INFO lorax_router: router/src/main.rs:208: Loading tokenizer Qwen/Qwen2-7B-Instruct-AWQ
2024-07-03T12:59:00.537283Z INFO lorax_router: router/src/main.rs:228: Using the Hugging Face API
2024-07-03T12:59:00.537300Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-03T12:59:00.864128Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '151643' but was given ID 'None'
2024-07-03T12:59:00.864153Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_start|>' was expected to have ID '151644' but was given ID 'None'
2024-07-03T12:59:00.864157Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_end|>' was expected to have ID '151645' but was given ID 'None'
2024-07-03T12:59:00.866738Z WARN lorax_router: router/src/main.rs:451: `--revision` is not set
2024-07-03T12:59:00.866755Z WARN lorax_router: router/src/main.rs:452: We strongly advise to set it to a known supported commit.
2024-07-03T12:59:01.038552Z INFO lorax_router: router/src/main.rs:473: Serving revision 94e886385b1e3826eafc05e13e6d4a9d803da1d7 of model Qwen/Qwen2-7B-Instruct-AWQ
2024-07-03T12:59:01.049746Z INFO lorax_router: router/src/main.rs:302: Warming up model
2024-07-03T12:59:01.091959Z INFO lorax_launcher: flash_causal_lm.py:769 Warming up to max_new_tokens: 1024
2024-07-03T12:59:01.743383Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 326, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 77, in Warmup
max_supported_total_tokens = self.model.warmup(batch, request.max_new_tokens)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 773, in warmup
_, batch = self.generate_token(batch, is_warmup=True)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 962, in generate_token
out, speculative_logits = self._try_generate_token(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 919, in _try_generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 916, in _try_generate_token
return self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 895, in forward
logits = model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen2_modeling.py", line 488, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen2_modeling.py", line 427, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen2_modeling.py", line 358, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen2_modeling.py", line 214, in forward
qkv = self.query_key_value(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 180, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/layers/tensor_parallel.py", line 15, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/awq/awq.py", line 39, in forward
out = awq_inference_engine.gemm_forward_cuda(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T12:59:01.743978Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-07-03T12:59:01.832926Z ERROR lorax_launcher: Webserver Crashed
2024-07-03T12:59:01.832950Z INFO lorax_launcher: Shutting down shards
2024-07-03T12:59:02.004831Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
Just running the launcher with either Mistral or Qwen2 7B AWQ-quantized models.
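For a minimal reproduction outside Kubernetes, the equivalent Docker invocation should look roughly like the following (a sketch only: the volume path and port mapping are illustrative, and the image entrypoint is assumed to be the launcher, matching the command shown in the logs above):

# Illustrative: mirrors the lorax-launcher flags from the log; adjust the host cache path as needed
docker run --gpus all --shm-size 1g -p 8000:8000 -v $PWD/data:/data \
  ghcr.io/predibase/lorax:main \
  --model-id Qwen/Qwen2-7B-Instruct-AWQ --quantize awq \
  --max-concurrent-requests 5 --max-batch-prefill-tokens 1024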
Expected behavior
Given that PR #480 is merged, I was expecting this to run with Flash Attention v1 on the T4.