Hi @XA23i, the first command relies on the kernels selected/generated by TensorRT, while the latter command depends on the SmoothQuant plugin.
Is it possible to use quantize.py to generate config.json and manually modify the quantization-related fields afterward?
===============================
I don't think so, since config.json reflects the contents of the safetensors file; we can't modify one while keeping the other unchanged.
Maybe you can describe the specific scenario, and we can figure out the best way to solve the problem.
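For context, here is a toy sketch of the kind of quantization-related fields that end up in config.json (the field names and values are assumptions modeled on typical TensorRT-LLM checkpoints, not the exact schema). The point is that these fields describe scales and algorithms already baked into the safetensors tensors, so editing them by hand would desynchronize the two files:

```python
import json

# Hypothetical config.json fragment for a SmoothQuant checkpoint.
# Field names/values are illustrative assumptions, not the exact
# TensorRT-LLM schema -- check the config.json your conversion emits.
config = {
    "dtype": "float16",
    "quantization": {
        "quant_algo": "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN",
        "kv_cache_quant_algo": "INT8",
    },
}
# These fields must agree with the scaling factors stored in the
# safetensors file, which is why hand-editing them is unsafe.
print(json.dumps(config, indent=2))
```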
If I use the first command, what are the default settings for SmoothQuant, particularly the SmoothQuant ratio? There seems to be no description of this in config.json.
I've noticed that I can apply SmoothQuant to models using the command:
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH
in quantize.py. Additionally, I can also achieve this by running:
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_sq \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel
It seems that the latter approach is more flexible since I can adjust parameters like the SmoothQuant ratio, per_token, and other options.
Does the first command offer broader compatibility, while the latter is restricted to models that have a corresponding convert_checkpoint.py? In other words, when a model has its own convert_checkpoint.py, should I prefer using that first?
Furthermore, I noticed that both commands generate safetensors and a config.json. Is it possible to use quantize.py to generate config.json and manually modify the quantization-related fields afterward?
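For reference, the `--smoothquant 0.5` ratio can be illustrated with a minimal NumPy sketch of the SmoothQuant idea: per-input-channel scales migrate activation outliers into the weights while leaving the matmul result unchanged. This is a simplified illustration of the technique, not the actual convert_checkpoint.py implementation:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Rebalance activation/weight ranges with per-channel scales.

    alpha plays the role of the SmoothQuant ratio: alpha=1 pushes all
    outlier magnitude into the weights, alpha=0 leaves it in the
    activations, 0.5 splits the difficulty between the two.
    """
    act_max = np.abs(X).max(axis=0)          # per-channel activation range
    w_max = np.abs(W).max(axis=1)            # per-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)  # smoothing scale per channel
    return X / s, W * s[:, None]             # product X @ W is preserved

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                              # simulate an outlier channel
W = rng.normal(size=(8, 16))
Xs, Ws = smooth(X, W, alpha=0.5)
# (X / s) @ (diag(s) @ W) == X @ W up to float rounding:
assert np.allclose(X @ W, Xs @ Ws)
```

After smoothing, both `Xs` and `Ws` have flatter per-channel dynamic ranges than the originals, which is what makes INT8 quantization of both sides feasible.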