[WIP] speed up CodebookQuantizedTensor inference #1607

DerekLiu35 · 2025-01-23T17:07:12Z

This PR tries to speed up inference for #1195.

So far I've ported the code1x16 matrix multiplication kernels from AQLM to torchao.

Usage

Demo notebook with preliminary tests

Preliminary tests show a ~2x speedup for a linear layer compared to the fallback implementation.
Performance matches AQLM's for (1, 8) block sizes.
The quantization error of the kernel seems much higher than that of the fallback implementation; this needs more investigation.
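
To make the comparison above concrete, here is a minimal sketch of what a dequantize-then-matmul fallback and a fused lookup path compute for (1, 8) blocks, and how the gap between them can be measured. It is plain Python so it runs without torch or a GPU. All names (`dequantize`, `linear`, `fused_linear`, `relative_error`) and the toy sizes are hypothetical and are not torchao or AQLM APIs. It also assumes "1x16" means a single codebook indexed by 16-bit codes; the toy codebook here has only 16 entries.

```python
import random

def dequantize(codes, codebook, block):
    """Rebuild each weight row by concatenating the codebook entries its codes select."""
    rows = []
    for code_row in codes:
        row = []
        for c in code_row:
            entry = codebook[c]
            assert len(entry) == block
            row.extend(entry)
        rows.append(row)
    return rows

def linear(x, weight):
    """Reference path: y[o] = sum_i x[i] * weight[o][i] for one input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, w_row)) for w_row in weight]

def fused_linear(x, codes, codebook, block):
    """Fused path: look up each block's codebook entry and accumulate its dot
    product with the matching slice of x, without materializing the weight."""
    out = []
    for code_row in codes:
        acc = 0.0
        for j, c in enumerate(code_row):
            x_block = x[j * block:(j + 1) * block]
            acc += sum(xi * wi for xi, wi in zip(x_block, codebook[c]))
        out.append(acc)
    return out

def relative_error(ref, out):
    """L2 norm of (out - ref) divided by the L2 norm of ref."""
    num = sum((o - r) ** 2 for o, r in zip(out, ref)) ** 0.5
    den = sum(r ** 2 for r in ref) ** 0.5
    return num / den

random.seed(0)
block = 8          # (1, 8) block size: 8 consecutive input weights share one code
num_entries = 16   # toy size (assumption); 16-bit codes would give 65536 entries
codebook = [[random.gauss(0, 1) for _ in range(block)] for _ in range(num_entries)]
codes = [[random.randrange(num_entries) for _ in range(4)] for _ in range(3)]  # 3 rows, 4 blocks each
x = [random.gauss(0, 1) for _ in range(4 * block)]

w_hat = dequantize(codes, codebook, block)
y_fallback = linear(x, w_hat)
y_fused = fused_linear(x, codes, codebook, block)
err = relative_error(y_fallback, y_fused)
```

Both paths read the same codes and codebook, so in this sketch they should differ only by floating-point summation order. A much larger gap in a real kernel would suggest a layout, indexing, or dtype mismatch rather than inherent quantization error, though that is a guess and not a diagnosis of this PR.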

ToDo

Improve performance for block sizes of (1, 1).
Add a Triton fallback for other block sizes.
Add tests.
Update benchmarks.

Would appreciate feedback on the approach and the preliminary results.
@jerryzh168 @pawarmanasi07

pytorch-bot · 2025-01-23T17:07:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1607

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

add code1x16 kernel

9288321

facebook-github-bot added the CLA Signed label Jan 23, 2025

fix lint

ee09e00
