
Jcaip/llm bsr #1601

Draft: wants to merge 16 commits into main

Conversation

@jcaip (Contributor) commented Jan 22, 2025

This PR promotes Supermask and block sparsity from prototype -> torchao.sparsity

It adds a new public API for SupermaskLinear, which users can use to apply Supermask to their models with:

sparsify_(model, lambda x: SupermaskLinear.to_dense(x, sparsity_level=0.9))
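For readers unfamiliar with Supermask, the core idea can be sketched in plain Python. This is a hedged emulation, not the torchao implementation; apply_supermask, weights, and scores are illustrative names. Supermask learns per-weight scores and keeps only the top-scoring fraction of weights, zeroing the rest:

```python
def apply_supermask(weights, scores, sparsity_level):
    """Zero out the lowest-scoring `sparsity_level` fraction of weights.

    `scores` stands in for the learned per-weight scores that Supermask
    trains; at a 0.9 sparsity level, 90% of weights would be zeroed.
    """
    n = len(weights)
    n_prune = int(n * sparsity_level)
    # indices of the n_prune lowest-scoring weights
    pruned = set(sorted(range(n), key=lambda i: scores[i])[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

weights = [0.5, -1.2, 0.3, 2.0, -0.1]
scores  = [0.9,  0.1, 0.4, 0.8,  0.2]
sparse = apply_supermask(weights, scores, sparsity_level=0.4)
# the two lowest-scoring weights (indices 1 and 4) are zeroed
```

In the real API the resulting sparse weights are then converted to a dense (or BSR) tensor via SupermaskLinear.to_dense, as shown in the sparsify_ call above.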

I have also updated the existing Supermask SAM testing code to use this new API.

It also ports over the Triton addmm kernels from core, to let us modify them as needed. I've added padding support to the Triton kernel, which was a 4 tok/s improvement (214 -> 218).

  • Adds padding to BSR
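The padding idea can be sketched independently of Triton. This is a plain-Python illustration under the assumption that blocks are padded up to the 16-wide tile the kernel currently hardcodes; pad_to_multiple and pad_block are hypothetical helpers, not code from this PR. Rounding the block's column dimension up to the tile width and zero-filling means tile loads never run off the end of a block:

```python
def pad_to_multiple(size, align=16):
    """Round size up to the nearest multiple of align (the tile width)."""
    return ((size + align - 1) // align) * align

def pad_block(block, align=16):
    """Zero-pad a 2-D block's column dimension up to a multiple of align."""
    cols = len(block[0])
    padded_cols = pad_to_multiple(cols, align)
    return [row + [0.0] * (padded_cols - cols) for row in block]

block = [[1.0] * 10, [2.0] * 10]   # a 2x10 block, narrower than one 16-wide tile
padded = pad_block(block)          # now 2x16; the last 6 columns are zero
```

Because the padding is zeros, it contributes nothing to the addmm accumulation, so results are unchanged while loads stay tile-aligned.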

Benchmarking on an H100 with the following commands:

export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder
export MODEL_REPO=meta-llama/Meta-Llama-3.1-8B

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --prefill_size 8192 --profile baseline_prefill
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --prefill_size 8192 --sparsity bsr --profile bsr_prefill
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --profile baseline
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --sparsity bsr --profile bsr

yields a 134 -> 218 tok/s improvement on LLM decoding.

pytorch-bot bot commented Jan 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1601

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures

As of commit b414b49 with merge base 11333ba (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 22, 2025
offsets = tl.arange(0, 16)[None, :]
dense_block = tl.load(
dense_block_ptrs + dense_tiled_row_stride * dense_row_idx,
mask=offsets < BLOCKSIZE_COL,
@jcaip (author) commented on the diff:
cc @cpuhrsch masking added in here for the padding

row_block_arange = tl.arange(0, BLOCKSIZE_ROW)
inner_block_arange = tl.arange(0, BLOCKSIZE_INNER)

if BLOCKSIZE_COL < 16 or BLOCKSIZE_COL % 16 != 0:
This is the padding logic (need to do this properly instead of hardcoding 16)
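The effect of the mask in the tl.load above can be emulated in plain Python. This is a hedged sketch, not the kernel itself; masked_load is an illustrative stand-in for Triton's tl.load(ptr, mask=..., other=0.0), where lanes whose mask is False return the fill value instead of reading memory:

```python
def masked_load(buf, offsets, mask, other=0.0):
    """Emulate a masked tile load: lanes where mask is False yield `other`
    instead of reading (possibly out-of-bounds) memory."""
    return [buf[o] if m else other for o, m in zip(offsets, mask)]

BLOCKSIZE_COL = 10           # actual (unpadded) block width
TILE = 16                    # the hardcoded tile width the comment wants generalized
offsets = list(range(TILE))
mask = [o < BLOCKSIZE_COL for o in offsets]   # mirrors offsets < BLOCKSIZE_COL
row = [float(i) for i in range(BLOCKSIZE_COL)]
loaded = masked_load(row, offsets, mask)
# lanes 10..15 are filled with 0.0 rather than reading past the block
```

Generalizing the hardcoded 16 would mean deriving the tile width (and hence the mask) from the kernel's launch parameters rather than a constant.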



@implements(aten.sum.dim_IntList)
def block_sparse_sum(func, types, args, kwargs):
cc @cpuhrsch This computes the sum correctly for the fast-path reduction, but it doesn't work with torch.compile because of L300, temp_sum = bsr.values()[start:stop], which errors out on data-dependent control flow.

I think we can instead add a new kernel to the bsr_dense_addmm implementation to handle the fast path there, and rewrite this in Triton.
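For context, the reduction being described can be sketched in plain Python. This is a hedged illustration of the BSR layout, not torchao code; bsr_row_sums is a hypothetical helper. values[crow_indices[i]:crow_indices[i+1]] holds block-row i's blocks, and that slice is exactly the data-dependent step torch.compile rejects:

```python
def bsr_row_sums(crow_indices, values):
    """Sum all elements of each block-row of a BSR matrix.

    `crow_indices` has one entry per block-row plus a terminator;
    `values` is a list of 2-D blocks (lists of lists) in row order.
    """
    sums = []
    for i in range(len(crow_indices) - 1):
        start, stop = crow_indices[i], crow_indices[i + 1]
        blocks = values[start:stop]   # the data-dependent slice at issue
        sums.append(sum(sum(sum(row) for row in b) for b in blocks))
    return sums

# block-row 0 has one 2x2 block; block-row 1 has two
crow = [0, 1, 3]
values = [[[1.0, 1.0], [1.0, 1.0]],
          [[2.0, 0.0], [0.0, 2.0]],
          [[1.0, 2.0], [3.0, 4.0]]]
row_sums = bsr_row_sums(crow, values)   # [4.0, 14.0]
```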

A contributor replied:
Right, so here it's useful to view crow_indices and values as a NestedTensor and then use sum from there :) This is possible because values + crow_indices is like values + offsets.
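One way to read this suggestion in plain Python: with (crow_indices, values) treated as (offsets, values), the per-segment sum can be computed from a prefix sum over per-block totals, with no data-dependent slice. This is my interpretation of the NestedTensor-style reduction, not necessarily how it is implemented; segment_sums is an illustrative helper:

```python
def segment_sums(offsets, per_block_sums):
    """Sum contiguous segments of `per_block_sums`, where segment i spans
    indices offsets[i]..offsets[i+1] (the crow_indices/offsets convention)."""
    prefix = [0]
    for s in per_block_sums:
        prefix.append(prefix[-1] + s)
    # segment i's total is a difference of prefix sums: no slicing needed
    return [prefix[offsets[i + 1]] - prefix[offsets[i]]
            for i in range(len(offsets) - 1)]

# per-block totals for the three blocks of the example above
row_sums = segment_sums([0, 1, 3], [4, 4, 10])   # [4, 14]
```

Because every step here is a fixed-shape scan plus an indexed gather, this formulation should be friendlier to torch.compile than slicing values with data-dependent bounds.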
