What's Changed
- misc: remove legacy logic that supported quantization for other types by @guocuimi in #350
- upgrade pytorch to 2.5.1 by @guocuimi in #351
- added cuda 12.6 build image by @guocuimi in #353
- fix cmake version issue for manylinux image by @guocuimi in #354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
- ci: fix package test workflow by @guocuimi in #357
- kernel: refactor attention kernel for readability by @guocuimi in #358
- dev: config dev container with proper extensions by @guocuimi in #359
- kernel: added attention bench for profiling before optimization by @guocuimi in #360
- kernel: added logits soft cap support for attention by @guocuimi in #362 (see the sketch after this list)
- tools: added attention traits viewer by @guocuimi in #363
- kernel: added swizzle for shared memory to avoid bank conflicts by @guocuimi in #364
- kernel: added causal, alibi, and sliding window masks for attention by @guocuimi in #365 (see the sketch after this list)
- kernel: refactor attention kernel and add more unittests by @guocuimi in #366
- kernel: added M/N OOB handling for attention by @guocuimi in #367
- tools: update svg build to generate a smaller file by @guocuimi in #368
- kernel: added attention params and tile for different input types by @guocuimi in #369
- kernel: added mqa and gqa support for attention by @guocuimi in #370
- kernel: added var len and paged kv cache support for attention by @guocuimi in #371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
- kernel: added attention kernel launch by @guocuimi in #373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
- kernel: added head_dim=96 support for attention by @guocuimi in #376
- kernel: optimize attention kernel performance by @guocuimi in #377
- upgrade cutlass to 3.7.0 by @guocuimi in #379
- kernel: handle kv block range for attention kernel by @guocuimi in #382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
- kernel: separate oob iterations for better performance by @guocuimi in #384
- refactor: remove batch_prefill interface by @guocuimi in #385
- refactor: stop building the flash_infer kernel by @guocuimi in #386
- feat: integrate in-house scale attention and use it by default by @guocuimi in #380
- kernel: only zfill k once to improve perf for attention by @guocuimi in #387
- refactor: skip flash_attn build by @guocuimi in #388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
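For context on the logits soft cap change in #362: soft capping typically squashes each attention score through a scaled tanh before the softmax, so a few extreme logits cannot dominate. Below is a minimal C++ sketch assuming the common `cap * tanh(score / cap)` form; the function name and cap value are illustrative and not taken from the kernel itself.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical sketch of logits soft capping: bounds each attention score to
// (-cap, +cap) before the softmax.
float soft_cap(float score, float cap) {
  return cap * std::tanh(score / cap);
}

int main() {
  const float cap = 30.0f;  // illustrative cap value, not from the release notes
  const float scores[] = {1.0f, 50.0f, 200.0f};
  for (float score : scores) {
    std::printf("score=%6.1f  capped=%6.2f\n", score, soft_cap(score, cap));
  }
  return 0;
}
```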
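Similarly, for the causal, ALiBi, and sliding window masks added in #365, the per-position logic usually reduces to an additive bias on each (query, key) score pair. The sketch below is written under that assumption; the struct, names, and signature are hypothetical and not the kernel's actual interface.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical parameters for the three mask/bias styles named in #365.
struct MaskParams {
  int32_t sliding_window;  // <= 0 disables the window
  float alibi_slope;       // 0.0f disables the ALiBi bias
};

// Additive bias for one (query, key) position pair within a sequence:
// -inf masks the position out, otherwise a (possibly zero) ALiBi penalty.
float mask_bias(int32_t q_idx, int32_t kv_idx, const MaskParams& p) {
  // Causal: a query cannot attend to keys that come after it.
  if (kv_idx > q_idx) return -INFINITY;
  // Sliding window: only the most recent `sliding_window` keys are visible.
  if (p.sliding_window > 0 && q_idx - kv_idx >= p.sliding_window) return -INFINITY;
  // ALiBi: a linear distance penalty added to the score.
  return -p.alibi_slope * static_cast<float>(q_idx - kv_idx);
}

int main() {
  MaskParams p{/*sliding_window=*/4, /*alibi_slope=*/0.1f};
  std::printf("bias(q=5, kv=3) = %f\n", mask_bias(5, 3, p));  // in-window ALiBi penalty
  std::printf("bias(q=5, kv=0) = %f\n", mask_bias(5, 0, p));  // outside window -> -inf
  return 0;
}
```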
Full Changelog: v0.2.2...v0.2.3