What's Changed
- misc: remove legacy logic that supported quantization for other types by @guocuimi in #350
- upgrade pytorch to 2.5.1 by @guocuimi in #351
- added cuda 12.6 build image by @guocuimi in #353
- fix cmake version issue for manylinux image by @guocuimi in #354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
- ci: fix package test workflow by @guocuimi in #357
- kernel: refactor attention kernel for readability by @guocuimi in #358
- dev: config dev container with proper extensions by @guocuimi in #359
- kernel: added attention bench for profiling before optimization by @guocuimi in #360
- kernel: added logits soft cap support for attention by @guocuimi in #362 (see the sketch after this list)
- tools: added attention traits viewer by @guocuimi in #363
- kernel: added swizzle for shared memory to avoid bank conflicts by @guocuimi in #364
- kernel: added causal, alibi, and sliding window masks for attention by @guocuimi in #365 (see the sketch after this list)
- kernel: refactor attention kernel and add more unittests by @guocuimi in #366
- kernel: added M/N OOB handling for attention by @guocuimi in #367
- tools: update svg build to generate a smaller file by @guocuimi in #368
- kernel: added attention params and tile for different input types by @guocuimi in #369
- kernel: added mqa and gqa support for attention by @guocuimi in #370
- kernel: added var len and paged kv cache support for attention by @guocuimi in #371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
- kernel: added attention kernel launch by @guocuimi in #373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
- kernel: added head_dim=96 support for attention by @guocuimi in #376
- kernel: optimize attention kernel performance by @guocuimi in #377
- upgrade cutlass to 3.7.0 by @guocuimi in #379
- kernel: handle kv block range for attention kernel by @guocuimi in #382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
- kernel: separate oob iterations for better performance by @guocuimi in #384
- refactor: remove batch_prefill interface by @guocuimi in #385
- refactor: stop building the flash_infer kernel by @guocuimi in #386
- feat: integrate in-house scale attention and use it by default by @guocuimi in #380
- kernel: only zfill k once to improve perf for attention by @guocuimi in #387
- refactor: skip flash_attn build by @guocuimi in #388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
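For context on the logits soft cap change in #362: soft capping typically squashes each attention score through a scaled tanh before the softmax, so a few extreme logits cannot dominate. Below is a minimal C++ sketch assuming the common `cap * tanh(score / cap)` form; the function name and cap value are illustrative and not taken from the kernel itself.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical sketch of logits soft capping: bounds each attention score to
// (-cap, +cap) before the softmax.
float soft_cap(float score, float cap) {
  return cap * std::tanh(score / cap);
}

int main() {
  const float cap = 30.0f;  // illustrative cap value, not from the release notes
  const float scores[] = {1.0f, 50.0f, 200.0f};
  for (float score : scores) {
    std::printf("score=%6.1f  capped=%6.2f\n", score, soft_cap(score, cap));
  }
  return 0;
}
```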
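Similarly, for the causal, ALiBi, and sliding window masks added in #365, the per-position logic usually reduces to an additive bias on each (query, key) score pair. The sketch below is written under that assumption; the struct, names, and signature are hypothetical and not the kernel's actual interface.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical parameters for the three mask/bias styles named in #365.
struct MaskParams {
  int32_t sliding_window;  // <= 0 disables the window
  float alibi_slope;       // 0.0f disables the ALiBi bias
};

// Additive bias for one (query, key) position pair within a sequence:
// -inf masks the position out, otherwise a (possibly zero) ALiBi penalty.
float mask_bias(int32_t q_idx, int32_t kv_idx, const MaskParams& p) {
  // Causal: a query cannot attend to keys that come after it.
  if (kv_idx > q_idx) return -INFINITY;
  // Sliding window: only the most recent `sliding_window` keys are visible.
  if (p.sliding_window > 0 && q_idx - kv_idx >= p.sliding_window) return -INFINITY;
  // ALiBi: a linear distance penalty added to the score.
  return -p.alibi_slope * static_cast<float>(q_idx - kv_idx);
}

int main() {
  MaskParams p{/*sliding_window=*/4, /*alibi_slope=*/0.1f};
  std::printf("bias(q=5, kv=3) = %f\n", mask_bias(5, 3, p));  // in-window ALiBi penalty
  std::printf("bias(q=5, kv=0) = %f\n", mask_bias(5, 0, p));  // outside window -> -inf
  return 0;
}
```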
Full Changelog: v0.2.2...v0.2.3