v0.2.3

Released by @github-actions on 26 Jan 22:13

What's Changed

  • misc: remove legacy logic to support quantization for other types. by @guocuimi in #350
  • upgrade pytorch to 2.5.1 by @guocuimi in #351
  • added cuda 12.6 build image by @guocuimi in #353
  • fix cmake version issue for manylinux image by @guocuimi in #354
  • kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
  • ci: fix package test workflow by @guocuimi in #357
  • kernel: refactor attention kernel for readability by @guocuimi in #358
  • dev: config dev container with proper extensions by @guocuimi in #359
  • kernel: added attention bench for profiling before optimization by @guocuimi in #360
  • kernel: added logits soft cap support for attention by @guocuimi in #362 (see the soft-cap sketch after this list)
  • tools: added attention traits viewer by @guocuimi in #363
  • kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
  • kernel: added causal, alibi, sliding window mask for attention by @guocuimi in #365
  • kernel: refactor attention kernel and add more unittests by @guocuimi in #366
  • kernel: added M/N OOB handling for attention by @guocuimi in #367
  • tools: update svg build to generate small file by @guocuimi in #368
  • kernel: Added attention params and tile for different input types. by @guocuimi in #369
  • kernel: added mqa and gqa support for attention by @guocuimi in #370 (see the GQA head-mapping sketch after this list)
  • kernel: added var len and paged kv cache support for attention by @guocuimi in #371
  • kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
  • kernel: added attention kernel launch by @guocuimi in #373
  • kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
  • kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
  • kernel: added head_dim=96 support for attention by @guocuimi in #376
  • kernel: optimize attention kernel performance by @guocuimi in #377
  • upgrade cutlass to 3.7.0 by @guocuimi in #379
  • kernel: handle kv block range for attention kernel by @guocuimi in #382
  • kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
  • kernel: separate oob iterations for better performance. by @guocuimi in #384
  • refactor: remove batch_prefill interface by @guocuimi in #385
  • refactor: stop build flash_infer kernel by @guocuimi in #386
  • feat: integrate in-house scale attention and use it by default by @guocuimi in #380
  • kernel: only zfill k once to improve perf for attention by @guocuimi in #387
  • refactor: skip flash_attn build by @guocuimi in #388
  • refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389 (see the paged KV slot-id sketch after this list)
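
For context on the logits soft cap added in #362, here is a minimal sketch of the common tanh-based formulation, which squashes attention scores so their magnitude never exceeds the cap. The function name and the convention that a non-positive cap means "disabled" are assumptions for illustration, not the kernel's actual API:

```cpp
#include <cmath>
#include <vector>

// Hedged sketch: soft-cap attention logits with tanh so |score| <= soft_cap.
// A non-positive cap is treated here as "capping disabled" (an assumption).
void apply_logits_soft_cap(std::vector<float>& scores, float soft_cap) {
  if (soft_cap <= 0.0f) return;
  for (float& s : scores) {
    s = soft_cap * std::tanh(s / soft_cap);
  }
}
```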
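For the MQA/GQA support in #370, a hedged sketch of the query-head to KV-head mapping that grouped-query attention typically uses; MQA is the special case of a single KV head. It assumes the query head count is a multiple of the KV head count, and the names are illustrative:

```cpp
// Hedged sketch of the query-head -> KV-head mapping for grouped-query
// attention (GQA); MQA corresponds to num_kv_heads == 1.
// Assumes num_q_heads % num_kv_heads == 0. Names are illustrative.
int kv_head_for_query_head(int q_head, int num_q_heads, int num_kv_heads) {
  const int group_size = num_q_heads / num_kv_heads;  // query heads per KV head
  return q_head / group_size;
}
```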
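For the paged KV cache support in #371 and the slot-id calculation touched in #389, a rough sketch of how a token's logical position is commonly mapped through a per-sequence block table to a physical cache slot. The block-table layout and function names here are assumptions, not the project's API:

```cpp
#include <vector>

// Hedged sketch of paged-KV slot-id calculation: look up the physical block
// for the token's logical block, then add the in-block offset.
int slot_id_for_token(const std::vector<int>& block_table,
                      int token_pos, int block_size) {
  const int logical_block = token_pos / block_size;
  const int offset = token_pos % block_size;
  return block_table[logical_block] * block_size + offset;
}
```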

Full Changelog: v0.2.2...v0.2.3