Enable Vanilla Bwd and Refactor #86

micmelesse · 2024-10-14T22:59:01Z

This PR

enables vanilla bwd.
refactors our code into several files.
fixes a bug in the softmax lse returned by the forward pass.
adds an alternate mode that uses exp instead of exp2. This was useful for debugging issues in both forward and backward.
creates interfaces for fa and pytorch that use implementation functions with explicit parameters.
adds a pytorch ref implementation that mimics our triton kernels for testing.
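To illustrate why an exp mode is a useful cross-check against the exp2 mode, here is a minimal sketch in plain Python (not the kernel code). It assumes the exp2 path pre-scales the scores by log2(e) and converts the log-sum-exp back to natural log by multiplying with ln(2). The function names are hypothetical.

```python
import math

LOG2E = math.log2(math.e)  # 1 / ln(2)
LN2 = math.log(2.0)

def softmax_lse_exp(scores):
    """Softmax and natural-log LSE computed with exp (the plain path)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps], m + math.log(total)

def softmax_lse_exp2(scores):
    """Same quantities computed with powers of two on pre-scaled scores."""
    scaled = [s * LOG2E for s in scores]      # move into the base-2 domain
    m2 = max(scaled)
    exps = [2.0 ** (s - m2) for s in scaled]  # exp2(s - m2)
    total = sum(exps)
    lse_base2 = m2 + math.log2(total)
    # The LSE is in base 2 here; multiply by ln(2) to get the natural-log value.
    return [e / total for e in exps], lse_base2 * LN2

scores = [0.5, -1.25, 3.0, 2.0]
p_exp, lse_exp = softmax_lse_exp(scores)
p_exp2, lse_exp2 = softmax_lse_exp2(scores)
```

Both paths should agree up to floating-point rounding, so a mismatch in either the probabilities or the LSE points at a scaling mistake rather than at the math.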

Vanilla BWD This is a combination of 79 commits. save test_flash_attn_output use impl functions pass layout add ref move around impls fix stride issue save oai kernel add baseline impl save bwd kernel working remove old impl remove block_ptrs from bwd pass padded dmodel and apply masking. the old test cases work but cases with small d don't work save save more prints rename to M to L save add notes add old_bwd back fa failure fails in kernels too isolate new bwd and keep old bwd in place clean up softmax_lse does not match reference LOG flag softmax_lse with LN2 move qk_scale to loop pass ln2 to fwd just print kernel input test softmax output from forward test exp_scores_triton save all the ref create ref USE_EXP2 path return scores mask scores when returning them. Basic impl test passes scores and output match show max_diff return score needs to be adjusted as we find new maxes all good outputs. old style RCP2 example prep bwd_impl test save try openai save fix softmax_lse bug test_op_bwd_impl starting to work! new kernel. exp2 works but exp is failing fix bwd exp2 add m and n masks. small cases still don't work match old and new kernel prints compare old and new print inputs save old kernel match on dv dq works compare to pytorch including softmax in forward fix bwd impl bug small sizes in bwd impl work old bwd test pass. Moving on to kernel tests dq, dk and dv are filled in place if given. Need to match cast to match fa fix non bug fix dv mismatch. use_exp2 was set to true in fwd fix case up 128 refactor and clean up a bit more issue is that dq and dk are not zeros dq must be zeroed out ignore segfaults fa ref and my ref match! all tests run use tolerance 1e-3 we need to figure out preprocessing save clean up save test delta diff move old impl out new preprocess function preprocessing_use_o flag working _bwd_preprocess_use_p basic cases pass all green fwd exp2 usage is done right before exp
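The commit log above mentions a delta preprocessing step with `preprocessing_use_o` and `_bwd_preprocess_use_p` variants. As a hedged illustration, the sketch below checks the standard attention-backward identity that the per-row delta can be computed either from the output and its gradient or from the probabilities and their gradient. That these two flags select between exactly these two forms is an assumption, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(s):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        t = sum(e)
        out.append([x / t for x in e])
    return out

def rowsum_of_product(a, b):
    """Per-row sum of the elementwise product of two matrices."""
    return [sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(a, b)]

# Tiny single-head example: 2 queries, 3 keys, head dim 2.
q = [[0.1, 0.7], [-0.4, 0.3]]
k = [[0.5, -0.2], [0.9, 0.4], [-0.6, 0.8]]
v = [[1.0, 2.0], [0.5, -1.0], [-0.3, 0.6]]
do = [[0.2, -0.1], [0.4, 0.3]]  # upstream gradient of the output

p = softmax_rows(matmul(q, transpose(k)))  # probabilities (softmax scale omitted)
o = matmul(p, v)                           # forward output
dp = matmul(do, transpose(v))              # gradient with respect to p

delta_from_o = rowsum_of_product(o, do)    # needs only o and do
delta_from_p = rowsum_of_product(p, dp)    # needs the full p and dp
```

The two deltas match because `o = p @ v` and `dp = do @ v^T`, so both sums expand to the same triple product. The form based on `o` avoids materializing `p`.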

micmelesse · 2024-10-15T00:09:13Z

The kernel tests pass on MI300, but the MI200 CI runners seem to have issues.

Use Strides This is a combination of 11 commits. use strides in bwd add layout test in forward fix shape layout function smaller tests save fix varlen error no headsize passed to bwd deal with varlen layout save save save save
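For context on what using strides buys, here is a small plain-Python sketch of addressing the same logical (batch, head, seq, dim) element under two memory layouts. The layout strings `bshd` and `bhsd` and all function names are illustrative assumptions, not names taken from this PR.

```python
def contiguous_strides(shape):
    """Row-major element strides for a shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def logical_strides(layout, batch, heads, seqlen, dim):
    """Strides reported in (batch, head, seq, dim) order for a memory layout."""
    sizes = {"b": batch, "h": heads, "s": seqlen, "d": dim}
    physical = contiguous_strides([sizes[c] for c in layout])
    by_axis = dict(zip(layout, physical))
    return tuple(by_axis[c] for c in "bhsd")

def offset(strides, b, h, s, d):
    """Flat element offset of logical index (b, h, s, d)."""
    return sum(i * st for i, st in zip((b, h, s, d), strides))

bshd = logical_strides("bshd", batch=2, heads=3, seqlen=4, dim=8)
bhsd = logical_strides("bhsd", batch=2, heads=3, seqlen=4, dim=8)
```

A kernel that takes the four strides as arguments can index either layout with the same `offset` arithmetic, which is why passing strides avoids layout-specific code paths.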

micmelesse added 10 commits October 14, 2024 13:56

refactor

5e76b58

refactor 2

a9e4851

refactor 3

2f95807

fix bug

6182013

try ci

0140baf

add flag

5e54821

rename to utils

d005ab9

skip test_op_fwd_decode_int4_kv

22342a7

reduce head size

51b2e8e

micmelesse added 2 commits October 14, 2024 19:15

try again

5071356

go back to old head sizes

ce80a7e

micmelesse force-pushed the micmelesse/enable_bwd branch from c6c9559 to 0bd8120 Compare October 16, 2024 16:18

Use Strides

a168999

micmelesse force-pushed the micmelesse/enable_bwd branch from 8d1c2f7 to a168999 Compare October 16, 2024 16:40

micmelesse added 14 commits October 16, 2024 11:56

use gen scripts

a1c7674

varlen fwd passing

1a761a0

core fwd ref impl

e5ee0f8

fix minor bugs

b9b1f24

wrap varlen- launcher attention_forward_pytorch_ref_impl

52f6bdc

varlen backward ref added

ed1e4fe

add offsets for varlen

a86af8a

fix delta bug

72cec14

varlen bwd working

bc25ca2

save

e5e5307

runs on Mi200

94fbe0e

just test basics

79871d7

save

0de209c

fix bug

2db6502
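Several of the commits above deal with varlen offsets. As a rough sketch of the idea, assuming the usual packed layout where sequences are concatenated along the token axis and located through cumulative sequence lengths, the following plain Python shows how offsets map back to individual sequences. The helper names are hypothetical.

```python
from itertools import accumulate

def cumulative_seqlens(seqlens):
    """Offsets of each sequence start in the packed token axis, plus the total."""
    return [0] + list(accumulate(seqlens))

def split_packed(packed_tokens, cu_seqlens):
    """Recover per-sequence slices from a packed token list."""
    return [packed_tokens[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]

seqlens = [3, 1, 4]
cu_seqlens = cumulative_seqlens(seqlens)
packed = list(range(sum(seqlens)))  # stand-in for packed token rows
per_sequence = split_packed(packed, cu_seqlens)
```

Each sequence `i` occupies rows `cu_seqlens[i]` to `cu_seqlens[i + 1]`, so any per-sequence tensor, including the softmax lse, has to be indexed with the same offsets.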

micmelesse added 25 commits October 19, 2024 22:48

fix varlen in64 bug

d192946

add ref

c8274bb

test_impl working with causal

cdc916f

fix qkvpacked issue

550b2ba

qkvpacked run tests

f0c782d

remove test_backward

2169bc3

save

da22a8d

just test output

1fa1124

dump into tensors

290a922

softmaxlse layout for varlen

4e05a5f

small cases working

695dbc3

bwd thd green. although maybe some oom

b5d663c

forward out and lse are good. Something wrong with backward ref

bab936b

make varlen ref work

ca2670c

save work, ref is working mostly

628fbfd

91 failed, 6542 passed, 6336 skipped, 1 warning

6e2ed2c

ref is all green

e1395fb

debug flag in utils

b762787

found bad softmax_lse in varlen fwd

0421893

fix bug in softmax lse. strides in varlen werenot right

1e70ab9

add causal tests and 32*32 bwd doesnot have segfault

a676a4b

save

997db5c

fix oom by reducing block size for small heads

18707e0

bwd ref with causal working

e3d8fb1

test impl

953bddc
