fused_linear_cross_entropy: Move float32 cast into kernel #238
Conversation
Thanks! This makes sense. I did try something similar before but saw divergence compared with casting on the torch side (not sure why, maybe I did it wrong). Also, the bfloat16 convergence test is not actually being run at the moment due to #176. After that fix is merged, we can run the convergence tests with bf16 to see if there is any gap.
I added a
Cool! I will take a deeper look today or tomorrow. This is exciting!
Can we merge this? In its current form, Liger Kernel is broken.
@hansonw can we resolve the conflict? ty
We can merge this once the conflict is resolved. Thanks!!
Summary
Another small optimization :) The `logits_chunk.float()` allocation may be surprisingly large; e.g., Cohere models have 256K vocabs, so each logit chunk in float32 could be something like 1024 * 256K * 4 = 1GB of VRAM (even more if the chunk size is larger).
I actually don't think any explicit casting is required within the Triton kernel, since the intermediate softmax variables like `m`, `d`, etc. are already float32 by default, so with type promotion the calculations should all be float32 regardless. However, I added explicit `.cast(tl.float32)` casts around all of the `X_ptr` loads to make this more obvious to the reader. In either case, the actual `liger_cross_entropy_kernel` runs so quickly that I don't think there's any performance difference; this change is purely to save the float32 allocation. (It might be more efficient without the explicit casts, but I was not able to measure anything: even with a 1K x 256K logit matrix the kernel kind of runs instantly lol.)
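For context on the `m` and `d` variables mentioned above: they are the running maximum and running denominator of the online softmax that the kernel accumulates in float32. A pure-Python sketch of that accumulation (illustrative only; the function name and the list-based loop are mine, not taken from the kernel):

```python
import math

def online_softmax_denominator(logits):
    """Single-pass computation of the running max (m) and the running
    sum of exp(x - m) (d), as in an online-softmax accumulation."""
    m = -math.inf  # running maximum so far
    d = 0.0        # running denominator, rescaled whenever m changes
    for x in logits:
        x = float(x)  # promote to full precision, analogous to .cast(tl.float32)
        m_new = max(m, x)
        # rescale the old denominator to the new max, then add this term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d
```

Because `m` and `d` live in full precision from the start, loading the logits in bf16 and promoting per element is enough; no float32 copy of the whole chunk needs to be materialized.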
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence
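The VRAM saving described in the summary can be sanity-checked with quick arithmetic (the 1024-row chunk and 256K vocab are the example values from the summary, not figures read out of the code):

```python
# Back-of-the-envelope size of the float32 logits chunk the PR avoids allocating.
chunk_rows = 1024
vocab_size = 256 * 1024   # 256K vocabulary, as in the Cohere example
bytes_per_float32 = 4

chunk_bytes = chunk_rows * vocab_size * bytes_per_float32
print(f"float32 logits chunk: {chunk_bytes / 2**30:.2f} GiB")  # prints 1.00 GiB
```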