
Multihead arch cutlass int8 qkv average #2035

Draft
wants to merge 89 commits into master
Conversation

almaudoh
Contributor

@almaudoh almaudoh commented Jun 2, 2024

This is a temporary PR to allow testing of a cuda int8 branch that may ultimately not be merged.

ankan-ban and others added 30 commits March 22, 2022 22:28
- Skip connection add before layer norm now has a scaling factor (alpha).
- Replace conv layer of value and mlh heads with an embedding layer when attention body is used.
- Will be removed once it's fixed.
- Also fix scratch space calculation: a factor of sizeof(DataType) was missing.
- To handle bigger/wider networks.
1.3% improvement in BT2 on RTX 4090
15.6% improvement in test BT3 network with 64 heads.
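The alpha-scaled skip connection mentioned in the commit notes above can be sketched as follows. This is a minimal illustration, not the repository's CUDA kernel: the function name, the per-vector normalization, and the absence of learned gamma/beta parameters are all assumptions for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: residual add with a scaling factor (alpha) applied
// before layer normalization, as the commit message describes.
std::vector<float> scaled_skip_layernorm(const std::vector<float>& x,
                                         const std::vector<float>& skip,
                                         float alpha, float eps = 1e-5f) {
    std::vector<float> y(x.size());
    // Scaled skip-connection add: y = x + alpha * skip.
    for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] + alpha * skip[i];
    // Plain layer norm over the vector (no learned scale/bias in this sketch).
    float mean = 0.f;
    for (float v : y) mean += v;
    mean /= y.size();
    float var = 0.f;
    for (float v : y) var += (v - mean) * (v - mean);
    var /= y.size();
    float inv = 1.f / std::sqrt(var + eps);
    for (float& v : y) v = (v - mean) * inv;
    return y;
}
```

After normalization the output has (approximately) zero mean and unit variance regardless of alpha; alpha only changes how strongly the skip branch influences the normalized result.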
- Only tries doing the KQV dense layers in int8.
- Accuracy seems reasonable.
- Right now quantization isn't fused, and de-quantization is done with the bias add.
- Both of the above can possibly be fused with more work.
- Also need to attempt INT8 for other dense layers (MHA dense, FFN1 and FFN2).
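The int8 path these commits describe — quantize the input, run the dense layer in int8 with int32 accumulation, then de-quantize as part of the bias add — can be sketched as below. This is a scalar reference sketch under assumed per-tensor symmetric scales; the actual PR uses CUTLASS int8 GEMM kernels, and the function and parameter names here are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an int8 dense layer: quantize input and weights to
// int8, accumulate products in int32, and fold de-quantization into the
// bias add (matching "de-quantization is done with bias add" above).
std::vector<float> int8_dense(const std::vector<float>& in,    // [k]
                              const std::vector<float>& w,     // [k*n]
                              const std::vector<float>& bias,  // [n]
                              float in_scale, float w_scale,
                              int k, int n) {
    auto quant = [](float v, float s) {
        int q = static_cast<int>(std::lround(v / s));
        if (q > 127) q = 127;    // saturate to int8 range
        if (q < -128) q = -128;
        return static_cast<int8_t>(q);
    };
    std::vector<int8_t> qin(k), qw(k * n);
    for (int i = 0; i < k; ++i) qin[i] = quant(in[i], in_scale);
    for (int i = 0; i < k * n; ++i) qw[i] = quant(w[i], w_scale);
    std::vector<float> out(n);
    for (int j = 0; j < n; ++j) {
        int32_t acc = 0;  // int32 accumulation of int8 products
        for (int i = 0; i < k; ++i) acc += int32_t(qin[i]) * qw[i * n + j];
        // De-quantization fused with the bias add.
        out[j] = acc * in_scale * w_scale + bias[j];
    }
    return out;
}
```

With scales chosen so the inputs quantize exactly (e.g. powers of two), the result matches the float GEMM; in general, accuracy depends on how well the scales cover the tensors' ranges, which is why the commit notes accuracy "seems reasonable" rather than exact.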
almaudoh-1 and others added 30 commits March 2, 2024 17:50
…rnels for clipping of inputs for non-int8 inference.
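The input clipping mentioned in the truncated commit message above might look like this minimal sketch; the clip range and function name are assumptions, not the repository's actual kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: clamp activations into a fixed range before
// non-int8 inference, mirroring the range used on the quantized path.
void clip_inputs(std::vector<float>& x, float lo, float hi) {
    for (float& v : x) v = std::min(std::max(v, lo), hi);
}
```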