Softmax #9

Open
kroggen opened this issue Nov 11, 2024 · 2 comments

Comments

@kroggen

kroggen commented Nov 11, 2024

In your code there is this:

        if normalize_type == 'softmax': 
            # NOTE: softmax = exp_l1_norm
            # outputs = F.softmax(inputs, dim=dim) * inputs.shape[dim]
            nonlinear_outputs = torch.exp(inputs)
            norm_outputs = nonlinear_outputs / torch.norm(nonlinear_outputs, p=1, dim=dim, keepdim=True) * inputs.shape[dim]
            outputs = norm_outputs

However, softmax divides by the plain (signed) sum of the exponentials:

softmax(x)_i = exp(x_i) / sum(exp(x_j))

But your code is dividing by the L1 norm, which is the sum of the absolute values.

The sum takes the sign of negative values into account:

    sum([-2, 1, 3]) = -2 + 1 + 3 = 2

While the L1 norm does not:

    L1([-2, 1, 3]) = |-2| + |1| + |3| = 2 + 1 + 3 = 6

The comment softmax = exp_l1_norm should be modified accordingly.

The code also multiplies by the token dimension, so in this case the sum of the attention scores is not 1.
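
For illustration, here is a minimal sketch of the two quantities above, using the same example values (plain PyTorch, not code from this repo):

    import torch

    x = torch.tensor([-2.0, 1.0, 3.0])

    print(x.sum())             # tensor(2.)  -- signed sum: -2 + 1 + 3
    print(torch.norm(x, p=1))  # tensor(6.)  -- L1 norm: |-2| + |1| + |3|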

@Haiyang-W
Owner

Thanks for your careful checking. Actually, after exp() the values are all positive, so softmax is equal to exp_l1_norm.
I'm multiplying by inputs.shape[dim] here only to balance the variance, so that we can achieve relatively good performance. If you remove it, the performance will be really bad.
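
This equivalence is easy to check numerically. A minimal sketch with random inputs (standard F.softmax and torch.norm, not the repo's module):

    import torch
    import torch.nn.functional as F

    inputs = torch.randn(4, 8)
    dim = -1

    # softmax: exp followed by division by the sum
    softmax_out = F.softmax(inputs, dim=dim)

    # exp_l1_norm: exp followed by division by the L1 norm; exp() is always
    # positive, so the L1 norm here equals the plain sum
    e = torch.exp(inputs)
    l1_out = e / torch.norm(e, p=1, dim=dim, keepdim=True)

    print(torch.allclose(softmax_out, l1_out))  # True (before the shape[dim] scaling)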

@Haiyang-W
Owner

Using softmax directly, without scaling by the token dimension, makes the std of the outputs very low, and the performance will be very poor.
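
A rough illustration of the std gap (hypothetical shapes, plain PyTorch; the exact numbers are not from this repo):

    import torch
    import torch.nn.functional as F

    inputs = torch.randn(4, 1024)
    dim = -1
    n = inputs.shape[dim]

    plain = F.softmax(inputs, dim=dim)  # rows sum to 1, so entries are ~1/n
    scaled = plain * n                  # mean rescaled to ~1; rows no longer sum to 1

    print(plain.std().item())   # tiny, on the order of 1/n
    print(scaled.std().item())  # roughly n times larger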
