Softmax #9

Open
kroggen opened this issue Nov 11, 2024 · 2 comments

Comments

@kroggen

kroggen commented Nov 11, 2024

In your code there is this:

        if normalize_type == 'softmax': 
            # NOTE: softmax = exp_l1_norm
            # outputs = F.softmax(inputs, dim=dim) * inputs.shape[dim]
            nonlinear_outputs = torch.exp(inputs)
            norm_outputs = nonlinear_outputs / torch.norm(nonlinear_outputs, p=1, dim=dim, keepdim=True) * inputs.shape[dim]
            outputs = norm_outputs

However, softmax divides by the plain (signed) sum of the exponentials:

softmax(x)_i = exp(x_i) / sum(exp(x_j))

But your code is dividing by the L1 norm, which is the sum of the absolute values.

The sum takes the sign of negative values into account:

    sum([-2, 1, 3]) = -2 + 1 + 3 = 2

While the L1 norm does not:

    L1([-2, 1, 3]) = |-2| + |1| + |3| = 2 + 1 + 3 = 6

The comment softmax = exp_l1_norm should be modified accordingly.

The code also multiplies by the token dimension, so in this case the sum of the attention scores is not 1.
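
For illustration, here is a minimal sketch of the two quantities above, using the same example values (plain PyTorch, not code from this repo):

    import torch

    x = torch.tensor([-2.0, 1.0, 3.0])

    print(x.sum())             # tensor(2.)  -- signed sum: -2 + 1 + 3
    print(torch.norm(x, p=1))  # tensor(6.)  -- L1 norm: |-2| + |1| + |3|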

@Haiyang-W
Owner

Thanks for your careful checking. Actually, after exp() the values are all positive, so softmax is equal to exp_l1_norm.
I'm multiplying by inputs.shape[dim] here only to balance the variance, so that we can achieve relatively good performance. If you remove it, the performance will be really bad.
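
This equivalence is easy to check numerically. A minimal sketch with random inputs (standard F.softmax and torch.norm, not the repo's module):

    import torch
    import torch.nn.functional as F

    inputs = torch.randn(4, 8)
    dim = -1

    # softmax: exp followed by division by the sum
    softmax_out = F.softmax(inputs, dim=dim)

    # exp_l1_norm: exp followed by division by the L1 norm; exp() is always
    # positive, so the L1 norm here equals the plain sum
    e = torch.exp(inputs)
    l1_out = e / torch.norm(e, p=1, dim=dim, keepdim=True)

    print(torch.allclose(softmax_out, l1_out))  # True (before the shape[dim] scaling)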

@Haiyang-W
Owner

Using softmax directly, without scaling by the token dimension, makes the std of the outputs very low, and the performance will be very poor.
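
A rough illustration of the std gap (hypothetical shapes, plain PyTorch; the exact numbers are not from this repo):

    import torch
    import torch.nn.functional as F

    inputs = torch.randn(4, 1024)
    dim = -1
    n = inputs.shape[dim]

    plain = F.softmax(inputs, dim=dim)  # rows sum to 1, so entries are ~1/n
    scaled = plain * n                  # mean rescaled to ~1; rows no longer sum to 1

    print(plain.std().item())   # tiny, on the order of 1/n
    print(scaled.std().item())  # roughly n times larger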
