Why do transformer hidden layers use softmax as the activation function?

Softmax appears in the attention layers rather than as a general hidden-layer activation: attention computes a weighted average of the value vectors, so the attention weights need to be non-negative and sum to one. Softmax ensures exactly that.
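Here is a minimal NumPy sketch (the function and variable names are just illustrative) showing that applying softmax to the raw attention scores yields weights that sum to one, so the attention output is a proper weighted average of the values:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy scaled dot-product attention: one query attending over four keys/values.
rng = np.random.default_rng(0)
d_k = 8
q = rng.normal(size=(d_k,))      # one query vector
K = rng.normal(size=(4, d_k))    # four key vectors
V = rng.normal(size=(4, d_k))    # four value vectors

scores = K @ q / np.sqrt(d_k)    # raw scores can be any real numbers
weights = softmax(scores)        # softmax turns them into a distribution

print(weights, weights.sum())    # non-negative weights, sum is 1.0
output = weights @ V             # output is a weighted average of the values
```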

