Why do transformer hidden layers use softmax as an activation function?

Softmax layer Clarification - Deep Learning Specialization / DLS Course 2 - DeepLearning.AI
In the above topic, we said that softmax is used in the output layer, not the hidden layer.

However, in the Transformer, multi-head attention is a hidden layer, and inside multi-head attention, QK^T is passed through a softmax function… Why?
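For reference, the scaled dot-product attention used inside each head is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

where the softmax is applied row-wise, so each query position gets a set of weights over the key positions.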

Because attention computes a weighted average of the value vectors, and the attention weights should therefore be non-negative and sum to one. Softmax ensures exactly that.
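A minimal NumPy sketch of this idea (the dimensions and array names here are made up for illustration, not taken from the course): the softmax over QK^T produces rows that sum to 1, and multiplying by V then gives a weighted average of the value vectors.

```python
import numpy as np

# Toy sizes: 4 positions, d_k = d_v = 8 (arbitrary for this example).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: scores -> softmax -> weighted average of V.
scores = Q @ K.T / np.sqrt(K.shape[-1])   # shape (4, 4)
weights = softmax(scores, axis=-1)        # each row is non-negative and sums to 1
output = weights @ V                      # each output row is a convex combination of V's rows

print(weights.sum(axis=-1))  # -> [1. 1. 1. 1.]
```

So the softmax here isn't acting as a per-neuron activation like ReLU in a hidden layer; it is normalizing the attention scores so they can serve as averaging weights.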

