Why do transformer hidden layers use softmax as the activation function?

Softmax appears in the attention layers rather than as a general hidden-layer activation: attention computes a weighted average of the value vectors, so the attention weights need to be non-negative and sum to one. Softmax ensures exactly that.
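Here is a minimal NumPy sketch (the function and variable names are just illustrative) showing that applying softmax to the raw attention scores yields weights that sum to one, so the attention output is a proper weighted average of the values:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy scaled dot-product attention: one query attending over four keys/values.
rng = np.random.default_rng(0)
d_k = 8
q = rng.normal(size=(d_k,))      # one query vector
K = rng.normal(size=(4, d_k))    # four key vectors
V = rng.normal(size=(4, d_k))    # four value vectors

scores = K @ q / np.sqrt(d_k)    # raw scores can be any real numbers
weights = softmax(scores)        # softmax turns them into a distribution

print(weights, weights.sum())    # non-negative weights, sum is 1.0
output = weights @ V             # output is a weighted average of the values
```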

