RNN Feedforward - Need for weight and bias matrices before softmax on activation for y hat

Why do we need to multiply the activation output by a weight matrix and add a bias before applying softmax?
Why not apply softmax directly to the activation output?
Can we train a model like this?

While you can apply softmax on top of tanh, I’ve not seen two activations in direct succession in any NN. The operation in the diagram is like having a dense layer between the output of the tanh layer and the final output.
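For concreteness, here is a minimal NumPy sketch of one RNN time step in the usual course notation (the parameter names `Wax`, `Waa`, `Wya`, `ba`, `by` are my assumptions, not taken from the diagram); the weight/bias pair before softmax is exactly that dense layer:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    # Hidden state: tanh keeps every entry in [-1, 1]
    a_t = np.tanh(Wax @ x_t + Waa @ a_prev + ba)
    # The extra weight/bias before softmax acts like a dense layer
    # on top of the tanh output, producing the prediction y_hat
    y_hat = softmax(Wya @ a_t + by)
    return a_t, y_hat
```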

As for tanh, it was widely used as an activation function in hidden layers before ReLU became popular. In this case, using tanh keeps the outputs in the range [-1, 1], which prevents the activations from growing large as they could if only a linear activation were used.
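A quick numerical illustration of that boundedness point (the matrix size and scale here are arbitrary, chosen only to make the contrast visible):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 1.5      # hypothetical recurrent weights
a_lin = a_tanh = rng.standard_normal((8, 1))

# Repeatedly apply the recurrent step with a linear vs. a tanh activation
for _ in range(20):
    a_lin = W @ a_lin                      # linear: magnitudes can blow up
    a_tanh = np.tanh(W @ a_tanh)           # tanh: always bounded in [-1, 1]

print(np.abs(a_lin).max())    # huge, grows with repeated multiplication by W
print(np.abs(a_tanh).max())   # never exceeds 1
```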

The architecture was proposed after evaluating its performance on standard datasets. You are welcome to try things out. Given that TensorFlow supports automatic differentiation, experimentation should be a lot smoother.


Sure, but the question is: what is the dimension of your hidden state (the a^{<t>} values), and what is the dimension of your output space (the range of possible \hat{y} values)? There is no reason to believe there is any relationship between those two values in the general case, right? And if they aren’t the same, then directly applying softmax is not going to work. That is the point of the weight and bias values there: you need a linear transformation to map to the correct target space.
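To make the dimension argument concrete, here is an assumed setup with a hidden state of size 64 and an output vocabulary of size 10,000 (both numbers are made up for illustration). Softmax applied directly to a^{<t>} could only give a distribution over 64 entries, so the linear map plus bias is what lands you in the right target space:

```python
import numpy as np

n_a, n_y = 64, 10_000            # hidden-state size vs. output (vocabulary) size
a_t = np.random.randn(n_a, 1)    # hidden state a^{<t>} at one time step

Wya = np.random.randn(n_y, n_a) * 0.01   # linear map from hidden space to output space
by = np.zeros((n_y, 1))                  # bias in the output space

z = Wya @ a_t + by               # logits, shape (10000, 1): one entry per output class
y_hat = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax over the output space

print(y_hat.shape)               # (10000, 1); softmax on a_t alone would only be (64, 1)
```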
