Why do we need to multiply the activation output by a weight matrix and add a bias before applying softmax?

Why not apply softmax directly to the activation output?

Can we train a model like this?

While you can apply softmax on top of tanh, I’ve not seen two activations applied in succession in any NN. The operation in the diagram is equivalent to having a dense layer between the output of the tanh layer and the final output.

As far as `tanh` is concerned, it was widely used as an activation function in hidden layers before `relu` became popular. In this case, using tanh keeps the outputs in the range [-1, 1], which prevents activations from growing large, as they could if only a linear activation were used.

The architecture was proposed after evaluating its performance on standard datasets. You are welcome to try things out. Given that TensorFlow supports automatic differentiation, experimentation should be a lot smoother.

Sure, but the question is: what is the dimension of your hidden state (the a^{<t>} values), and what is the dimension of your output space (the range of possible \hat{y} values)? There is no reason to believe those two dimensions are related in the general case, right? And if they are not the same, directly applying softmax is not going to work. That is the point of the weight and bias values there: you need a linear transformation to map to the correct target space.
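A minimal numpy sketch of that mapping, with hypothetical sizes (a hidden state of 64 units and an output vocabulary of 10 classes; the names `W_ya` and `b_y` are just illustrative): softmax over the raw hidden state would give a distribution over 64 "classes", not the 10 we actually want.

```python
import numpy as np

hidden_dim, vocab_dim = 64, 10   # assumed sizes for illustration

rng = np.random.default_rng(0)
a_t = rng.standard_normal(hidden_dim)                 # a^{<t>}, shape (64,)
W_ya = rng.standard_normal((vocab_dim, hidden_dim))   # projects 64 -> 10
b_y = np.zeros(vocab_dim)

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# The dense layer maps the hidden state into the target space,
# then softmax turns the logits into a probability distribution.
y_hat = softmax(W_ya @ a_t + b_y)
print(y_hat.shape)               # (10,)
print(y_hat.sum())               # sums to 1
```

Without `W_ya` and `b_y`, the shapes simply would not line up with the target labels, which is why the linear transformation has to sit between the recurrent activation and the softmax.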