I have questions about this slide
Honestly, I don’t understand how it works. So, we have a one-hot input X. As I remember, we need tanh and sigmoid to classify elements into 2 groups. So one of these classifications influences the next classification. Can you explain what happens here?
In binary classification problems, we use sigmoid as the output activation, not tanh. But both of these are perfectly valid activation functions for the internal layers of a network. Of course an RNN is a little different than a multi-layer FC net or CNN. With an RNN, there is just one “cell” and it gets used repeatedly at each “timestep”, and it has two outputs: the new “hidden state” a^{<t>} and the actual output of that timestep, which is \hat{y}^{<t>}. The other thing about RNNs is that they come in lots of types, and it depends on what the output is at each timestep. In the example Prof Ng is showing here, it must be a “yes/no” answer of some sort, but it’s also very common for it to be a softmax output (e.g. in a translation problem). Notice that tanh is used as the activation on the “hidden state”, so those values can be both positive and negative.
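Just to make the two outputs concrete, here is a rough numpy sketch of a single RNN cell forward step. The weight names (Wax, Waa, Wya) and the shapes are my own illustration and may not match the exact notation in the course notebooks:

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One RNN timestep with two outputs: the new hidden state and y_hat.
    Illustrative shapes: x_t (n_x, m), a_prev (n_a, m),
    Wax (n_a, n_x), Waa (n_a, n_a), Wya (n_y, n_a)."""
    # New hidden state: tanh, so the values can be both positive and negative
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Per-timestep output: sigmoid here for a "yes/no" answer,
    # but this could just as well be a softmax (e.g. in a translation model)
    y_hat_t = 1.0 / (1.0 + np.exp(-(Wya @ a_t + by)))
    return a_t, y_hat_t

# Tiny usage example with made-up dimensions
n_x, n_a, n_y, m = 5, 4, 1, 1            # input size, hidden units, outputs, batch
rng = np.random.default_rng(0)
x_t = np.zeros((n_x, m)); x_t[2, 0] = 1  # a one-hot input x^{<t>}
a_prev = np.zeros((n_a, m))              # a^{<0>} initialized to zeros
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))
a_t, y_hat_t = rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by)
```

The same cell (same weights) would then be called again with a_t as the new a_prev and the next x^{<t+1>} as the input, which is what "one cell used repeatedly at each timestep" means.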
Also note that activation functions are always applied “elementwise”, so it really depends on what the \hat{y} values represent at each timestep. You don’t give a reference to which lecture the slide is from. Knowing that might shed a bit more light here. But maybe the best next step is to rewind and watch the lecture again with what I said above in mind. I’ll bet it will make more sense the second time through now that you have a bit more context.