In the screenshot, we see that we need weights to compute the activation a (marked in green). However, we also see that we need a separate set of weights and a bias to compute y-hat, which I don't really understand. Why do we need this separate set of weights? Isn't y-hat just the result of passing the value a through an activation function, like softmax?
I think you should watch the lectures again. What you are missing is that the a^{<t>} values are not just the output of an activation as in FC nets (Course 1) and CNNs (Course 4). Here they are the "hidden state" of the RNN model. That hidden state is modified at each timestep by the current input x^{<t>} and the previous value of the hidden state, using one set of weights. There is an activation function involved in calculating \hat{y}^{<t>}, but there is also another set of weights to apply before it.
Maybe a clearer way to state this is that there are two outputs at each timestep:
- A modified “hidden state” that is fed to the next timestep
- The actual \hat{y} output of the timestep
Each of those outputs involves its own set of inputs, its own set of weights, and an activation function.
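For concreteness, here is a minimal numpy sketch of one RNN timestep. The parameter names (Wax, Waa, Wya, ba, by) follow the course's notation, but the function signature and toy dimensions are my own for illustration, not the exact interface of the assignment:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis (axis 0)
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    # First set of weights (Wax, Waa, ba): update the hidden state
    # from the current input and the previous hidden state.
    a_next = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Second set of weights (Wya, by): map the new hidden state to
    # the prediction y-hat for this timestep.
    y_hat = softmax(Wya @ a_next + by)
    return a_next, y_hat

# Toy sizes: n_x input features, n_a hidden units, n_y classes, m examples
n_x, n_a, n_y, m = 3, 5, 2, 10
rng = np.random.default_rng(0)
x_t    = rng.standard_normal((n_x, m))
a_prev = np.zeros((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba  = np.zeros((n_a, 1))
by  = np.zeros((n_y, 1))

a_next, y_hat = rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by)
print(a_next.shape, y_hat.shape)  # (5, 10) (2, 10)
```

Notice that the tanh line uses only the weights that update the hidden state, while the softmax line uses the separate Wya and by, which is exactly the extra set of weights the question is asking about.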
This will also become clearer when you get to the first assignment and actually have to write the code to implement all this.