Hi @sahina,
think of an RNN for language modeling and Andrew’s example “The cat, which ate a lot of chocolate cookies, were full and left the dinner untouched.”, i.e. the current RNN assigns \hat{y}^{<10>} a higher probability to “were” than to “was”. Thus \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) produces a high error, while all \mathcal{L}^{<t>}(...), t<10, are low. So \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) must be what drives the weights to be updated in such a way that the RNN stores “singular” as information in the hidden states a^{<2>}, ..., a^{<10>} once it sees “cat”.
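To make that dependence explicit, here is a sketch of the BPTT gradient of that single loss with respect to the shared recurrent weights (I am assuming the course's recurrence a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a), and the last factor treats a^{<t-1>} as fixed):

\frac{\partial \mathcal{L}^{<10>}}{\partial W_{aa}} = \sum_{t=1}^{10} \frac{\partial \mathcal{L}^{<10>}}{\partial a^{<10>}} \left( \prod_{k=t+1}^{10} \frac{\partial a^{<k>}}{\partial a^{<k-1>}} \right) \frac{\partial a^{<t>}}{\partial W_{aa}}

The sum runs over every time step at which W_{aa} was applied, so the single loss \mathcal{L}^{<10>} contributes gradient terms that flow back through a^{<9>}, a^{<8>}, ..., a^{<2>}.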
What I do not understand: since each time step uses the same weights, why must backprop propagate the error \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) back to t=2 for the relevant weight update? Why not update the weights directly at t=10?