Hi @sahina,
think of an RNN for language modeling and Andrew’s example “The cat, which ate a lot of chocolate cookies, were full and left the dinner untouched.”, i.e. the current RNN assigns \hat{y}^{<10>} a higher probability to “were” than to “was”. Thus \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) produces a high error, while all \mathcal{L}^{<t>}(...), t<10, are low. So \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) must be what drives the weights to be updated in such a way that the RNN stores “singular” as information in the hidden states a^{<2>}, ..., a^{<10>} once it sees “cat”.
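To make that dependence explicit, here is a sketch of the BPTT gradient of that single loss with respect to the shared recurrent weights (I am assuming the course's recurrence a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a), and the last factor treats a^{<t-1>} as fixed):

\frac{\partial \mathcal{L}^{<10>}}{\partial W_{aa}} = \sum_{t=1}^{10} \frac{\partial \mathcal{L}^{<10>}}{\partial a^{<10>}} \left( \prod_{k=t+1}^{10} \frac{\partial a^{<k>}}{\partial a^{<k-1>}} \right) \frac{\partial a^{<t>}}{\partial W_{aa}}

The sum runs over every time step at which W_{aa} was applied, so the single loss \mathcal{L}^{<10>} contributes gradient terms that flow back through a^{<9>}, a^{<8>}, ..., a^{<2>}.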
What I do not understand: since each time step uses the same weights, why must backprop propagate the error \mathcal{L}^{<10>}(\hat{y}^{<10>}, “was”) back to t=2 for the relevant weight update? Why not update the weights directly at t=10?