I have a basic theoretical question related to the RNN introductory videos. When backpropagation is explained for a typical many-to-many RNN, the entire sequence is used for a single forward pass and backprop update (with the loss being the sum of the individual losses on the y_hat<i> outputs).
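For concreteness, here is a minimal PyTorch sketch of what I understand the videos to describe (this is my own illustration, not course code; the names x, y, rnn_cell, readout and all the shapes are made up):

```python
# Full-sequence BPTT for a many-to-many RNN: one forward pass over the whole
# sequence, losses summed over all y_hat<t>, then a single backward/update.
import torch
import torch.nn as nn

T, batch, n_in, n_hidden, n_out = 50, 8, 10, 32, 5
x = torch.randn(T, batch, n_in)            # hypothetical input sequence
y = torch.randint(0, n_out, (T, batch))    # hypothetical per-timestep targets

rnn_cell = nn.RNNCell(n_in, n_hidden)
readout = nn.Linear(n_hidden, n_out)
params = list(rnn_cell.parameters()) + list(readout.parameters())
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.CrossEntropyLoss()

a = torch.zeros(batch, n_hidden)           # a<0>
total_loss = 0.0
for t in range(T):                         # forward pass over the entire sequence
    a = rnn_cell(x[t], a)                  # a<t>
    y_hat = readout(a)                     # y_hat<t>
    total_loss = total_loss + loss_fn(y_hat, y[t])

opt.zero_grad()
total_loss.backward()                      # single backprop through all T steps
opt.step()
```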
My question is: why can't we do a forward/backward update step by step (i.e., do a forward/backward update for the first timestep, then use the updated weights to train on the next timestep)? Note that the history/hidden state (a<i>) is still carried over (I am not talking about a trivial one-to-one MLP). It seems this would "mitigate" the vanishing gradient problem (the problem of retaining old information in the hidden state remains, but that is somewhat different from the original vanishing gradient problem).
Or, asking the same question from a slightly different angle: if the sequences are extremely long, you definitely want to break up the forward/backprop updates into segments (while carrying the hidden state over across segments). What is the trade-off in choosing the segment length (or even just picking one step at a time)? Is it purely a matter of computational efficiency, or is the whole learning process compromised?
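And here is a sketch of what I mean by breaking the updates into segments, reusing the definitions from the snippet above (again my own construction, essentially truncated BPTT; the segment length k is a made-up hyperparameter, and k = 1 is the per-timestep variant I asked about first):

```python
# Truncated BPTT: the hidden state is carried across segments but detached,
# so gradients only flow back at most k steps, and the weights are updated
# after every segment rather than once per sequence.
k = 10                                     # segment length (k = 1 -> per-step updates)
a = torch.zeros(batch, n_hidden)

for start in range(0, T, k):
    a = a.detach()                         # keep the value of a<t>, cut the gradient path
    seg_loss = 0.0
    for t in range(start, min(start + k, T)):
        a = rnn_cell(x[t], a)
        y_hat = readout(a)
        seg_loss = seg_loss + loss_fn(y_hat, y[t])

    opt.zero_grad()
    seg_loss.backward()                    # backprop only through this segment
    opt.step()                             # weights change before the next segment
```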