I understand how Back Propagation Through Time (BPTT) works in a Many-to-One RNN architecture.
For example, dW_aa, the partial derivative of the loss function with respect to W_aa, is given by the following equation (using BPTT):
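Written out in LaTeX, this is a sketch of what I mean. It assumes a vanilla RNN with a^{<t>} = tanh(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a) and a single loss L computed from the last hidden state a^{<t_x>}. The symbol \partial^{+} is my own shorthand for the "immediate" partial derivative that treats a^{<k-1>} as a constant, and I am ignoring transposes and matrix ordering:

```latex
\frac{\partial L}{\partial W_{aa}}
  = \sum_{k=1}^{t_x}
    \frac{\partial L}{\partial a^{\langle t_x \rangle}}
    \left( \prod_{j=k+1}^{t_x}
      \frac{\partial a^{\langle j \rangle}}{\partial a^{\langle j-1 \rangle}} \right)
    \frac{\partial^{+} a^{\langle k \rangle}}{\partial W_{aa}}
```

Under these assumptions the sum has t_x terms, one per time step at which W_aa is used.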
(Correct me if this is still wrong)
But when it comes to a Many-to-Many RNN, I'm not confident that my understanding is correct, so please check it. Is this true? All I added to the previous equation is the loss associated with each output unit.
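In LaTeX, my attempt looks roughly like the sketch below. It assumes the total loss is the sum of per-step losses, L = \sum_t L^{<t>}, that each L^{<t>} depends on W_aa only through a^{<t>}, and the same vanilla tanh RNN cell; \partial^{+} again denotes the immediate partial derivative with a^{<k-1>} held fixed, and matrix ordering is ignored:

```latex
\frac{\partial L}{\partial W_{aa}}
  = \sum_{t=1}^{t_x} \sum_{k=1}^{t}
    \frac{\partial L^{\langle t \rangle}}{\partial a^{\langle t \rangle}}
    \left( \prod_{j=k+1}^{t}
      \frac{\partial a^{\langle j \rangle}}{\partial a^{\langle j-1 \rangle}} \right)
    \frac{\partial^{+} a^{\langle k \rangle}}{\partial W_{aa}}
```

The only change from the single-loss case is the outer sum over t: the loss at step t backpropagates through steps t, t-1, ..., 1.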
So the number of terms in the summation is (t_x) + (t_x - 1) + (t_x - 2) + … + (1) = t_x(t_x + 1)/2 (an arithmetic series, not the factorial t_x!).
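To sanity-check both the double summation and the term count, here is a minimal Python sketch. It is not a general implementation: it assumes a scalar RNN a_t = tanh(w * a_{t-1} + u * x_t) with a squared-error loss at every step, and the names `forward`, `total_loss`, and `bptt_grad_w` are my own hypothetical helpers. It adds up one term per (t, k) pair with k <= t and compares the result with a finite-difference gradient.

```python
import math


def forward(w, u, xs, a0=0.0):
    """Return hidden states [a_0, a_1, ..., a_T] of a scalar tanh RNN."""
    states = [a0]
    for x in xs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states


def total_loss(w, u, xs, targets):
    """Many-to-many loss: sum over steps of 0.5 * (a_t - y_t)^2."""
    states = forward(w, u, xs)
    return sum(0.5 * (a - y) ** 2 for a, y in zip(states[1:], targets))


def bptt_grad_w(w, u, xs, targets):
    """Gradient of the total loss w.r.t. w as an explicit double sum.

    Returns (gradient, number_of_terms_in_the_sum).
    """
    states = forward(w, u, xs)
    T = len(xs)
    grad, n_terms = 0.0, 0
    for t in range(1, T + 1):              # loss at step t
        dL_da = states[t] - targets[t - 1]  # dL_t / da_t
        for k in range(1, t + 1):          # step k where w is used
            chain = 1.0
            for j in range(k + 1, t + 1):  # da_j / da_{j-1}
                chain *= (1.0 - states[j] ** 2) * w
            # Immediate partial of a_k w.r.t. w, with a_{k-1} held fixed.
            immediate = (1.0 - states[k] ** 2) * states[k - 1]
            grad += dL_da * chain * immediate
            n_terms += 1
    return grad, n_terms


if __name__ == "__main__":
    xs = [0.5, -1.0, 0.3, 0.8]
    targets = [0.1, -0.2, 0.4, 0.0]
    w, u, eps = 0.7, -0.4, 1e-6

    grad, n_terms = bptt_grad_w(w, u, xs, targets)
    numeric = (total_loss(w + eps, u, xs, targets)
               - total_loss(w - eps, u, xs, targets)) / (2 * eps)

    print("terms in the sum:", n_terms)  # 10 for t_x = 4
    print("BPTT gradient:   ", grad)
    print("finite difference:", numeric)
```

With t_x = 4 the loop visits 4 + 3 + 2 + 1 = 10 terms, which matches t_x(t_x + 1)/2 rather than 4! = 24. A practical implementation would not enumerate these terms separately; it would accumulate the gradient in a single backward pass over the time steps.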