Hi,

My question is a little bit complicated, so please bear with me for a moment. Suppose we have an RNN unrolled over three time steps (many-to-many).

The formula for the gradient \partial L / \partial W_{s} is:

\dfrac{\partial L}{\partial W_{s}} = \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial W_s} + \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial W_s} + \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_s} \\ + \dfrac{\partial L^{<2>}}{\partial y^{<2>}} \dfrac{\partial y^{<2>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial W_s} + \dfrac{\partial L^{<2>}}{\partial y^{<2>}} \dfrac{\partial y^{<2>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_s} \\ + \dfrac{\partial L^{<1>}}{\partial y^{<1>}} \dfrac{\partial y^{<1>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_s}

A similar formula applies when calculating \partial L / \partial W_x:

\dfrac{\partial L}{\partial W_{x}} = \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial W_x} + \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial W_x} + \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_x} \\ + \dfrac{\partial L^{<2>}}{\partial y^{<2>}} \dfrac{\partial y^{<2>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial W_x} + \dfrac{\partial L^{<2>}}{\partial y^{<2>}} \dfrac{\partial y^{<2>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_x} \\ + \dfrac{\partial L^{<1>}}{\partial y^{<1>}} \dfrac{\partial y^{<1>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_x}
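To make these sums concrete, here is a minimal numeric sketch using a scalar RNN. The assumed form is s^{<t>} = \tanh(W_s s^{<t-1>} + W_x x^{<t>}) with y^{<t>} = s^{<t>} and a squared-error loss, and all input/target numbers are hypothetical. It accumulates \partial L / \partial W_s term by term, exactly mirroring the expansion above, and checks the result against a finite-difference approximation:

```python
import numpy as np

# Hypothetical scalar RNN: s_t = tanh(w_s*s_{t-1} + w_x*x_t), y_t = s_t,
# L = sum_t 0.5*(y_t - target_t)^2. All numbers below are made up.
w_s, w_x = 0.5, 0.8
xs = [1.0, -0.5, 0.3]
targets = [0.2, -0.1, 0.4]

def forward(w_s, w_x):
    s, states = 0.0, []
    for x in xs:
        s = np.tanh(w_s * s + w_x * x)
        states.append(s)
    return states

def loss(w_s, w_x):
    return sum(0.5 * (s - t) ** 2 for s, t in zip(forward(w_s, w_x), targets))

# Analytic gradient: double sum over loss time steps and chain lengths,
# term by term as in the expansion of dL/dW_s above.
states = forward(w_s, w_x)
s_prev = [0.0] + states[:-1]                      # s^{<0>} = 0
pre = [w_s * sp + w_x * x for sp, x in zip(s_prev, xs)]
dtanh = [1 - np.tanh(a) ** 2 for a in pre]        # tanh'(a^{<k>})

grad_ws = 0.0
for t in range(3):                                # loss L^{<t>}
    chain = states[t] - targets[t]                # dL^{<t>}/ds^{<t>} (y = s here)
    for k in range(t, -1, -1):                    # walk back to s^{<k>}
        grad_ws += chain * dtanh[k] * s_prev[k]   # direct term ds^{<k>}/dW_s
        chain *= dtanh[k] * w_s                   # factor ds^{<k>}/ds^{<k-1>}

# Finite-difference check of the summed gradient
eps = 1e-6
num = (loss(w_s + eps, w_x) - loss(w_s - eps, w_x)) / (2 * eps)
print(abs(grad_ws - num) < 1e-6)  # prints True: the term-by-term sum is the gradient
```

The same double loop with `dtanh[k] * xs[k]` in place of `dtanh[k] * s_prev[k]` gives \partial L / \partial W_x.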

When the vanishing gradient problem is taught, it is said that the gradient contribution from earlier time steps becomes negligible. That sounds reasonable for the third term in the first line of each equation above. To be clear about it, I am writing those terms again below.

\dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_s} \\ \dfrac{\partial L^{<3>}}{\partial y^{<3>}} \dfrac{\partial y^{<3>}}{\partial s^{<3>}} \dfrac{\partial s^{<3>}}{\partial s^{<2>}} \dfrac{\partial s^{<2>}}{\partial s^{<1>}} \dfrac{\partial s^{<1>}}{\partial W_x}
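One way to see why such long-chained terms shrink: in the scalar case each factor \partial s^{<k>} / \partial s^{<k-1>} equals \tanh'(a^{<k>}) \cdot W_s, and since |\tanh'| \le 1, a chain of length T is bounded by |W_s|^T. A toy sketch (the slope 0.9 and W_s = 0.5 are assumed, hypothetical values):

```python
# Each backward step through time multiplies the gradient by
# ds^{<k>}/ds^{<k-1>} = tanh'(a^{<k>}) * W_s   (scalar case).
w_s = 0.5      # assumed recurrent weight, |w_s| < 1
slope = 0.9    # assumed tanh' value at each step
for T in [1, 3, 10, 50]:
    print(T, (slope * w_s) ** T)  # magnitude shrinks geometrically in T
```

With 50 time steps the factor is already below 1e-17, which is why chains reaching far back contribute essentially nothing.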

1-Given that these gradients comprise not only long-chained derivatives but also one-step and two-step chains, how does the vanishing gradient affect the total gradient?

2-Even though the long-chained terms are close to zero, we still have the one-step and two-step chained derivatives. Those should not vanish, right? Since we sum all of these terms, how can the overall gradient vanish?

3-What happens if we have a many-to-one structure or a different RNN architecture?

4-What role does parameter sharing play in this particular problem?

Thanks!

PS: I took the image from GeeksforGeeks. The model's prediction for the second time step should be Y_2, not Y_3. Also, I chose an RNN with three time steps for simplicity; my question extends to longer sequences.