Andrew said here at 11:00 that the GRU helps with the vanishing gradient problem.
I see that it only helps when the sigmoid input is a large negative number, since the update gate will be 0 and the cell will keep using c(t-1) and pass it forward. But if the sigmoid input is a large positive number, the update gate will be 1 and c(t) will be updated with c~(t).
Am I missing something?
Your understanding is correct. When γu is close to 1 (large positive input to the sigmoid), c(t) ≈ c~(t): the GRU overwrites the memory cell with the new candidate value and can learn new information. When γu is close to 0 (large negative input to the sigmoid), c(t) ≈ c(t−1): the cell state is preserved, which retains information over long sequences. The part that matters for the vanishing gradient problem is the second case. With the update equation c(t) = γu · c~(t) + (1 − γu) · c(t−1), the derivative ∂c(t)/∂c(t−1) = 1 − γu is close to 1 whenever the gate stays near 0, so the gradient passes through that time step almost unchanged instead of being repeatedly shrunk. Because the network can learn to keep the gate near 0 across many steps (and open it only when new information should be written), gradients can flow across many time steps without diminishing too much.
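Here is a minimal numerical sketch of that idea, using the simplified GRU update from the lecture. The scalar names z_u, c_prev, and c_tilde are just illustrative values I chose for this toy example, not course code or any library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified GRU memory update (course notation):
#   gamma_u = sigmoid(z_u)                       -- update gate
#   c_t     = gamma_u * c_tilde + (1 - gamma_u) * c_prev
def gru_memory_update(z_u, c_prev, c_tilde):
    gamma_u = sigmoid(z_u)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    # d c_t / d c_prev = 1 - gamma_u: close to 1 when the gate is nearly
    # closed, so the gradient passes through this step almost unchanged.
    grad_wrt_c_prev = 1.0 - gamma_u
    return c_t, grad_wrt_c_prev

# Gate nearly closed (large negative input): memory and gradient preserved.
print(gru_memory_update(z_u=-10.0, c_prev=0.7, c_tilde=-0.3))
# Gate nearly open (large positive input): memory overwritten by the candidate.
print(gru_memory_update(z_u=10.0, c_prev=0.7, c_tilde=-0.3))
```

The first call returns roughly (0.7, 1.0), showing both the preserved memory and the near-unit local gradient; the second returns roughly (-0.3, 0.0), showing the cell being overwritten by the candidate.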