Andrew said here at 11:00 that the GRU helps with the vanishing gradient problem.
I see that it only helps when the sigmoid input is a large negative number, since the update gate will be 0 and the cell will keep using c(t-1) and pass it forward. But if the sigmoid input is a large positive number, the update gate will be 1 and c(t) will be updated with c~(t).
Am I missing something?
Your understanding is correct. When γu is close to 1 (large positive input to the sigmoid), c(t) ≈ c~(t): the GRU overwrites the memory cell with the new candidate value and can learn new information. When γu is close to 0 (large negative input to the sigmoid), c(t) ≈ c(t−1): the cell state is preserved, which retains information over long sequences. The part that matters for the vanishing gradient problem is the second case. With the update equation c(t) = γu · c~(t) + (1 − γu) · c(t−1), the derivative ∂c(t)/∂c(t−1) = 1 − γu is close to 1 whenever the gate stays near 0, so the gradient passes through that time step almost unchanged instead of being repeatedly shrunk. Because the network can learn to keep the gate near 0 across many steps (and open it only when new information should be written), gradients can flow across many time steps without diminishing too much.
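Here is a minimal numerical sketch of that idea, using the simplified GRU update from the lecture. The scalar names z_u, c_prev, and c_tilde are just illustrative values I chose for this toy example, not course code or any library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified GRU memory update (course notation):
#   gamma_u = sigmoid(z_u)                       -- update gate
#   c_t     = gamma_u * c_tilde + (1 - gamma_u) * c_prev
def gru_memory_update(z_u, c_prev, c_tilde):
    gamma_u = sigmoid(z_u)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    # d c_t / d c_prev = 1 - gamma_u: close to 1 when the gate is nearly
    # closed, so the gradient passes through this step almost unchanged.
    grad_wrt_c_prev = 1.0 - gamma_u
    return c_t, grad_wrt_c_prev

# Gate nearly closed (large negative input): memory and gradient preserved.
print(gru_memory_update(z_u=-10.0, c_prev=0.7, c_tilde=-0.3))
# Gate nearly open (large positive input): memory overwritten by the candidate.
print(gru_memory_update(z_u=10.0, c_prev=0.7, c_tilde=-0.3))
```

The first call returns roughly (0.7, 1.0), showing both the preserved memory and the near-unit local gradient; the second returns roughly (-0.3, 0.0), showing the cell being overwritten by the candidate.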