Hi everyone,
In the video Attention Model, it was explained that the hidden state of the post-attention LSTM, denoted as s^t, is influenced by the previous hidden state s^{t-1}, the context vector c^t, and the previous output y^{t-1}. Could someone explain the exact formula or steps used to calculate s^t? Is it the same as a^{t} in the second screenshot from week 1?
For a standard LSTM without attention, a^t is the hidden state produced after applying the output gate. In the attention model, s^t of the post-attention LSTM plays the same role and is computed with the same gate structure, except that the cell additionally receives the attention context vector c^t. So yes: s^t behaves like a^t in a regular LSTM, but it is computed from the context vector c^t together with the previous hidden state s^{t-1} and the previous output y^{t-1}.
In the attention model, the LSTM cell takes the context vector c^t as an additional input, so the input to the cell becomes [ y^{t-1}, c^t ] instead of just x^t, where y^{t-1} is the previous output. The steps remain similar:

Candidate Cell State \tilde{m}^t (writing the LSTM memory cell as m^t to avoid overloading c^t, which here denotes the context vector):
\tilde{m}^t = \tanh(W_c [ s^{t-1}, c^t, y^{t-1} ] + b_c)

Update Gate \Gamma_u :
\Gamma_u = \sigma(W_u [ s^{t-1}, c^t, y^{t-1} ] + b_u)

Forget Gate \Gamma_f :
\Gamma_f = \sigma(W_f [ s^{t-1}, c^t, y^{t-1} ] + b_f)

Output Gate \Gamma_o :
\Gamma_o = \sigma(W_o [ s^{t-1}, c^t, y^{t-1} ] + b_o)

Memory Cell Update m^t :
m^t = \Gamma_f \cdot m^{t-1} + \Gamma_u \cdot \tilde{m}^t

Hidden State s^t :
s^t = \Gamma_o \cdot \tanh(m^t)
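The steps above can be sketched in NumPy as a single function for one time step. This is a minimal illustration, not the course's reference implementation: the function name, the `params` dictionary layout, and the use of `m` for the memory cell (to keep it distinct from the context vector `c^t`) are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def post_attention_lstm_step(s_prev, m_prev, context, y_prev, params):
    """One step of the post-attention LSTM (hypothetical helper).

    s_prev  : previous hidden state s^{t-1}, shape (n_s,)
    m_prev  : previous memory cell  m^{t-1}, shape (n_s,)
    context : attention context vector c^t,  shape (n_c,)
    y_prev  : previous output y^{t-1},       shape (n_y,)
    params  : dict with W_c, W_u, W_f, W_o of shape (n_s, n_s + n_c + n_y)
              and b_c, b_u, b_f, b_o of shape (n_s,)
    """
    # Concatenated gate input, matching [ s^{t-1}, c^t, y^{t-1} ]
    z = np.concatenate([s_prev, context, y_prev])

    m_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell state
    gamma_u = sigmoid(params["W_u"] @ z + params["b_u"])  # update gate
    gamma_f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate
    gamma_o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate

    m = gamma_f * m_prev + gamma_u * m_tilde  # memory cell update
    s = gamma_o * np.tanh(m)                  # new hidden state s^t
    return s, m
```

Note that the only difference from a plain LSTM step is the extra `context` segment in the concatenated input; everything downstream of the concatenation is unchanged.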
Thank you for your thorough explanation! This helps me connect the dots.