Hi everyone,
In the video Attention Model, it was explained that the hidden state of the post-attention LSTM, denoted as s^t, is influenced by the previous hidden state s^{t-1}, the context vector c^t, and the previous output y^{t-1}. Could someone explain the exact formula or steps used to calculate s^t? Is it the same as a^t in the second screenshot from week 1?
For a standard LSTM without attention, a^t is the hidden state obtained after applying the output gate. In the attention model, s^t of the post-attention LSTM is computed with the same structure, except that the cell also receives the attention context vector c^t as input. So s^t plays the same role as a^t in a regular LSTM; the only difference is that the gates and candidate values are computed from the context vector c^t together with the previous hidden state s^{t-1} and the previous output y^{t-1}.
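At the framework level this is just an ordinary LSTM whose step input carries the context vector. Here is a minimal Keras sketch of one decoder step; the names (n_s, post_attention_lstm, post_attention_step) and the tensor shapes are assumptions chosen for illustration, not the assignment's actual code:

```python
import tensorflow as tf

n_s = 64  # assumed hidden size of the post-attention LSTM (hypothetical value)

# return_state=True exposes both the hidden state s^t and the internal memory cell.
post_attention_lstm = tf.keras.layers.LSTM(n_s, return_state=True)

def post_attention_step(context_t, y_prev, s_prev, cell_prev):
    """One decoding step of the post-attention LSTM.

    context_t: attention context c^t,         shape (batch, 1, n_features)
    y_prev:    previous output y^{t-1},       shape (batch, 1, n_out)
    s_prev:    previous hidden state s^{t-1}, shape (batch, n_s)
    cell_prev: previous memory cell,          shape (batch, n_s)
    """
    # The step input is the concatenation [y^{t-1}, c^t]; the recurrent
    # state [s_prev, cell_prev] is carried forward via initial_state.
    x_t = tf.concat([y_prev, context_t], axis=-1)
    _, s_t, cell_t = post_attention_lstm(x_t, initial_state=[s_prev, cell_prev])
    return s_t, cell_t
```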
In the case of the attention model, the LSTM cell takes the context vector c^t as an additional input, so the input to the cell becomes [ y^{t-1}, c^t ] instead of just x^t, where y^{t-1} is the previous output. The steps remain the same as for a standard LSTM (a NumPy sketch after the list walks through them):
- Candidate memory cell \tilde{c}^t:
  \tilde{c}^t = \tanh(W_c [ s^{t-1}, c^t, y^{t-1} ] + b_c)
- Update gate \Gamma_u:
  \Gamma_u = \sigma(W_u [ s^{t-1}, c^t, y^{t-1} ] + b_u)
- Forget gate \Gamma_f:
  \Gamma_f = \sigma(W_f [ s^{t-1}, c^t, y^{t-1} ] + b_f)
- Output gate \Gamma_o:
  \Gamma_o = \sigma(W_o [ s^{t-1}, c^t, y^{t-1} ] + b_o)
- Memory cell update c_{cell}^t (written with a subscript here so it is not confused with the attention context vector c^t):
  c_{cell}^t = \Gamma_f \cdot c_{cell}^{t-1} + \Gamma_u \cdot \tilde{c}^t
- Hidden state s^t:
  s^t = \Gamma_o \cdot \tanh(c_{cell}^t)
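Putting those equations together, here is a minimal NumPy sketch of a single post-attention step. All names and shapes (n_s, n_c, n_y, the params dictionary) are hypothetical placeholders for illustration, not the course's actual variables:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def post_attention_lstm_step(s_prev, cell_prev, context_t, y_prev, params):
    """One step of the post-attention LSTM.

    s_prev:    previous hidden state s^{t-1},     shape (n_s,)
    cell_prev: previous memory cell c_cell^{t-1}, shape (n_s,)
    context_t: attention context vector c^t,      shape (n_c,)
    y_prev:    previous output y^{t-1},           shape (n_y,)
    params:    dict with W_c, W_u, W_f, W_o of shape (n_s, n_s + n_c + n_y)
               and biases b_c, b_u, b_f, b_o of shape (n_s,)
    """
    # Concatenate [s^{t-1}, c^t, y^{t-1}] as in the equations above.
    concat = np.concatenate([s_prev, context_t, y_prev])

    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory cell
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])  # update gate
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget gate
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])  # output gate

    cell_t = gamma_f * cell_prev + gamma_u * c_tilde           # memory cell update
    s_t = gamma_o * np.tanh(cell_t)                            # hidden state s^t
    return s_t, cell_t
```

The resulting s^t is then typically passed through a softmax output layer to produce y^t for the next step.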
Thank you for your thorough explanation! This helps me connect the dots.