Q1: In 2.1 - Attention Mechanism, is a⟨t'⟩ instead of a⟨t⟩ more reasonable?
Q2: In Exercise 2 - modelf (11th code cell), since we can define a layer using just one line in Step 1, a = Bidirectional(LSTM(units = n_a, return_sequences = True))(X),
why do we still have to implement a for loop in Step 2? Did I miss something?
I don’t understand your first question but the answer to your second question is in the notebook.
Ty -- length of the output sequence
...
# Step 2: Iterate for Ty steps
Right! The other thing to note is that there are two LSTMs involved here, right? The bidirectional one is the pre-attention one and it happens outside the for loop. The loop is for the post-attention model which is not bidirectional, right?
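To make that concrete, here's a minimal sketch of the structure (not the assignment code: the sizes, the single-Dense attention scorer, and the layer names are placeholders I made up). The pre-attention Bidirectional LSTM handles all Tx input steps in one call, but the post-attention LSTM has to be stepped Ty times in a loop, because its state from the previous step is what the attention uses to build the next context vector.

```python
# Minimal sketch, NOT the assignment solution: sizes and the single-Dense
# attention scorer are simplifications/assumptions for illustration only.
from tensorflow.keras.layers import (Input, LSTM, Bidirectional, Dense,
                                     Softmax, Dot, Concatenate, RepeatVector)
from tensorflow.keras.models import Model

Tx, Ty = 30, 10            # hypothetical input / output sequence lengths
n_a, n_s = 32, 64          # pre- and post-attention LSTM hidden sizes
in_vocab, out_vocab = 37, 11

# Shared layers, created once and reused at every output timestep
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor = Dense(1, activation="relu")   # simplified energy scorer
activator = Softmax(axis=1)            # softmax over the Tx axis -> alphas
dotor = Dot(axes=1)
post_lstm = LSTM(n_s, return_state=True)
output_layer = Dense(out_vocab, activation="softmax")

def one_step_attention(a, s_prev):
    """Weight all pre-attention states a<t'> using the previous post-attention state."""
    s_prev = repeator(s_prev)                  # (batch, Tx, n_s)
    concat = concatenator([a, s_prev])         # (batch, Tx, 2*n_a + n_s)
    energies = densor(concat)                  # (batch, Tx, 1)
    alphas = activator(energies)               # attention weights over t' = 1..Tx
    context = dotor([alphas, a])               # (batch, 1, 2*n_a) weighted sum
    return context

X = Input(shape=(Tx, in_vocab))
s0 = Input(shape=(n_s,))
c0 = Input(shape=(n_s,))
s, c = s0, c0
outputs = []

# Step 1: the pre-attention Bidirectional LSTM runs over all Tx steps in ONE call
a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X)

# Step 2: the post-attention LSTM must be advanced one step per iteration,
# because its previous state s is needed to compute the next context vector
for t in range(Ty):
    context = one_step_attention(a, s)
    _, s, c = post_lstm(context, initial_state=[s, c])
    outputs.append(output_layer(s))

model = Model(inputs=[X, s0, c0], outputs=outputs)
```

Note that a is computed once for the whole input sequence, while s changes on every iteration, so each context vector can attend to different input positions.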
I’m also having trouble with question 1, but I think the distinction they are making between a^{<t>} and a^{<t'>} is that the attention mechanism looks at the a values from all the timesteps of the bidirectional “pre-attention” LSTM when generating the attention for a particular timestep of the post-attention LSTM. Where the “attention” goes at each output timestep may need to be different, right? The point is that this is actually being learned during training. But this stuff is pretty complicated, and I need to go back and rewatch the video they reference there to make sure I’ve got it clear in my mind.
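Writing out what I believe the lecture notation says (so take this with a grain of salt until I rewatch the video): the context fed to the post-attention LSTM at output step t is a weighted sum over all the pre-attention activations a⟨t'⟩,

$$
\mathrm{context}^{\langle t \rangle} \;=\; \sum_{t'=1}^{T_x} \alpha^{\langle t,\, t' \rangle}\, a^{\langle t' \rangle},
\qquad \sum_{t'=1}^{T_x} \alpha^{\langle t,\, t' \rangle} = 1 .
$$

The sum runs over the input timesteps t', which is why a⟨t'⟩ rather than a⟨t⟩ shows up inside the sum.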
Thanks for replying.
In Q1 I mean that in the highlighted line of the screenshot, a⟨t⟩ should be a⟨t'⟩.
Because I see in the lecture notes (C5_W3.pdf) it is a⟨t'⟩.
Thanks for replying!
What I mean is: can we create the post-attention layer using only one line, like the bidirectional layer?
If we can't, is it because in the post-attention layer we have to use the previous step's result and write it out explicitly (since Keras doesn't offer a function that wraps these operations)?