Assume the input sentence is “I love mango juice” for the bidirectional RNN. At time step 1, the RNN in the forward direction receives “I” as the input while the RNN in the backward direction receives “juice” as the input. Is the output of time step 1 the combined output (e.g. concatenated output) of the forward and backward RNNs?
At time step 2, the RNN in the forward direction receives “love” as the input while the RNN in the backward direction receives “mango” as the input. Is the output of time step 2 also the combined output (e.g. concatenated output) of the forward and backward RNNs?
Thanks in advance.
Output at timestep 1 is the concat of “I” from both the forward and backward passes. See this link to learn about the merge_mode parameter.
Hope this clears up any doubts about why the entire sequence must be known for this bidirectional layer to be effective.
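For concreteness, here is a minimal Keras sketch (the shapes and layer sizes are made up purely for illustration) showing the Bidirectional wrapper with merge_mode="concat":

```python
import tensorflow as tf

# Minimal sketch: wrap an LSTM in Bidirectional. With merge_mode="concat"
# (the default), the output at every timestep is the concatenation of the
# forward and backward states, so the per-timestep feature size doubles
# (2 * 16 = 32 here).
inputs = tf.keras.Input(shape=(4, 8))  # 4 timesteps, 8 features per token
outputs = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(16, return_sequences=True),
    merge_mode="concat",
)(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 4, 32)
```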
Why is the output at timestep 1 the concat of “I” from both the forward and backward passes? I thought the backward RNN received “juice” as the input at time step 1.
A bidirectional RNN aims to learn about a token from both directions (sketched below):
- Forward direction: starting from the start of the sentence and ending at the current token.
- Backward direction: starting from the end of the sentence and ending at the current token.
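Here is a toy sketch (plain Python, no framework) of which tokens each direction has already read when it produces its state for a given timestep:

```python
# Illustrative only: show the context each direction has "seen" at timestep t.
tokens = ["I", "love", "mango", "juice"]

for t in range(len(tokens)):
    forward_context = tokens[: t + 1]     # start of sentence ... current token
    backward_context = tokens[t:][::-1]   # end of sentence ... current token
    print(f"t={t + 1}: forward saw {forward_context}, backward saw {backward_context}")
```

At time step 1 the forward state has only read “I”, while the backward state has already read “juice”, “mango”, “love” and finally “I”, so the combined output for “I” depends on the entire sentence.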
I don’t quite get your explanation. Could you explain in terms of what happens at time step 1, time step 2, etc.?
Thanks.
Why don’t you go through these links and see if they answer your question?
- Read Bi-Directional Recurrent Neural Network section of the medium article.
- Keras source code
Other mentors have been following this conversation. I don’t personally know the answer right at the moment. The reason is that we never actually build a bidirectional network by hand, in the way of the “Step by Step” exercise in Week 1. When we get around to using them, we just use the TF/Keras Bidirectional module, which is a “black box” and our choices are to read either the documentation that Balaji already gave us or (gulp) the source code he also linked for us.
But my next step is to watch the lecture that Prof Ng gives specifically on this topic. That will be our best hope of getting to a better level of understanding. I watched it back in 2019 when I first took this course, but the memory is not current any more. In my written notes, it just shows the diagram, but doesn’t have enough detail. I’m assuming you’ve watched the lecture already, but it might be worth watching again. My personal life is quite busy in the next 6 to 8 hours and then it’ll be bedtime where I am. So it’s not likely I will have time to watch the lecture in the next 18 hours or so.
But there may well be other folks who see this and have more information. Maybe we get lucky and one or more of them will chime in before I get time to watch the lecture again.
@hungng777 Ha. I don’t want to oversimplify things, and I’m maybe not the smartest one here to help.
But to go all the way back to your original question, the answer is simply: ‘Yes’.
Where I think you might be getting (understandably) confused:
Your data point (X) order never changes, nor do your time steps (T); i.e., they are always sequential (X^{<1>}, X^{<2>}, etc., and your timesteps T^{<1>}, T^{<2>}, etc.). Yet in your reverse traversal, it is only your network that runs in reverse order; i.e., you feed it X^{<4>}, X^{<3>}, etc.
Like in “I love mango juice”, it gets too confusing to think of ‘juice’ as now being X^{<1>}. Don’t do that. It is still X^{<4>}; you’ve only changed the order in which the data is fed to the network.
Hope that helps.
*Also, obviously, you have to complete the entire back and forth path before you can concatenate.
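If it helps, here is a rough NumPy sketch of that idea for a plain (vanilla) RNN cell. The names and sizes are made up; the point is that the backward cell simply visits the same x^{<t>} values in reverse order, and concatenation only happens once both passes are finished:

```python
import numpy as np

T, n_x, n_a = 4, 8, 16                             # 4 tokens, toy sizes
x = [np.random.randn(n_x, 1) for _ in range(T)]    # x<1> ... x<4>, order never changes

# Separate (made-up) weights for the forward and backward cells
Waf, Wxf = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
Wab, Wxb = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)

# Forward pass: visit t = 1 ... T
a_fwd, a = [None] * T, np.zeros((n_a, 1))
for t in range(T):
    a = np.tanh(Waf @ a + Wxf @ x[t])
    a_fwd[t] = a

# Backward pass: visit t = T ... 1, but store each state at its original index
a_bwd, a = [None] * T, np.zeros((n_a, 1))
for t in reversed(range(T)):
    a = np.tanh(Wab @ a + Wxb @ x[t])
    a_bwd[t] = a

# Only after both passes are complete can we concatenate per timestep
outputs = [np.concatenate([a_fwd[t], a_bwd[t]], axis=0) for t in range(T)]
print(outputs[0].shape)   # (32, 1): forward and backward states for x<1>
```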
Thank you for the explanation. I just want to know more about bidirectional RNN and its variants (e.g. bidirectional LSTM) since they are the basic building blocks of more advanced constructs (e.g. attention model).
I just watched this lecture again. Sorry it took me a few days. I think Anthony’s answer covers the fundamental question, but if you look at this screenshot from slightly later in the video:
It gives a more complete picture of how the \hat{y}^{<n>} values are computed. You can see that the W_y weight matrices are used to process both the forward and backward states. At each timestep, you have two separate “cell state” values: one for the forward direction and one for the reverse direction. These are separate from each other. What will be contained in those states is determined by whether you select a plain RNN architecture, a GRU or an LSTM architecture. But with whichever architecture you have chosen, you do the “forward prop” in both time directions. Then you’d do back prop to learn the weight values based on the comparison of the \hat{y}^{<n>} and y^{<n>} values at all the timesteps. Of course that will also drive updates to the other weight and bias values that are used to compute the updated cell states in both directions. Those weights will be separate between the directions, because the states are distinct.
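To tie that back to the notation above (writing t for the timestep index, \overrightarrow{a}^{<t>} and \overleftarrow{a}^{<t>} for the forward and backward states, and g for whatever output activation is used), the prediction at each timestep combines both states through W_y. This is the formula as I remember it from the lecture slide, so treat it as a paraphrase rather than a quote:

\hat{y}^{<t>} = g\left(W_y \left[\overrightarrow{a}^{<t>}; \overleftarrow{a}^{<t>}\right] + b_y\right)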