Hello!
Just finished week 4 material.
I am confused by the use of a bidirectional LSTM in the model for text generation. When generating text, the model always looks forward, using the previous words (say, words 0, 1, …, t-1) to predict the next one (word t). The words after that (t+1, t+2, …) do not yet exist, so they cannot help with predicting word t. How, then, can a bidirectional model be of any use for making predictions?
Am I missing something here?
Let’s say you input 5 words to your text generation model. An LSTM layer looks at these words in the order 0 -> t-1 to generate the output at the next timestep, t. A bidirectional LSTM looks at the inputs from both directions, i.e. 0 -> t-1 and t-1 -> 0, to generate the output at the next timestep, t.
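To make that concrete, here is a minimal Keras sketch (with made-up sizes, not the course notebook's) comparing the two layer types. With default settings, the Bidirectional wrapper runs a second LSTM over the reversed input and concatenates the two outputs, which is why the output width doubles:

import tensorflow as tf

inputs = tf.random.normal((1, 5, 16))  # (batch, timesteps, features) -- made-up sizes

# Unidirectional LSTM: reads the 5 words left to right only.
uni = tf.keras.layers.LSTM(8)
print(uni(inputs).shape)  # (1, 8)

# Bidirectional LSTM: a forward LSTM reads 0 -> t-1, a backward LSTM reads
# t-1 -> 0, and their outputs are concatenated, doubling the output width.
bi = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8))
print(bi(inputs).shape)   # (1, 16)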
Thank you, Balaji, for the quick response.
I understand that, but the whole point of using bidirectional cells is to allow context from what comes later in the sequence. When predicting word t, if you don’t have words t+1, t+2, …, I guess you just use vectors of 0s for them, which defeats the purpose of using a bidirectional layer.
Sure. Bidirectional LSTM would be far more effective than vanilla LSTM when it comes to predicting a word, say, in the middle of a sentence.
However, for end-of-sequence prediction (i.e. text generation), a vanilla LSTM is a good choice over a bidirectional LSTM, since it is faster to train and has fewer parameters than a bidirectional LSTM layer, while reaching the same level of accuracy, as you pointed out.
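For a rough sense of the size difference, here is a small sketch (hypothetical layer sizes) that just compares parameter counts; the Bidirectional wrapper holds a forward and a backward LSTM, so it ends up with roughly twice the weights:

import tensorflow as tf

x = tf.random.normal((1, 5, 16))  # made-up sizes: 5 timesteps, 16 features

uni = tf.keras.layers.LSTM(64)
bi = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
uni(x)  # call the layers once so their weights get built
bi(x)

print(uni.count_params())  # 20736
print(bi.count_params())   # 41472 -- roughly double, one LSTM per direction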
One detail. Bidirectional layers are meant to carry information over longer input sequences than their unidirectional variants. You can read about it here. It’s worth experimenting to see if training with a bidirectional layer is worth the extra parameters.
Yeah, that’s what I thought.
It would be different if you had more than one layer, e.g.:
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
tf.keras.layers.LSTM(100)
Because then the later words in the input string would help work out the meaning of earlier words in the second LSTM. Then the second LSTM would pass this information forward to the last cell.
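For illustration, a stacked setup along those lines could look like the sketch below (vocab_size, embedding_dim and max_len are placeholders, not values from the lab):

import tensorflow as tf

# Placeholder sizes, not taken from the lab.
vocab_size, embedding_dim, max_len = 5000, 64, 20

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Returns the whole sequence, so each position's output already mixes
    # left-to-right and right-to-left context.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    # Reads that enriched sequence forward and carries it to the last step.
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])
model.summary()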
I don’t really think C3_W4_Lab_1.ipynb should be using bidirectional with only one LSTM layer.
There’s one detail to note here.
Bidirectional layers lead to more stable gradient propagation across longer sentences since the sentence is processed from both directions.
So, there’s nothing wrong with using them in place of unidirectional LSTMs.
See this as well.
Sure. My point is that if you have one bidirectional LSTM and you are only predicting the last token, then only one cell of the backward LSTM gets used.
Please don’t mix up the unravelled (unrolled) view of an RNN with its actual representation.
The unrolled view is meant for understanding purposes; the same RNN cell and weights are used across all timesteps (see BPTT).
Since the same cell is reused at every timestep, it helps with better learning, especially across longer sequences.
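As a tiny illustration of that weight sharing (made-up sizes): the same LSTM weights are applied at every timestep, so the parameter count does not change with the sequence length.

import tensorflow as tf

layer = tf.keras.layers.LSTM(32)

short_seq = tf.random.normal((1, 5, 16))   # 5 timesteps, 16 features
long_seq = tf.random.normal((1, 500, 16))  # 500 timesteps, same features

layer(short_seq)
print(layer.count_params())  # 6272
layer(long_seq)              # the same cell/weights are reused at all 500 steps
print(layer.count_params())  # still 6272 -- no extra weights per timestep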