Decoder Network: Bi-directional RNN over Beam Search

Why wouldn’t we use a Bi-directional RNN in the decoder network? Why wouldn’t it work there?

Where did you get that statement? I’m not sure the statement itself makes too much sense, and I don’t recall it being mentioned in the lectures.

With that said, decoders are typically used to generate output text, which is an inherently one-directional process, so a uni-directional RNN is the natural fit rather than a bi-directional one.
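
Here’s a minimal NumPy sketch of why (the weight names and shapes are just illustrative assumptions, not anything from the course): a bi-directional RNN’s backward pass starts from the last timestep, so the whole sequence has to exist before it can produce any output at all.

```python
# A minimal sketch (plain NumPy, hypothetical weights/shapes) of why a
# bi-directional RNN needs the *entire* input sequence before it can
# produce any output: the backward pass starts from the last timestep.
import numpy as np

def bidirectional_rnn(x_seq, Wf, Wb, Wxf, Wxb):
    T = len(x_seq)
    h_fwd = [np.zeros(Wf.shape[0])] * T
    h_bwd = [np.zeros(Wb.shape[0])] * T

    # Forward pass: left to right, only needs past inputs.
    h = np.zeros(Wf.shape[0])
    for t in range(T):
        h = np.tanh(Wf @ h + Wxf @ x_seq[t])
        h_fwd[t] = h

    # Backward pass: right to left -- it cannot start until x_seq[T-1]
    # exists, which is exactly what a decoder hasn't generated yet.
    h = np.zeros(Wb.shape[0])
    for t in reversed(range(T)):
        h = np.tanh(Wb @ h + Wxb @ x_seq[t])
        h_bwd[t] = h

    # Each timestep's output combines information from the past and the future.
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```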

It wasn’t stated in the Beam Search lecture, but it was mentioned in the Week 1 intro to RNNs. Why can’t you generate output text with a Bi-directional RNN? Uni-directional doesn’t seem to make sense if we’re trying to predict a whole sentence. It seems incorrect to try to calculate the best probability of the overall sentence uni-directionally, recalculating multiple times over. What probability would we get out of the softmax of a Bi-directional RNN?

Bi-directional RNNs basically “get information from the future.” The quote comes from the Week 1 slide deck.

For example… why couldn’t we take the activation from the encoder network and feed it into both ends of a bi-directional RNN?

If you can point to the lecture video, that would be really helpful (since it’s unclear what the context around that statement is).

For text generation, where do you get information from the future? Suppose the decoder is trying to generate the 3rd word in the sentence: where does it get the 5th, “future” word (it hasn’t been generated yet!)?

Wouldn’t they all be generated together, independent of each other, based on the logic of the decoder network? I’d imagine that, theoretically, we would need much denser activations and more data for this to work?

Normally, the decoder generates one word (or “token”) at a time, rather than outputting the entire sentence all at once.

The generated words are (normally) also not independent from one another. For RNNs, we usually pass the previously generated word as input (together with the RNN hidden states) when generating the next word (at least when generating text, not necessarily when training).
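
To make that concrete, here’s a minimal sketch of that decoding loop (greedy decoding; `decoder_step` and `embed` are hypothetical stand-ins for whatever the actual model provides):

```python
# A minimal sketch (hypothetical decoder_step/embed functions) of the usual
# autoregressive decoding loop: each generated token is fed back in as the
# input for the next step, so the words are not generated independently.
import numpy as np

def greedy_decode(decoder_step, embed, h0, start_id, end_id, max_len=50):
    """decoder_step(x_emb, h) -> (probs_over_vocab, new_hidden)"""
    h = h0
    token = start_id
    output = []
    for _ in range(max_len):
        probs, h = decoder_step(embed(token), h)  # condition on previous word + hidden state
        token = int(np.argmax(probs))             # pick the most likely next word (greedy)
        if token == end_id:
            break
        output.append(token)
    return output
```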

How would this require denser activations or more data?

Well, maybe I’m confusing things a bit. I guess since this is a sequential RNN we would want one word after the other. My logic for the Bi-directional RNN was: how can we get the network to formulate the sentence all at once? That would mean converting the meaning of the whole input sentence into an English sentence in one shot. I guess that would imply a deeper network? Perhaps a denser fully connected (non-RNN) network with lots more data would achieve this.

I just got to the Attention Model video and this makes a lot more sense! Lol! My intuition seems to have been correct. I knew that the Bidirectional RNN fit into this a lot better somehow. Had that itch and just couldn’t put my finger on it. This way makes a lot more sense than the Beam Search.

Glad you were able to find the concept you were looking for with attention models. Attention models are indeed very powerful due to their ability to apply attention across all of the word pairings at the same time.
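
As a rough illustration (plain NumPy, with a simple dot-product score standing in as an assumption for whatever alignment scoring the model actually uses), this is the core of what attention does at each decoder step:

```python
# A minimal sketch of attention: for each word the decoder generates, it
# scores every encoder position, turns the scores into weights, and takes
# a weighted average of all encoder states as the context for that word.
import numpy as np

def attention_context(decoder_state, encoder_states):
    # One score per encoder timestep: how relevant is that input word
    # to the word we are about to generate? (dot-product score here)
    scores = encoder_states @ decoder_state        # shape: (T_x,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax: attention weights sum to 1
    # Context vector: a weighted average over *all* encoder states.
    context = weights @ encoder_states
    return context, weights
```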

That’s right, what you’re describing isn’t an RNN, as it is not recurrent (the “R” in RNN). RNNs are designed to generate one step at a time, using the extra information carried in the hidden state from the previous step.

It’s unclear whether a fully connected network, or any other network with lots more data, would be able to do a good job of generating the entire sentence all at once. The current state-of-the-art models require billions of parameters just to do a decent job of generating one word at a time, so it’s hard to say how many parameters would be needed to generate the entire sentence all at once (or whether it’s possible).

Just to be clear, attention models by themselves do not necessarily replace Beam Search. Beam Search is an algorithm that’s often used with Decoders to generate text output, and can be used with attention models as well.
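
For reference, here’s a minimal sketch of beam search on top of the same kind of decoder (again with hypothetical `decoder_step`/`embed` functions, and no length normalization): instead of committing to one word per step, it keeps the B most probable partial sentences and extends each of them.

```python
# A minimal sketch of beam search: keep the beam_width most probable
# partial sentences at each step, extend each with its best next words,
# then prune back down to beam_width candidates.
import numpy as np

def beam_search(decoder_step, embed, h0, start_id, end_id, beam_width=3, max_len=50):
    # Each beam entry: (log-probability of the partial sentence, tokens, hidden state)
    beams = [(0.0, [start_id], h0)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, h in beams:
            if tokens[-1] == end_id:             # finished sentences are kept as-is
                candidates.append((logp, tokens, h))
                continue
            probs, h_new = decoder_step(embed(tokens[-1]), h)
            # Extend this partial sentence with its top beam_width next words.
            for tok in np.argsort(probs)[-beam_width:]:
                candidates.append((logp + np.log(probs[tok]), tokens + [int(tok)], h_new))
        # Keep only the beam_width most probable partial sentences overall.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]  # tokens of the most probable sentence found
```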