Decoder Network: Bi-directional RNN over Beam Search

Why wouldn’t we use a Bi-directional RNN in the decoder network? Why wouldn’t it work there?

Where did you get that statement? I’m not sure the statement itself makes too much sense, and I don’t recall it being mentioned in the lectures.

With that said, decoders are typically used to generate output text, which is an inherently one-directional process, so a uni-directional RNN is the natural fit rather than a bi-directional one.
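
Here’s a minimal NumPy sketch of why (the weight names and shapes are just illustrative assumptions, not anything from the course): a bi-directional RNN’s backward pass starts from the last timestep, so the whole sequence has to exist before it can produce any output at all.

```python
# A minimal sketch (plain NumPy, hypothetical weights/shapes) of why a
# bi-directional RNN needs the *entire* input sequence before it can
# produce any output: the backward pass starts from the last timestep.
import numpy as np

def bidirectional_rnn(x_seq, Wf, Wb, Wxf, Wxb):
    T = len(x_seq)
    h_fwd = [np.zeros(Wf.shape[0])] * T
    h_bwd = [np.zeros(Wb.shape[0])] * T

    # Forward pass: left to right, only needs past inputs.
    h = np.zeros(Wf.shape[0])
    for t in range(T):
        h = np.tanh(Wf @ h + Wxf @ x_seq[t])
        h_fwd[t] = h

    # Backward pass: right to left -- it cannot start until x_seq[T-1]
    # exists, which is exactly what a decoder hasn't generated yet.
    h = np.zeros(Wb.shape[0])
    for t in reversed(range(T)):
        h = np.tanh(Wb @ h + Wxb @ x_seq[t])
        h_bwd[t] = h

    # Each timestep's output combines information from the past and the future.
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```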

It wasn’t stated in the Beam Search lecture, but it was mentioned in the Week 1 intro to RNNs. Why can’t you generate output text with a Bi-directional RNN? Uni-directional doesn’t seem to make sense if we’re trying to predict a whole sentence. It seems incorrect to try to calculate the best probability of the overall sentence uni-directionally, recalculating multiple times over. What probability would we get out of the softmax of a Bi-directional RNN?

Bi-directional RNNs basically “get information from the future.” The quote comes from the Week 1 slide deck.

For example… why couldn’t we take the activation from the encoder network and feed it into both ends of a bi-directional RNN?

If you can point to the lecture video, that would be really helpful (since it’s unclear what the context around that statement is).

For text generation, where do you get information from the future? Suppose the decoder is trying to generate the 3rd word in the sentence: where does it get the 5th, “future” word (it hasn’t been generated yet!)?

Wouldn’t they all be generated together, independent of each other, based on the logic of the decoder network? I’d imagine that, theoretically, we would need much denser activations and more data for this to work?

Normally, the decoder generates one word (or “token”) at a time, rather than outputting the entire sentence all at once.

The generated words are (normally) also not independent from one another. For RNNs, we usually pass the previously generated word as input (together with the RNN hidden states) when generating the next word (at least when generating text, not necessarily when training).
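
To make that concrete, here’s a minimal sketch of that decoding loop (greedy decoding; `decoder_step` and `embed` are hypothetical stand-ins for whatever the actual model provides):

```python
# A minimal sketch (hypothetical decoder_step/embed functions) of the usual
# autoregressive decoding loop: each generated token is fed back in as the
# input for the next step, so the words are not generated independently.
import numpy as np

def greedy_decode(decoder_step, embed, h0, start_id, end_id, max_len=50):
    """decoder_step(x_emb, h) -> (probs_over_vocab, new_hidden)"""
    h = h0
    token = start_id
    output = []
    for _ in range(max_len):
        probs, h = decoder_step(embed(token), h)  # condition on previous word + hidden state
        token = int(np.argmax(probs))             # pick the most likely next word (greedy)
        if token == end_id:
            break
        output.append(token)
    return output
```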

How would this require denser activations or more data?

Well, maybe I’m confusing things a bit. I guess since this is a sequential RNN we would want one word after the other. My logic for the Bi-directional RNN was: how can we get the network to formulate the sentence all at once? That would mean converting the meaning of the whole input sentence into an English sentence in one shot. I guess that would imply a deeper network? Perhaps a denser fully connected (non-RNN) network with lots more data would achieve this.

I just got to the Attention Model video and this makes a lot more sense! Lol! My intuition seems to have been correct. I knew that the Bidirectional RNN fit into this a lot better somehow. Had that itch and just couldn’t put my finger on it. This way makes a lot more sense than the Beam Search.

Glad you were able to find the concept you were looking for with attention models. Attention models are indeed very powerful due to their ability to apply attention across all of the word pairings at the same time.
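
As a rough illustration (plain NumPy, with a simple dot-product score standing in as an assumption for whatever alignment scoring the model actually uses), this is the core of what attention does at each decoder step:

```python
# A minimal sketch of attention: for each word the decoder generates, it
# scores every encoder position, turns the scores into weights, and takes
# a weighted average of all encoder states as the context for that word.
import numpy as np

def attention_context(decoder_state, encoder_states):
    # One score per encoder timestep: how relevant is that input word
    # to the word we are about to generate? (dot-product score here)
    scores = encoder_states @ decoder_state        # shape: (T_x,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax: attention weights sum to 1
    # Context vector: a weighted average over *all* encoder states.
    context = weights @ encoder_states
    return context, weights
```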

That’s right, what you’re describing isn’t an RNN, as it is not recurrent (the “R” in RNN). RNNs are designed to generate one step at a time, using the extra information carried in the hidden state from the previous step.

It’s unclear whether a fully connected network, or any other network with lots more data, would be able to do a good job of generating the entire sentence all at once. The current state-of-the-art models require billions of parameters just to do a decent job of generating one word at a time, so it’s hard to say how many parameters would be needed to generate the entire sentence all at once (or whether it’s possible).

Just to be clear, attention models by themselves do not necessarily replace Beam Search. Beam Search is an algorithm that’s often used with Decoders to generate text output, and can be used with attention models as well.
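
For reference, here’s a minimal sketch of beam search on top of the same kind of decoder (again with hypothetical `decoder_step`/`embed` functions, and no length normalization): instead of committing to one word per step, it keeps the B most probable partial sentences and extends each of them.

```python
# A minimal sketch of beam search: keep the beam_width most probable
# partial sentences at each step, extend each with its best next words,
# then prune back down to beam_width candidates.
import numpy as np

def beam_search(decoder_step, embed, h0, start_id, end_id, beam_width=3, max_len=50):
    # Each beam entry: (log-probability of the partial sentence, tokens, hidden state)
    beams = [(0.0, [start_id], h0)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, h in beams:
            if tokens[-1] == end_id:             # finished sentences are kept as-is
                candidates.append((logp, tokens, h))
                continue
            probs, h_new = decoder_step(embed(tokens[-1]), h)
            # Extend this partial sentence with its top beam_width next words.
            for tok in np.argsort(probs)[-beam_width:]:
                candidates.append((logp + np.log(probs[tok]), tokens + [int(tok)], h_new))
        # Keep only the beam_width most probable partial sentences overall.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]  # tokens of the most probable sentence found
```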