I just finished Week 1 of Course 5 covering Recurrent Neural Networks. I'm wondering when, in practice, one would use a deep RNN vs. something like a single-layer BRNN. The video mentioned that more complex functions might require multiple layers, but I was hoping to see some practical examples.
Also, intuitively, what are the deeper layers learning in this case? Making an analogy to CNNs, I would expect deeper layers to learn more complex features like grammar structures, combinations of words, etc., but I have a hard time visualizing how that would be the case just from stacking multiple RNNs on top of each other.
I think a simple answer to your first question would be that, when you have high bias and increasing the number of hidden units in a single recurrent layer isn't providing good returns, you might want to try stacking multiple layers (see the sketch below).
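For a concrete picture, here's a minimal sketch of that progression, assuming TensorFlow/Keras and made-up vocabulary/layer sizes (not from the course itself):

```python
import tensorflow as tf

# Single recurrent layer: one way to grow capacity is simply more units (e.g. 64 -> 256).
shallow = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Deep RNN: if bias stays high, stack layers instead.
# return_sequences=True passes the full hidden-state sequence on to the next recurrent layer.
deep = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```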
With respect to the intuition question, when I think of an RNN layer, I sometimes think of what one of the output vectors would look like written as an explicit function of the sequence of input vectors. Over a long input sequence, you end up with something that looks like a series of nested functions -- f( f( f( f( … )))), with vectors further back in time sitting deeper in the nested structure -- but always using the same learned function, determined by the parameter matrices and the activation. So while this allows for a lot of complexity over a long sequence (some of which is lost due to vanishing/exploding gradients), it's limited by the behaviour of that single function. Imagine how much new behaviour becomes possible when you add another parameterized function into the mix.
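To make that nesting concrete, here's a toy NumPy sketch (made-up shapes, purely illustrative) of one recurrent layer applying the same function at every step, and of what stacking adds:

```python
import numpy as np

def rnn_layer(x_seq, h0, Wh, Wx, b):
    """One recurrent layer: the *same* learned function
    h_t = tanh(Wh @ h_{t-1} + Wx @ x_t + b)
    is applied at every time step, so written out it nests as f(f(f(...)))."""
    h, outputs = h0, []
    for x_t in x_seq:
        h = np.tanh(Wh @ h + Wx @ x_t + b)
        outputs.append(h)
    return outputs

# Stacking a second layer runs the first layer's outputs through a
# *different* parameterized function (say Wh2, Wx2, b2), which is where the
# extra expressive power of a deep RNN comes from.
```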
I hope this helps despite the fact that I'm still fairly new to these ideas myself.
Hey @jcleung11, it seems you've already got an answer; I just want to offer a slightly different perspective on the question.
Generally speaking, it's true for any neural network architecture (MLP, CNN, RNN, etc.) that deeper layers learn higher-level representations: stacking layers increases the model's capacity and lets it better fit patterns in the data.
The BRNN serves a different purpose: it encodes an assumption about how terms in the input sequence depend on each other, namely that each position can depend on both past and future context. For example, in speech recognition, knowing how a word ends narrows the choice of letters at its beginning, resulting in more accurate recognition.
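If it helps, a single-layer BRNN is just the recurrent layer wrapped so that each output sees both directions of the sequence. A hedged Keras sketch (hypothetical sizes, not the course's assignment code):

```python
import tensorflow as tf

# One LSTM reads left-to-right, a second reads right-to-left, and their hidden
# states are concatenated at each step, so the prediction at time t can use
# context from both ends of the input.
brnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(29, activation="softmax"),  # e.g. a per-step character prediction
])
```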