Hi everyone,
I have completed the first four courses of this specialization, which focused on neural networks and convolutional networks. I have a clear understanding of these topics, as they seemed logical to me.
Now, I have started learning about RNNs, and I’m having trouble grasping how they predict future outcomes from the input data and how the gates work.
For example, let’s consider the sentence “The cat is under the table.” Should the system assign high importance to the words “cat” and “table”? And if the system assigns a high weight to “cat,” why would it be beneficial to ignore it when processing the word “the,” given that the forget gate will decrease the value passed on to the next cell, and so on?
I’m thinking about it in another way.
If the input is a matrix of shape (1, 10,000), then the weights should be something like (10,000, some number). By learning and finding the correlation between the target and the input, the model will update the weight values corresponding to the word “cat” with a high value, while assigning lower values to the weights of unimportant words. Then, in the next round of processing, it will effectively ignore the effect of unimportant words. Is my understanding correct?
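To make my mental model concrete, here is roughly the picture in my head (the word index and the sizes are just made-up examples):

```python
import numpy as np

vocab_size = 10000   # size of the one-hot vocabulary
n_hidden = 64        # the "some number" above -- my guess for a hidden size

x = np.zeros((1, vocab_size))   # one-hot row vector for a single word
x[0, 1234] = 1.0                # pretend index 1234 is "cat"

Wax = np.random.randn(vocab_size, n_hidden) * 0.01  # weights of shape (10000, 64)

# The product simply selects the row of Wax belonging to "cat":
h = x @ Wax
print(h.shape)   # (1, 64)
```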
I don’t have any issues with the programming and mathematical aspects of the topic. My main challenge lies in understanding how the system is able to predict results and how it handles sequences of data.
I would appreciate any explanations to help me better understand this concept.
Thank you.
I think one key thing that is missing from your description of how things work in a Sequence Model is the idea of the “hidden state” of the RNN cell. If the inputs are one-hot vectors over a vocabulary of 10,000 words, then you’re right that one dimension of the input weight matrix will be 10,000, but the key point is that those are not the only parameters the RNN learns during training. There is also the hidden state, which Prof Ng calls a. Its dimension is a hyperparameter, meaning that you as the system designer need to choose how complex the hidden state needs to be in order to achieve the goals of your system. And of course that state can have more structure than just a if you are implementing an LSTM or GRU RNN.

Conceptually, the way to think about the hidden state is that it remembers information that is important for understanding the meaning of the sentence (assuming we’re implementing something like an English to French translation system): for example, which word is the subject of the sentence, whether the subject is singular or plural, which word is the verb, and where a subordinate clause begins. Of course I’m just making up those attributes as examples of things the hidden state might learn to discern.

I’m not aware of any work on RNNs like the DLS C4 W4 lecture “What Are Deep ConvNets Learning?”, in which Prof Ng shows how to instrument the internal layers of the network to get an idea of what patterns a given neuron is detecting. Maybe someone has done a similar level of analysis on RNNs, but I haven’t seen it.
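To make the hidden state concrete, here is a minimal sketch of a single vanilla RNN timestep in NumPy, written in Prof Ng’s column-vector notation. All the sizes are just assumptions for illustration:

```python
import numpy as np

n_x = 10000   # vocabulary size (one-hot input)
n_a = 128     # hidden state size -- a hyperparameter you choose

# Parameters learned during training (randomly initialized here):
Waa = np.random.randn(n_a, n_a) * 0.01   # hidden-to-hidden weights
Wax = np.random.randn(n_a, n_x) * 0.01   # input-to-hidden weights
ba = np.zeros((n_a, 1))                  # bias

def rnn_step(a_prev, x_t):
    """One timestep: mix the previous hidden state with the new input."""
    return np.tanh(Waa @ a_prev + Wax @ x_t + ba)

a = np.zeros((n_a, 1))    # initial hidden state
x = np.zeros((n_x, 1))    # one-hot column vector for one word
x[42] = 1.0               # pretend index 42 is "cat"
a = rnn_step(a, x)        # a now carries information about "cat"
```

An LSTM or GRU adds gate parameters on top of this, but the basic idea is the same: the new state is a learned function of the old state and the current input.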
So there is still some level of “magic” in all this, but the RNN learns both the weights for the inputs and the weights that update the hidden state, plus any additional gates from the GRU or LSTM architectures. The other key idea is that there is only one “cell” that is used repeatedly to process all the inputs in any given input sequence, meaning that the parameters are shared and affect the way every input changes the state at each timestep.
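To illustrate the weight sharing, here is a hedged sketch that reuses rnn_step and the parameters from the snippet above: the very same Waa, Wax, and ba are applied at every position in the sequence.

```python
def rnn_forward(x_sequence):
    """Process a whole sequence with the *same* cell at every timestep."""
    a = np.zeros((n_a, 1))        # start from an all-zero hidden state
    hidden_states = []
    for x_t in x_sequence:        # one one-hot column vector per word
        a = rnn_step(a, x_t)      # identical parameters reused each step
        hidden_states.append(a)
    return hidden_states
```

Training updates that one shared set of parameters based on every timestep of every training sequence, which is how the cell learns what is worth carrying forward.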
Prof Ng does a much better job of explaining all this than I can hope to do, of course. It might be worth watching some of the lectures again with what I said above in mind and seeing if that sheds any additional light.