I am trying to get my head around the GRU and LSTM concepts.
I understand that trainable weights are used to compute the gate Gamma, which controls how much memory is carried from one time-step to the next. In the cat example, the memory is expected to capture the grammatical number of the cats (singular vs. plural).
What if there are other things that need to be memorized?
Would these be captured in the gates of other neurons in the layer?
Can one unit capture one thing only?
Let’s consider the example of GRU here, and then you can extend the description below to LSTM pretty easily.
Also, I am assuming that you are well aware that the memory cell holds a vector quantity (for instance, 32, 64, or 128 units) and not just a single unit. In other words, if the memory cell has, say, 64 dimensions, then it can store 64 different values, and assuming each unit memorizes one human-interpretable feature, the memory cell can memorize 64 different features simultaneously.
Moreover, a single unit in the memory cell doesn’t necessarily memorize only a single feature. For instance, it could club 2-3 different features (each individually interpretable by humans) into a single combined feature, and then that one unit learns the combined feature instead.
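To make the "vector-valued memory" point concrete, here is a minimal TensorFlow sketch (the batch size, sequence length, and feature dimension are arbitrary choices of mine for illustration): a GRU layer with units=64 carries a 64-dimensional state vector per example.

```python
import tensorflow as tf

# A GRU layer whose hidden state (the "memory") is a 64-dimensional vector.
gru = tf.keras.layers.GRU(units=64, return_state=True)
x = tf.random.normal([8, 20, 50])   # (batch, time-steps, features)
output, state = gru(x)
print(state.shape)   # (8, 64): 64 memory values per example, not just one
```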
Let me know if this helps.
Thanks. For clarification: when you say memory cell, you mean a layer, and the length of that vector is actually the number of units. Correct?
We refer to RNN/GRU/LSTM as a layer, so I don’t know whether it would be correct to refer to something inside that (in our case, the memory cell) as a layer too. Nonetheless, here you can find the Tensorflow documentation for the LSTM layer. In it, units determines the dimensions of the memory cell as well as the dimensions of the output. A simple piece of code can help you validate that.
import tensorflow as tf
import tensorflow.keras.layers as tfl
lstm = tfl.LSTM(units=45)
inputs = tf.random.normal([32, 10, 90])   # (batch, time-steps, features)
a0 = tf.zeros([32, 45])   # initial hidden state: one 45-dim vector per example
c0 = tf.zeros([32, 45])   # initial cell state: same dimensionality
output = lstm(inputs, initial_state=[a0, c0])   # output shape: (32, 45)
For this code, try changing the shape of c0 to anything but 45, which is the value of units, and this will give an error. I hope this helps.
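To see that mismatch for yourself, here is a small sketch (using tf.zeros for the initial states, which is my own choice for illustration) where c0 is deliberately given the wrong size:

```python
import tensorflow as tf
import tensorflow.keras.layers as tfl

lstm = tfl.LSTM(units=45)
inputs = tf.random.normal([32, 10, 90])
a0 = tf.zeros([32, 45])   # matches units = 45
c0 = tf.zeros([32, 44])   # deliberately wrong: should also be 45
try:
    lstm(inputs, initial_state=[a0, c0])
    mismatch_rejected = False
except Exception:
    mismatch_rejected = True
print(mismatch_rejected)   # True: a state that doesn't match units is rejected
```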
Right. So the LSTM or GRU layers don’t have units the same way a simple RNN layer does. The units are actually the dimensions of a single memory cell.
This statement would be incorrect. If you check out the docs of the SimpleRNN layer, you will find that it also has an argument units, which denotes the dimensionality of the output space, just as it does for the GRU and LSTM layers. The only difference is that in the case of GRU and LSTM layers, units also denotes the dimensionality of the memory cell.
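A quick sketch to back this up (the input shapes here are arbitrary choices of mine): SimpleRNN and LSTM with the same units produce outputs of the same dimensionality.

```python
import tensorflow as tf

x = tf.random.normal([4, 7, 16])   # (batch, time-steps, features)
rnn = tf.keras.layers.SimpleRNN(units=32)
lstm = tf.keras.layers.LSTM(units=32)
rnn_out = rnn(x)
lstm_out = lstm(x)
# units determines the output dimensionality for both layer types.
print(rnn_out.shape)    # (4, 32)
print(lstm_out.shape)   # (4, 32)
```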
A Dense layer also has an argument “units”. Would that be different from an RNN layer’s?
You can easily find the answers to this in the docs of Tensorflow itself. In fact, that will also help you gain confidence in finding answers by yourself. And if you want, you can easily write simple pieces of code like I wrote above to back your understanding further. I hope this helps.
My question is about the concept and not about the programming
But you are referring to the argument “units”, if I am not wrong. And if you are clear on the theoretical difference between a dense layer and a RNN layer (which I am assuming you are, since Prof Andrew explained it quite clearly in the dedicated videos), then all you need to know is how Tensorflow uses the “units” argument in both layers.
I was referring to units from a theoretical perspective. Are units/neurons in a dense layer the same as units in RNN? What I mean would you consider the units in RNN neurons just like in a Dense layer (ANN)?
First of all, the number of “units” is nothing but the number of “neurons”, i.e., a single unit is a single neuron. So I don’t know what you are trying to refer to when you say “units in RNN neurons”. But assuming you are using the terms interchangeably, let me give my 2 cents.
One way to think about it: the “units”, i.e., the number of neurons, in both a RNN layer and a Dense layer decide the dimensionality of the output, and in that sense I would consider them the same.
But at the same time, the computation a single neuron performs to produce its output is quite different in a Dense layer versus a RNN layer (as I am sure you are well aware), and in that sense I wouldn’t consider them the same.
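One way to see the computational difference is through the parameter counts; here is a sketch (with an input feature size of 16 chosen by me for illustration) showing that a SimpleRNN neuron carries extra recurrent weights that a Dense neuron doesn’t.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(units=32)
rnn = tf.keras.layers.SimpleRNN(units=32)

# Build both layers by calling them on inputs with 16 features each.
dense(tf.zeros([1, 16]))
rnn(tf.zeros([1, 7, 16]))

# Dense: kernel (16 x 32) + bias (32) = 544 parameters.
print(dense.count_params())   # 544
# SimpleRNN adds a recurrent kernel: (16 x 32) + (32 x 32) + bias (32) = 1568.
print(rnn.count_params())     # 1568
```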
So, I have presented both the perspectives that came to my mind, and now you can choose whichever you like. I hope this helps.
I meant units in the RNN layer. And of course the calculations are different. Anyway, I think I got my answer. Thanks.
So could we say that the ‘units’ hyperparameter in LSTM() sets the dimension of C_t and H_t below?
Prior to seeing your replies, I thought ‘units’ determined the number of time-steps in the y_hat predictions.
And continuing from here: how do we design the model architecture in Tensorflow Keras to output y_hat with a desired number of time-steps? For instance, if we want to make a sentiment classifier model, we would like to output y_hat for only 1 time-step (Tx = len_sentences, Ty = 1). In another application, if we want to make a neural machine translation model, we would like to output Tx time-steps of y_hat (Tx = Ty).
In a RNN/LSTM/GRU layer, the argument “units” determines the dimension of the hidden activations h. In the case of LSTM, it also determines the dimension of the memory cell state c. This argument has nothing to do with the time-steps. I hope we are clear up to this point.
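Here is a small sketch confirming this (the 10 time-steps and 16 features are arbitrary values I picked): both h and c have dimension units, regardless of how many time-steps the input has.

```python
import tensorflow as tf

lstm = tf.keras.layers.LSTM(units=32, return_state=True)
x = tf.random.normal([4, 10, 16])   # 10 time-steps, 16 features per step
output, h, c = lstm(x)
print(h.shape)   # (4, 32): matches units, not the 10 time-steps
print(c.shape)   # (4, 32): the memory cell state has the same dimension
```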
Now, the number of time-steps in the y_hat predictions is determined either by the number of time-steps in the input or by the argument “return_sequences”. Consider the SimpleRNN layer of Tensorflow, which you can find here.
- In the case of Sentiment Classification (Ty = 1), when you want only a single output corresponding to all the time-steps collectively, you can set return_sequences = False.
- In the case of Named Entity Recognition or Language Modelling (Ty = Tx), when you want an output corresponding to every time-step, you can set return_sequences = True.
- And finally, in the case of applications like Neural Machine Translation (Ty != Tx), you can use an Encoder-Decoder based architecture.
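The first two cases can be sketched directly (the shapes are illustrative choices of mine; the Encoder-Decoder case needs a full model, so it is left out here):

```python
import tensorflow as tf

x = tf.random.normal([4, 10, 16])   # Tx = 10 time-steps

# Ty = 1: one output for the whole sequence (e.g. sentiment classification).
rnn_last = tf.keras.layers.SimpleRNN(units=32, return_sequences=False)
y_last = rnn_last(x)
print(y_last.shape)   # (4, 32)

# Ty = Tx: one output per time-step (e.g. named entity recognition).
rnn_all = tf.keras.layers.SimpleRNN(units=32, return_sequences=True)
y_all = rnn_all(x)
print(y_all.shape)    # (4, 10, 32)
```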
I hope this helps.