In the C3_W2 assignment, there is an API
tl.GRU(n_units), and we want to repeat this GRU for
n_layers (default 2). It seems that there are 2 GRU units connected in series, but the
max_length of the sentence is 64, which is much bigger than 2. In my understanding, the GRU network takes one token (here, a character) at each GRU unit. Does this mean that the network can only take the first 2 characters as input and ignores the remaining 62 characters? How can all 64 characters fit in this GRU network?
Thanks for the prompt reply. I think I am getting it now. Since all the GRU cells in a layer share the same weights, any number (in this case
max_length) of GRU cells can be connected in series to form a GRU layer. At the end of the day, we are not increasing the trainable parameters, because they all share the same weights and biases. In the vertical direction, multiple GRU layers (a so-called deep GRU) can be stacked (default
n_layers is 2), and that does increase the trainable parameters.
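To make the "same weights at every timestep" idea concrete, here is a minimal NumPy sketch (not the actual Trax implementation) of a GRU layer unrolled over all 64 timesteps, with a second stacked layer on top. The shapes (d_model = 32, n_units = 16) are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(d_in, n_units, rng):
    """One weight set per LAYER: input matrix, recurrent matrix, bias per gate."""
    m = lambda r, c: 0.1 * rng.standard_normal((r, c))
    return {g: (m(d_in, n_units), m(n_units, n_units), np.zeros(n_units))
            for g in ("update", "reset", "candidate")}

def gru_step(x, h, p):
    """One timestep; the SAME params p are reused at every step."""
    Wz, Uz, bz = p["update"]
    Wr, Ur, br = p["reset"]
    Wh, Uh, bh = p["candidate"]
    z = sigmoid(x @ Wz + h @ Uz + bz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)            # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh + bh) # candidate state
    return (1.0 - z) * h + z * h_cand

def run_gru_layer(xs, p, n_units):
    h = np.zeros(n_units)
    outs = []
    for x in xs:               # one step per character: 64 steps, not 2
        h = gru_step(x, h, p)
        outs.append(h)
    return np.stack(outs)

rng = np.random.default_rng(0)
max_length, d_model, n_units = 64, 32, 16
xs = rng.standard_normal((max_length, d_model))  # 64 embedded characters

layer1 = init_gru_params(d_model, n_units, rng)  # weights of layer 1
layer2 = init_gru_params(n_units, n_units, rng)  # separate weights of layer 2
out = run_gru_layer(run_gru_layer(xs, layer1, n_units), layer2, n_units)
print(out.shape)  # (64, 16): all 64 characters are processed
```

So the "2" is the number of stacked layers (each with its own weight set), while the 64 timesteps all flow through the same cell within a layer.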
Just a little detail. Each GRU layer, including each layer within a stack, has its own weights and biases. Since we apply the same GRU cell across all timesteps of a layer, you can say that the cell's weights are shared across timesteps.
The number of trainable parameters will change as you change the units of the GRU layer, but it will not change based on the number of timesteps of data that is input to the GRU layer.
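You can see this directly from a parameter-count formula. This is a rough sketch assuming the common single-bias GRU formulation (some implementations, e.g. cuDNN-style GRUs, use two bias vectors per gate, so the exact count in a given library may differ):

```python
def gru_param_count(d_in, n_units):
    # 3 gates (update, reset, candidate), each with an input-to-hidden matrix
    # (d_in x n_units), a hidden-to-hidden matrix (n_units x n_units), and a
    # bias vector (n_units) -- single-bias convention assumed here
    return 3 * (d_in * n_units + n_units * n_units + n_units)

# Changing n_units changes the count...
print(gru_param_count(128, 256))  # 295680
print(gru_param_count(128, 512))  # 984576
# ...but max_length never appears in the formula, so feeding 2 or 64
# characters leaves the parameter count unchanged.
```

Note that d_in and n_units are the only inputs; the sequence length is nowhere in the formula.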
The diagram in the link given to you has time along the x-axis and the stacked cells along the y-axis.