Model architecture: Embedding dimension size and GRU number of cells

Hey @Elemento

First, let me group all flavors of recurrent neural networks (vanilla RNN, GRU, LSTM) under the same name - RNN. What I’m talking about in this post applies to all of them.

Second thing to note: a lot of the confusion comes from different ML communities using different words for the same thing. One of the best examples of what I mean is RNN “layer” vs. RNN “cell”, RNN “unit” and RNN “hidden_size” (or “feature size”, “d_model”, etc.).
So to make things clear:
(“units” in TensorFlow) === (“n_units” in Trax) === (“hidden_size” in PyTorch) - they all mean the same thing: the output size of the RNN. I personally think the PyTorch term is the least confusing. For me, the best name would have been output_size or step_dim_out.
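For example, all three of these construct a recurrent layer whose per-step output has size 4 (the TensorFlow and Trax lines are shown as comments; the sizes are just illustrative):

```python
# TensorFlow: tf.keras.layers.SimpleRNN(units=4)
# Trax:       trax.layers.GRU(n_units=4)
import torch.nn as nn

rnn = nn.RNN(input_size=2, hidden_size=4)  # hidden_size == per-step output size
```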

Now, the important part where the confusion arises most of the time - an RNN “cell” and an RNN “layer” are almost the same thing, except that RNN “cells” operate one dimension lower and cannot take 3D input, so you have to handle the time dimension yourself (with a for loop or whatever). In addition, PyTorch at least lets you pass num_layers to the RNN “layer”, which is a more convenient way to build multi-layer RNNs, as in the sketch below.
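(a minimal PyTorch sketch; the sizes are just for illustration):

```python
import torch.nn as nn

# Two stacked RNN layers in one object; with RNNCell you would have to
# wire the layers and the time loop together yourself.
stacked = nn.RNN(input_size=2, hidden_size=4, num_layers=2, batch_first=True)
```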

I personally find concrete calculations most informative, so I usually do that. Here is an example of what I mean:
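Something along these lines (a minimal PyTorch sketch; the seed and sizes are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# An RNN "cell" processes a single time step: (batch, feature_size) -> (batch, hidden_size)
cell = nn.RNNCell(input_size=2, hidden_size=4)

x_t = torch.randn(3, 2)            # batch of 3, feature size 2 - note: 2D input, no time dimension
h_t = cell(x_t)                    # hidden state defaults to zeros
print(x_t.shape, "->", h_t.shape)  # torch.Size([3, 2]) -> torch.Size([3, 4])
```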


and for a 3D tensor:
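(continuing the same sketch, copying the cell’s weights into the layer so both compute the same function):

```python
rnn = nn.RNN(input_size=2, hidden_size=4, batch_first=True)

with torch.no_grad():
    rnn.weight_ih_l0.copy_(cell.weight_ih)
    rnn.weight_hh_l0.copy_(cell.weight_hh)
    rnn.bias_ih_l0.copy_(cell.bias_ih)
    rnn.bias_hh_l0.copy_(cell.bias_hh)

x = torch.randn(3, 5, 2)                  # (batch, sequence_length, feature_size) - 3D input
out_layer, h_n = rnn(x)                   # out_layer: (3, 5, 4)

# The "layer" is just the "cell" looped over the time dimension:
h = torch.zeros(3, 4)
outputs = []
for t in range(x.shape[1]):
    h = cell(x[:, t, :], h)               # feed one 2D slice per step
    outputs.append(h)
out_cell = torch.stack(outputs, dim=1)    # (3, 5, 4)

print(torch.allclose(out_layer, out_cell, atol=1e-6))  # True
```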

The main point is that if you provide the same weights and juggle the input dimensions, you get the same result regardless of whether you work with an RNN “cell” or an RNN “layer”. A common misconception is that an RNN “layer” somehow increases the dimensionality of the output compared to an RNN “cell” - it does not.

So regarding your Doubt 1:

I meant that RNNs (“cells” and “layers”) operate on the last dimension (in the example above, 2 changes to 4). For sequences, RNNs are fed inputs of shape (batch_size or None, sequence_length, feature_size). Sequence length has no influence on the sizes of the RNN weights. I could imagine a toy use of RNNCell where the feature size is a scalar (for example, a word represented by a scalar, or a stock price represented by a scalar) and the input is (batch, words) or (batch, prices) - then yes, you would have to force each sequence of words or prices to the same length, but I think this use case is more of an exception (see the sketch below). I think this is what you meant by custom implementations. By the way, I can imagine this kind of use in some super complicated models (something like a tree-like structure where each branch operates on a single scalar and these RNNCells branch out), but that would be another extreme of exceptional use.
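Roughly like this (a toy PyTorch sketch; the shapes and the name prices are just illustrative):

```python
import torch
import torch.nn as nn

prices = torch.randn(8, 30)                    # (batch, sequence_length), one scalar per step
cell = nn.RNNCell(input_size=1, hidden_size=4)

h = torch.zeros(8, 4)                          # initial hidden state
for t in range(prices.shape[1]):               # you drive the time loop yourself
    h = cell(prices[:, t].unsqueeze(-1), h)    # (8,) -> (8, 1) -> (8, 4)
```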

Regarding your Doubt 2:

The way it is implemented in the assignment is, I guess, for simplicity rather than efficiency. You could take the weights and rewrite the function (and the model) for efficiency (changing the RNN alone would not suffice, because you need the embedding of the last, somewhat randomly sampled, input).

Regarding your Doubt 3:

You are correct, I made some typos :slight_smile: including “input size” instead of “output size”. I’m sure I’ve made some typos here too, please correct me if you spot any :slight_smile:

Cheers,
Happy New Year.
