Model architecture: Embedding dimension size and GRU number of cells

Hi, I’m trying to understand GRUs better. Why does the number of units in the GRU layer equal the embedding size? My understanding is that the number of units (i.e., the number of cells) in the GRU layer should equal the maximum length of a sentence, since each cell processes a single token in a tweet (or a single character in a word). The embedding dimension, however, is completely independent of this. Could someone explain why the GRU layer’s number of cells shouldn’t be fixed to the maximum input sequence length instead?


Hi, I just stumbled on this very question. My guess: your understanding is correct in that the cell has to be run for every token fed to it, up to max_len. The “number of units in the GRU layer” is a bit of a misnomer and only refers to the dimension of the vectors it works with (IMO Trax uses the term “layer” too loosely, probably to simplify things).

It’s a shame that there doesn’t seem to be any life in this forum, particularly mentors and such explaining and enriching issues.


This is a great question @trollybaz, I will try to give my reasoning -

Think of it this way: if you set the number of GRU cells to the maximum input sequence length, your model becomes sensitive to the text you are processing. What you really want is to use the same model architecture on different pieces of text (from different datasets [padding won’t be effective here]) for effective comparison.

Another way to look at this is that the number of GRU cells should essentially be a representation of your feature space. Since the features here are the word embeddings, we use that.

I really hope more people contribute here so we can understand this better.

~ Ani

Hey @map9, @trollybaz and @Anivader,
This query is mainly discussing the following point which is mentioned in the ungraded lecture notebook 4.

The hidden state dimension should be specified as n_units and should match the number of elements in the word embedding --by design in Trax.

Now, first allow me to clear up a few things that I believe are making this discussion stray from its path. There are 2 things we are talking about: first, the dimension of the hidden state, i.e., h_t, and second, the number of cells in a GRU layer.

Let’s begin with the easy thing to discuss, the number of cells in a GRU layer. It can be set in 2 different manners.

  1. If you are training the GRU network sample by sample, i.e., batch-size = 1, the number of cells is equal to the number of tokens (words) in the sample.
  2. If the batch-size > 1, we can pad/trim all the samples so that they have the same number of tokens. In this case, the number of cells is equal to the pre-determined length set for the samples in all the batches. This isn’t necessarily the length of the longest sentence in the dataset, since we can trim the samples as well.

The essential thing to note here is that all the GRU cells in a GRU layer use the same weights, so the number of cells can vary with the number of tokens, with 1 GRU cell per token. Also, you will notice that when computing the loss, and hence the gradients, the number of cells is taken into account by averaging, so you don’t need to worry about that.
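To make the weight sharing concrete, here is a minimal pure-Python sketch of a textbook GRU cell applied to sequences of two different lengths. It is not the Trax implementation: the helper names (make_gru_params, gru_step, run_gru), the toy sizes and the gate convention are all assumptions chosen for illustration.

```python
import math
import random

def make_gru_params(d_in, d_hidden, seed=0):
    """Create ONE shared set of GRU weights (hypothetical helper)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    # One (W, U, b) triple per gate: update z, reset r, candidate c.
    return {g: (mat(d_hidden, d_in), mat(d_hidden, d_hidden), [0.0] * d_hidden)
            for g in ("z", "r", "c")}

def affine(gate, x, h):
    """Compute W @ x + U @ h + b on plain Python lists."""
    W, U, b = gate
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + b_i
            for w_row, u_row, b_i in zip(W, U, b)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(params, x, h):
    """Apply the cell to one token embedding x and the previous state h."""
    z = [sigmoid(v) for v in affine(params["z"], x, h)]
    r = [sigmoid(v) for v in affine(params["r"], x, h)]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    c = [math.tanh(v) for v in affine(params["c"], x, reset_h)]
    # Assumed convention: z close to 1 favours the new candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def run_gru(params, sequence, d_hidden):
    """Reuse the same weights at every time step, however many there are."""
    h = [0.0] * d_hidden
    for x in sequence:
        h = gru_step(params, x, h)
    return h

d_in, d_hidden = 5, 3
params = make_gru_params(d_in, d_hidden)
rng = random.Random(1)
short_seq = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(4)]
long_seq = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(11)]

h_short = run_gru(params, short_seq, d_hidden)
h_long = run_gru(params, long_seq, d_hidden)
print(len(h_short), len(h_long))  # 3 3
```

The same params object handles 4 tokens and 11 tokens; only the number of loop iterations changes, which is what "1 GRU cell per token" amounts to in practice.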

Now, let’s talk about the dimensions of the hidden state. In Trax’s code-base for the GRU layer, which you can find here, we will find the following code:

if input_signature[1].shape[-1] != self._n_units:
  raise ValueError(
      f'Second argument in input signature should have a final dimension of'
      f' {self._n_units}; instead got {input_signature[1].shape[-1]}.')

This reflects the statement that is mentioned in the ungraded lab, which I also mentioned towards the beginning of my answer. Now, first, let’s note some key points

  1. The hidden state dimension doesn’t have any relation with the number of GRU cells in a single layer.
  2. The hidden state dimension also doesn’t have any relation with the number of elements in the word embeddings (word embeddings dimensionality).

The second point is the more important one for us right now. Had there been some relation, it would have been mentioned in the innumerable discussions and explanations of the GRU layer. Other frameworks like TensorFlow and PyTorch don’t have any such restriction either. In simple words, each word embedding can have a dimensionality of, say, 500, while the hidden state has a dimensionality of 100.
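As a rough check of that independence, the parameter count of a textbook GRU can be written in terms of the two sizes separately. The function below is a hypothetical helper, and it assumes one bias vector per gate; real frameworks differ slightly in how they count biases.

```python
def gru_param_count(d_in, d_hidden):
    """Weights of a textbook GRU: 3 gates, each with W, U and one bias."""
    per_gate = d_hidden * d_in + d_hidden * d_hidden + d_hidden
    return 3 * per_gate

# Embedding size 500 with hidden size 100 is a perfectly valid shape.
print(gru_param_count(500, 100))  # 180300
```

Nothing in the formula forces d_in and d_hidden to match, and the sequence length does not appear at all.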

But for some reason, Trax imposes a restriction. If the word embeddings have a dimensionality of 500, then we must set the hidden state to also have a dimensionality of 500, and nothing else. I searched through the documentation, Stack Overflow, the GitHub issues on the documentation, and the Gitter community of Trax, but nowhere could I find the reason behind this approach.

There are some references revolving around this, but the queries were all either too erroneous or left unanswered. So why Trax imposes this restriction is a mystery to me as well.

Let me tag some other mentors as well, perhaps they would know something about this.

Hey @arvyzukai, @reinoudbosch, @paulinpaloalto, can you please look into this query once, and provide us with your thoughts on this?


Hey @Elemento

First thing: RNNs are never about sequence length; that is, their dimensionality does not directly depend on the sequence length (tokens/words/chars, whatever). RNNs (be it vanilla RNN, GRU, LSTM or other) depend only on the last (feature) dimension, which in this (C3_W2) case is the Embedding dimension. And, of course, they do not change dimensionality while processing sequences. The opposite is a common misconception, so this is very important to understand.

The second part is a better question - why trax does not let us specify the hidden state size (which could be different from the output input size as you know from other DL frameworks). Frankly, I don’t know the answer, but I think they left it to us to customize these layers ourselves.

For example, if I want my first GRU to output 1024-dimensional vectors at every step, I could rewrite the GRU layer like this (note the Dense layer after tl.Select, which transforms the output to n_out):

from trax import layers as tl

def my_GRU(n_units, n_out, mode='train'):
    """GRU running on axis 1, followed by a Dense projection to n_out."""
    zero_state = tl.MakeZeroState(depth_multiplier=1)  # pylint: disable=no-value-for-parameter
    return tl.Serial(
        tl.Branch([], zero_state),
        tl.Scan(tl.GRUCell(n_units=n_units), axis=1, mode=mode),
        tl.Select([0], n_in=2),  # Drop RNN state.
        tl.Dense(n_out),  # Project each step's output from n_units to n_out.
        # Set the name to my_GRU and don't print sublayers.
        name=f'my_GRU_{n_out}', sublayers_to_print=[]
    )

And put it in my_GRULM. I don’t want to print all the code and give away the solution to other learners, but you could change the original solution by replacing a single line (line 19), the list comprehension of GRU layers, with 3 lines:

      my_GRU(d_model, 1024),

and if you print the model, you get:


If you train this model for a while (like 800 steps) you can get some decent results (compared to the pre-trained model of course, and not with current giant models).

Or I could just add a Dense layer wherever I want to transform the output:


Of course, the first Dense layer is not really needed, because the Embedding dimension could simply have been 1024; it is only there to illustrate the case where you want a hidden state size of 1024.


Hey @arvyzukai,
Thanks a lot for the detailed answer. I just have a few doubts, and it would help me a lot if you can clear them up as well.

Doubt 1

When you stated this, are you referring to how frameworks implement the different RNN models, or are you saying that it is the only way to do it? When I made my statement, I had custom implementations in mind, like when we code from scratch just to learn (apologies for not stating that before). In that case, I believe we can easily vary the number of RNN cells during training (be it vanilla RNN, GRU or LSTM), since we have shared weights and gradients averaged over all the cells. This also covers how we can use RNNs for evaluation: for instance, we can do sentiment analysis on sentences of different lengths without any padding or trimming. And as I stated before, we can only do this when batch-size = 1 for both training and evaluation. Otherwise, we would definitely have to pad/trim all the sentences to a uniform length.

What are your thoughts on this? :thinking:

Doubt 2

Now, I wrote this post before solving the Week 2 Assignment, and after solving it, I am a little curious about the prediction process, primarily how the predict function works. If you agree with my previous point, do you believe that we can rewrite this function? For instance, we start with an empty string, and instead of padding it with, say, 32 zeros, we take only a single GRU cell and take its prediction as the 0th character. Then we right-shift, take 2 GRU cells, and take the prediction from the second GRU cell as the 1st character, and so on. With this approach, we don’t have to iterate a fixed number of times (32, in this case) to produce a character at every time-step: the first character needs just 1 time-step of computation, the second character just 2 time-steps, and so on. This would reduce the computation considerably.

Of course, if my previous statement is wrong, then I believe there is no point in discussing this.

Doubt 3

In this statement, did you mean the input size (i.e., the dimensionality of the embeddings) rather than the output size; is it a typo? Because the output size is the same as the hidden state size in other frameworks too, if I am not wrong. For instance, here, you can see that TensorFlow produces an error if we try to set the dimensionality of the hidden state different from the output.


Hey @Elemento

First, let me group all flavors of Recurrent neural networks under the same name - RNN. What I’m talking about in this post applies to all of them.

Second thing to note: the confusion results from different words being used for the same thing across different ML communities. One of the best examples of what I mean is RNN “layer” vs. RNN “cell”, RNN “unit” and RNN “hidden_size” (or “feature size”, “d_model” etc.).
So to make things clear:
(“units” by TensorFlow) === (“n_units” by Trax) === (“hidden_size” by PyTorch) mean the same thing: the output size of the RNN. I personally think the PyTorch term is the least confusing. For me, the best name would have been output_size or step_dim_out.

Now, the important part, where the confusion arises most of the time: RNN “cell” and RNN “layer” are almost the same thing, except that RNN “cells” operate one dimension lower and cannot take in 3D input, so you have to take care of that yourself (with for loops or whatever). In addition, PyTorch at least lets you pass num_layers to an RNN “layer”, which is more convenient for multi-layer RNNs.

I personally find concrete calculations most informative, so I usually do that. Here is an example of what I mean:

and for 3D Tensor:

The main point is that if you provide the same weights and juggle the input dimensions, you receive the same result regardless of whether you work with an RNN “cell” or an RNN “layer”. A common misconception is that an RNN “layer” somehow increases the dimensionality of the output compared to an RNN “cell”.
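A minimal pure-Python sketch of that idea, using a vanilla tanh RNN with feature size 2 and hidden size 4. This is an illustrative stand-in, not code from any framework: rnn_cell and rnn_layer are hypothetical names, and the "layer" here is simply the cell scanned over the time axis of a 3D nested list.

```python
import math
import random

rng = random.Random(0)
FEATURES, HIDDEN = 2, 4

# One shared set of weights, used by both the "cell" and the "layer".
W = [[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(HIDDEN)]
U = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
b = [0.0] * HIDDEN

def rnn_cell(x, h):
    """One step: x has FEATURES values, h and the result have HIDDEN values."""
    return [math.tanh(sum(w * xi for w, xi in zip(W[j], x))
                      + sum(u * hi for u, hi in zip(U[j], h)) + b[j])
            for j in range(HIDDEN)]

def rnn_layer(batch):
    """Scan rnn_cell over axis 1 of a (batch, time, FEATURES) nested list."""
    outputs = []
    for sequence in batch:
        h = [0.0] * HIDDEN
        steps = []
        for x in sequence:
            h = rnn_cell(x, h)
            steps.append(h)
        outputs.append(steps)
    return outputs

# A 3D input: 2 sequences, 3 time steps each, 2 features per step.
batch = [[[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(3)]
         for _ in range(2)]
layer_out = rnn_layer(batch)

# Stepping the cell by hand over the first sequence.
h = [0.0] * HIDDEN
for x in batch[0]:
    h = rnn_cell(x, h)

print(len(layer_out), len(layer_out[0]), len(layer_out[0][0]))  # 2 3 4
print(h == layer_out[0][-1])  # True
```

Only the last axis changes (2 becomes 4); the batch and time axes pass through untouched, and the layer adds nothing beyond the looping.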

So regarding your Doubt 1:

I meant that RNNs (“Cells” and “Layers”) operate on the last dimension (in the example, 2 changes to 4). For sequences, RNNs are fed inputs of shape (batch_size or None x sequence_length x feature_size). Sequence length has no influence on the sizes of the RNN weights. I could imagine a toy version of using RNNCell where the feature size is a scalar (for example, a word represented by a scalar, or a stock price represented by a scalar) and the input is (batch, words) or (batch, prices); then yes, you would have to force each sequence of words or prices to be the same length, but I think this use case is more of an exception. I think this is what you meant by custom implementations. By the way, I can imagine this kind of use in some super complicated models (something like a tree-like structure where each branch operates on a single scalar and these RNNCells branch out or something), but that would be another extreme of exceptional use.

Regarding your Doubt 2:

The way it is implemented in the Assignment I guess is for simplicity and not efficiency. You could take the weights and rewrite the function (and model) for efficiency (changing RNN would not suffice because you need the embedding of the last somewhat randomly sampled input).
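One possible shape of such a rewrite, as a toy sketch only: carry the recurrent state forward and, at each step, embed just the most recently generated token. This is not the assignment's predict function. It uses a tiny vanilla RNN with random weights, greedy argmax instead of sampling (so the comparison is deterministic), and hypothetical names throughout.

```python
import math
import random

rng = random.Random(0)
VOCAB, D_MODEL = 6, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(VOCAB, D_MODEL)  # token id -> embedding vector
W = rand_matrix(D_MODEL, D_MODEL)        # input-to-hidden weights
U = rand_matrix(D_MODEL, D_MODEL)        # hidden-to-hidden weights
W_out = rand_matrix(VOCAB, D_MODEL)      # hidden state -> logits

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def step(token, h):
    """Embed one token, advance the state, return (logits, new state)."""
    x = embedding[token]
    h = [math.tanh(a + c) for a, c in zip(matvec(W, x), matvec(U, h))]
    return matvec(W_out, h), h

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def generate_cached(start_token, n_tokens):
    """One recurrent step per generated token; the state is carried forward."""
    tokens, h = [start_token], [0.0] * D_MODEL
    for _ in range(n_tokens):
        logits, h = step(tokens[-1], h)
        tokens.append(argmax(logits))
    return tokens

def generate_recompute(start_token, n_tokens):
    """Re-run the whole prefix from a zero state for every generated token."""
    tokens = [start_token]
    for _ in range(n_tokens):
        h = [0.0] * D_MODEL
        for t in tokens:
            logits, h = step(t, h)
        tokens.append(argmax(logits))
    return tokens

print(generate_cached(0, 8) == generate_recompute(0, 8))  # True
```

Both functions produce the same tokens, but the cached version does n recurrent steps for n tokens while the recompute version does 1 + 2 + ... + n.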

Regarding your Doubt 3:

You are correct, I made some typos :slight_smile: including “input size” instead of “output size”. I’m sure I made some typos here too, please correct me if you see some :slight_smile:

Happy New Year.


Hey @arvyzukai,
Thanks a lot for the detailed answer. You have completely cleared my doubts 2 and 3, and regarding the first one, I believe both of us are correct from our own perspectives. You were talking about weight sizes and their dependence on the sequence length, and I was talking about the possible ways we can implement an RNN for sequences of different lengths. Frankly, the way I am referring to can be done, but in practice it is far less efficient, so there’s not much use in discussing it. Thanks a lot once again.



Yes, I have to admit I was too quick to assume that we were talking about multi-feature (and not single-feature) sequence inputs for RNNs, which is how they are mostly used today (and in the context of this NLP course and the title of this thread :slight_smile: ).

I guess the main takeaway is that there’s unfortunate terminology in ML that is often a source of confusion: for me, the “number of RNN cells in an RNN layer” does not make much sense in the context of DL framework terminology (class names). And I guess I should have addressed this point first. Thanks for making me realize that :slight_smile: .

