In the “GRU (simplified)” and “Full GRU” lecture notes, when discussing the Gamma gate, Prof. Ng says:
“If you have a 100-dimensional hidden activation value, then c^t can be 100-dimensional, say, and so c̃^t would also be the same dimension, and Gamma would also be the same dimension as the other things I’m drawing in boxes. In that case, these asterisks are actually element-wise multiplication. Here, if the gate is a 100-dimensional vector, what it is, is really a 100-dimensional vector of bits, the values mostly 0 and 1, that tells you, of this 100-dimensional memory cell, which are the bits you want to update.”
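To check that I'm reading the shapes right, here is a rough NumPy sketch of one simplified-GRU step. The sizes (100 hidden units, ~40k one-hot words) are just the lecture's example numbers, and the parameter names (Wc, Wu, etc.) are my own assumptions for illustration, not necessarily the course's notation:

```python
import numpy as np

n_a = 100      # assumed hidden / memory-cell size (the "100-dimensional" in the quote)
n_x = 40_000   # assumed one-hot word vector size (~40k-word vocabulary)

# Toy parameters for one simplified-GRU step (update gate only, no relevance gate).
Wc = np.random.randn(n_a, n_a + n_x) * 0.01
bc = np.zeros(n_a)
Wu = np.random.randn(n_a, n_a + n_x) * 0.01
bu = np.zeros(n_a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t):
    """One simplified-GRU step: candidate, update gate, element-wise blend."""
    concat = np.concatenate([c_prev, x_t])           # shape (n_a + n_x,)
    c_tilde = np.tanh(Wc @ concat + bc)              # candidate memory, shape (n_a,)
    gamma_u = sigmoid(Wu @ concat + bu)              # update gate, shape (n_a,), values in (0, 1)
    # The asterisks in the notes are these element-wise products over the 100 dims:
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t

c_prev = np.zeros(n_a)
x_t = np.zeros(n_x)
x_t[1234] = 1.0                                      # a one-hot word vector
print(gru_step_simplified(c_prev, x_t).shape)        # (100,)
```

So each of c̃^t, Gamma, and c^t comes out 100-dimensional, while x^t stays ~40k-dimensional. Is that the right picture?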
To help me understand the size of a typical model here: for Ng’s example sentence, “The cat, which already ate …, was full”, the x^t values are each one-hot word vectors (so roughly 40k dims each) and the output softmax yhat^t values are also ~40k dims, yet the c^t or a^t values are each only 100 dims? I think we only have this one example so far, so any intuition is welcome.
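Here is my mental model of how a 100-dim a^t (or c^t) can still produce a ~40k-dim softmax output; again just a sketch with names (Wy, by) and sizes I am assuming, not something stated in the notes:

```python
import numpy as np

n_a, n_vocab = 100, 40_000          # assumed: 100 hidden units, ~40k-word vocabulary

Wy = np.random.randn(n_vocab, n_a) * 0.01   # hidden state -> vocabulary logits
by = np.zeros(n_vocab)

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

a_t = np.random.randn(n_a)           # 100-dim hidden activation a^t (= c^t in the GRU case)
y_hat_t = softmax(Wy @ a_t + by)     # ~40k-dim distribution over the next word
print(a_t.shape, y_hat_t.shape)      # (100,) (40000,)
```

If that is right, then the 100-dim state is just projected back up to vocabulary size by the output matrix, and the hidden size is a free design choice independent of the vocabulary size.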
Related: what is the typical dimensionality of a^t or c^t if, instead of a word-level language model, we have a character-level language model? I am guessing it is much higher than 100, to make up for the tiny vocabulary for each x^t (~100 characters vs. ~40k words).
Thanks!