What is the typical dimensionality of c^<t> in a GRU LM?

In the “GRU (simplified)” and “Full GRU” lecture notes, when discussing the Γ (Gamma) gate, Prof. Ng says:

If you have a 100-dimensional hidden activation value, then the candidate c̃^<t> can also be 100-dimensional, say, and so c^<t> would also be the same dimension, and Γ would also be the same dimension as the other things I’m drawing in boxes. In that case, these asterisks are actually element-wise multiplication. Here, if the gate is a 100-dimensional vector, what it really is is a 100-dimensional vector of bits, with values mostly 0 and 1, that tells you, of this 100-dimensional memory cell, which are the bits you want to update.

To help me understand the size of a typical model here: for Ng’s example sentence, “The cat, which already ate …, was full”, the x^<t> values are each one-hot word vectors (so roughly 40k dims each) and the output softmax yhat^<t> values are also ~40k dims, yet the c^<t> or a^<t> values are each only 100 dims? I think we only have this one example so far, so any intuition is welcome.
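To check my own understanding of the shapes, here is a minimal NumPy sketch of one step of the simplified GRU with the dimensions I’m assuming (40k one-hot vocabulary, 100-dimensional memory cell); the weights are random placeholders and only the shapes matter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

n_x, n_a = 40_000, 100                        # assumed vocab size and hidden/memory size

# Parameters (random init, just to check dimensions)
Wc = np.random.randn(n_a, n_a + n_x) * 0.01   # candidate c~ weights
bc = np.zeros(n_a)
Wu = np.random.randn(n_a, n_a + n_x) * 0.01   # update gate Gamma_u weights
bu = np.zeros(n_a)
Wy = np.random.randn(n_x, n_a) * 0.01         # output/softmax weights
by = np.zeros(n_x)

x_t = np.zeros(n_x); x_t[1234] = 1.0          # one-hot word vector (40k dims)
c_prev = np.zeros(n_a)                        # previous memory cell (100 dims)

concat = np.concatenate([c_prev, x_t])        # (n_a + n_x,)
c_tilde = np.tanh(Wc @ concat + bc)           # (100,) candidate value
gamma_u = sigmoid(Wu @ concat + bu)           # (100,) -- same size as the memory cell
c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # element-wise "*" from the lecture
a_t = c_t                                     # in the simplified GRU, a^<t> = c^<t>
y_hat = softmax(Wy @ a_t + by)                # (40000,) distribution over the vocab

print(x_t.shape, c_t.shape, gamma_u.shape, y_hat.shape)
# (40000,) (100,) (100,) (40000,)
```

If I have this right, Γ_u, c̃^<t>, and c^<t> all live in the 100-dimensional space, and only Wy maps back up to the ~40k-dimensional softmax.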

Relatedly, what is the typical dimensionality of a^<t> or c^<t> if, instead of a word-level language model, we have a character-level one? I’m guessing much higher than 100, to make up for the tiny dictionary for each x^<t> (~100 characters vs. ~40k words).
Thanks!

Please look at the assignment for the week where you’ll implement LSTM forward and backward passes. There is no formula for the size of the internal state of an RNN cell. It’s about trying different hyperparameter values and picking one that works for your problem.

I have used two LSTM layers, with 64 units (return_sequences=True) and 128 units respectively, and an embedding dimension of 128 or 512, and got good performance. Please look at the NLP models on tfhub.dev to get a better sense of typical settings for your problem.
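In case a concrete sketch helps, here is roughly what that stack looks like in Keras; the vocabulary size and the compile settings are placeholder assumptions, not something prescribed by the course:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000   # placeholder -- depends on your tokenizer/corpus
EMBED_DIM = 128       # 128 or 512 both worked reasonably well for me

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full sequence to the next layer
    tf.keras.layers.LSTM(128),                         # keep only the final hidden state
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

The 64- and 128-unit choices are exactly the kind of hyperparameters the previous reply mentions: there is no formula, you try a few values and keep what works.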

Character-level models are rarely used in production, since it’s harder to capture the relationships among characters across all the words in the corpus, so I wouldn’t worry too much about that. Word-level and subword-level tokenizations are what’s widely used at the moment. That said, your guess is as good as mine as to how much larger the hidden/embedding size needs to be to compensate for the much smaller character vocabulary.
