In the “GRU (simplified)” and “Full GRU” lecture notes, when discussing the Gamma gate, Prof. Ng says:
“If you have a 100-dimensional hidden activation value, then c^t can be 100-dimensional, say, and so c̃^t would also be the same dimension, and Gamma would also be the same dimension as the other things I’m drawing in boxes. In that case, these asterisks are actually element-wise multiplication. Here, if the gate is a 100-dimensional vector, what it is, is really a 100-dimensional vector of bits, the values mostly 0 and 1, that tells you, of this 100-dimensional memory cell, which are the bits you want to update.”
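To check that I'm reading the shapes right, here is a rough NumPy sketch of one simplified-GRU step. The sizes (100 hidden units, ~40k one-hot words) are just the lecture's example numbers, and the parameter names (Wc, Wu, etc.) are my own assumptions for illustration, not necessarily the course's notation:

```python
import numpy as np

n_a = 100      # assumed hidden / memory-cell size (the "100-dimensional" in the quote)
n_x = 40_000   # assumed one-hot word vector size (~40k-word vocabulary)

# Toy parameters for one simplified-GRU step (update gate only, no relevance gate).
Wc = np.random.randn(n_a, n_a + n_x) * 0.01
bc = np.zeros(n_a)
Wu = np.random.randn(n_a, n_a + n_x) * 0.01
bu = np.zeros(n_a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t):
    """One simplified-GRU step: candidate, update gate, element-wise blend."""
    concat = np.concatenate([c_prev, x_t])           # shape (n_a + n_x,)
    c_tilde = np.tanh(Wc @ concat + bc)              # candidate memory, shape (n_a,)
    gamma_u = sigmoid(Wu @ concat + bu)              # update gate, shape (n_a,), values in (0, 1)
    # The asterisks in the notes are these element-wise products over the 100 dims:
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t

c_prev = np.zeros(n_a)
x_t = np.zeros(n_x)
x_t[1234] = 1.0                                      # a one-hot word vector
print(gru_step_simplified(c_prev, x_t).shape)        # (100,)
```

So each of c̃^t, Gamma, and c^t comes out 100-dimensional, while x^t stays ~40k-dimensional. Is that the right picture?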
To help me understand the size of a typical model here: for Ng’s example sentence, “The cat, which already ate …, was full”, the x^t values are each one-hot word vectors (so roughly 40k dims each) and the output softmax yhat^t values are also ~40k dims, yet the c^t or a^t values are each only 100 dims? I think we only have this one example so far, so any intuition is welcome.
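Here is my mental model of how a 100-dim a^t (or c^t) can still produce a ~40k-dim softmax output; again just a sketch with names (Wy, by) and sizes I am assuming, not something stated in the notes:

```python
import numpy as np

n_a, n_vocab = 100, 40_000          # assumed: 100 hidden units, ~40k-word vocabulary

Wy = np.random.randn(n_vocab, n_a) * 0.01   # hidden state -> vocabulary logits
by = np.zeros(n_vocab)

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

a_t = np.random.randn(n_a)           # 100-dim hidden activation a^t (= c^t in the GRU case)
y_hat_t = softmax(Wy @ a_t + by)     # ~40k-dim distribution over the next word
print(a_t.shape, y_hat_t.shape)      # (100,) (40000,)
```

If that is right, then the 100-dim state is just projected back up to vocabulary size by the output matrix, and the hidden size is a free design choice independent of the vocabulary size.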
Related: what is the typical dimensionality of a^t or c^t if, instead of a word-level language model, we have a character-level language model? I am guessing it is much higher than 100, to make up for the tiny vocabulary for each x^t (~100 characters vs. ~40k words).
Thanks!