I have some fundamental questions

I have watched and re-watched the week 1 videos, but I am still confused about how RNNs are built. I understand that we pass one word to the network at each time step, but is this network something that has many layers with one node each? If so, how is that effective? If not, is it more like a standard NN with many layers and many nodes in each layer? How does that solve the variable input length problem?

[image: RNN diagram from the lecture, with one box highlighted]

What is inside the highlighted box in this image?

In the most basic case, it looks like this:

TL;DR: it is a tanh activation applied to an ordinary linear transformation WX + b; the difference is that X is the concatenation of the input x^{<t>} and the activation from the previous time step, a^{<t-1>}.
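
If it helps, here is a minimal numpy sketch of that single cell update (the sizes and variable names are my own, chosen only for illustration). It also checks that the "split" form W_aa a^{<t-1>} + W_ax x^{<t>} equals the concatenated form described above:

```python
import numpy as np

n_a, n_x = 4, 6                          # hidden size and input size, chosen arbitrarily
rng = np.random.default_rng(0)

W_aa = rng.standard_normal((n_a, n_a))   # recurrent weights
W_ax = rng.standard_normal((n_a, n_x))   # input weights
b_a  = np.zeros((n_a, 1))

a_prev = rng.standard_normal((n_a, 1))   # a^{<t-1>}
x_t    = rng.standard_normal((n_x, 1))   # x^{<t>}

# Cell update written with separate weight matrices
a_t_split = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

# Same update written with one matrix acting on the concatenation
W_a = np.hstack([W_aa, W_ax])            # [W_aa | W_ax], shape (n_a, n_a + n_x)
ax  = np.vstack([a_prev, x_t])           # concatenation of a^{<t-1>} and x^{<t>}
a_t_concat = np.tanh(W_a @ ax + b_a)

assert np.allclose(a_t_split, a_t_concat)
```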

Could you please explain why W_aa is a (100, 100) matrix in Andrew's example? I understand that the x^{<t>} are one-hot encoded vectors, so they're (dict_size, 1), hence W_ax is (no_of_examples, dict_size), but what about a^{<t>}? Is it not a scalar? In that case, wouldn't W_aa and W_ya be scalars, i.e. of shape (no_of_examples, 1)?

Because

The size of a^{<t>} is a hyperparameter of the model; the choice of (100, 1) is arbitrary.

W_aa must be (100, 100) to yield (100, 1) when multiplied by a^{<t-1>}, which is (100, 1).

If x^{<t>} has 10000 features, i.e. it has shape (10000, 1), then W_ax must be (100, 10000) so that multiplying it with x^{<t>} yields (100, 1).
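
To make the shapes concrete, here is a quick numpy sketch using the sizes from Andrew's example (100 hidden units, a 10,000-word vocabulary); the zero values are placeholders, only the shapes matter:

```python
import numpy as np

n_a, n_x = 100, 10000               # hidden size (hyperparameter) and vocabulary size

W_aa = np.zeros((n_a, n_a))         # (100, 100)
W_ax = np.zeros((n_a, n_x))         # (100, 10000)
b_a  = np.zeros((n_a, 1))

a_prev = np.zeros((n_a, 1))         # a^{<t-1>} is a (100, 1) vector, not a scalar
x_t    = np.zeros((n_x, 1))         # one-hot x^{<t>}, (10000, 1)

a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
print(a_t.shape)                    # (100, 1)
```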