It is quite difficult to understand how the same matrix W_a can be used to process x^{t} and a^{t-1} at every time step to generate the embedding for a^{t}, especially given the potentially large variations in the values of x^{t} and a^{t-1} across the words and sentences seen during training. My confusion may stem from not fully grasping what information a^{t} encapsulates. Andrew gave an example in which the size of a^{t} is 100. Is that dimensionality sufficient to capture the intricate relationships between words in different sentences? I would greatly appreciate it if you could help clarify how exactly W_a manages to capture such extensive dynamics in word relationships across various sentences.

There are several different RNN architectures used in this specialization. Some use a single weight matrix, some use a new weight matrix for each step.

Can you identify which one you are asking about specifically?

The week number and assignment number would be most helpful.

Apologies for the missing information. I was referring to Week 1, Recurrent Neural Network Model, the slide at 13:58.

a(t) would contain the relationships from time t, including influences from all the previous time steps.

So a(0) is only the initial activations.

a(1) includes both a(0) and the influence of the input at time step 1.

a(2) includes a(0), a(1), and the influence of the input at time step 2.

This leads to this set of equations:
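The equations referenced here (the image did not come through in this thread) appear to be the standard RNN forward-propagation equations from that lecture, which in LaTeX form read:

```latex
\begin{align}
a^{\langle t \rangle} &= g\!\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right) \\
\hat{y}^{\langle t \rangle} &= g\!\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)
\end{align}
```

The key point for the original question: W_{aa}, W_{ax}, and W_{ya} carry no time index. The same three matrices are applied at every step t; only a and x change from step to step.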

Note that Waa, Wax, and Wya may be quite large.
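To make the weight sharing concrete, here is a minimal NumPy sketch (not the course's starter code) of the forward pass. The sizes are illustrative: n_a = 100 matches Andrew's example, while n_x, n_y, and T are made-up values for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 100, 50, 10, 4   # hidden size, input size, output size, steps

# One set of weights, created ONCE -- not one per time step.
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Wya = rng.standard_normal((n_y, n_a)) * 0.01
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                              # a<0>: initial activations
xs = [rng.standard_normal((n_x, 1)) for _ in range(T)]  # dummy inputs x<1>..x<T>

for t in range(T):
    # The SAME Waa and Wax are applied at every step; only a and x<t> change.
    # Because a feeds back in, a<t> accumulates influence from all earlier steps.
    a = np.tanh(Waa @ a + Wax @ xs[t] + ba)
    y = Wya @ a + by                                # softmax omitted for brevity
```

This is why a(2) "includes" a(0) and a(1): each new activation is computed from the previous one, so earlier information propagates forward through the single shared recurrence.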