Hey @Patrick_Ng,

Let me try to give my two cents. I will share some insights about one of these gates; the same reasoning extends to the others. Let's pick the forget gate and consider its equation:

\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)

I agree with your statement that, after learning, W_f is a fixed set of parameters shared across all the time-steps. But if we look closely, it's a **matrix**, and \Gamma_f, which is used in the equations along with the other gates to produce the output, does not depend on W_f alone: it depends on the matrix product of W_f with the concatenation of the input x^{<t>} and the previous hidden state a^{<t-1>}. So, even though W_f (*the matrix*) is a fixed set of parameters, that matrix product varies from example to example, thereby adjusting the output \Gamma_f to suit every example, be it one from the training set or one from the test set (*as long as it is similar to the training examples*).

In fact, this is similar to how any typical neural network functions. The weights learnt by the network are constant once training is complete; it is when different inputs are forward-propagated through these "fixed" weights that the network produces different outputs. In this analogy, W_f is the fixed set of weights (*after learning has completed*), a^{<t-1>} and x^{<t>} are the varying inputs, and \Gamma_f is the varying output (*adjusted accordingly*).
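To make this concrete, here is a minimal NumPy sketch of the idea (the shapes, seed, and values are hypothetical, not taken from the course code): W_f stays fixed, yet two different inputs produce two different forget-gate vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, n_x = 4, 3                               # hidden size and input size (arbitrary)
W_f = rng.standard_normal((n_a, n_a + n_x))   # "fixed" weights after training
b_f = np.zeros(n_a)

def forget_gate(a_prev, x_t):
    concat = np.concatenate([a_prev, x_t])    # [a^{<t-1>}, x^{<t>}]
    return sigmoid(W_f @ concat + b_f)        # Gamma_f: a vector with entries in (0, 1)

# Same W_f, two different (a_prev, x_t) pairs -> two different gate vectors
g1 = forget_gate(np.ones(n_a), np.ones(n_x))
g2 = forget_gate(-np.ones(n_a), np.zeros(n_x))
# g1 and g2 differ, even though W_f never changed
```

So the "learned, then frozen" weights are not a limitation: the gate output is recomputed from the current input at every time-step.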

Additionally, one other aspect worth appreciating is that \Gamma_f is a vector, not a scalar. This means we can extend the size of the gates to hold as much information as we would like, assuming an ideal scenario of unlimited computation.

Let us know if this helps, and we can discuss further.

Cheers,

Elemento