There’s a bit I can’t understand in the LSTM: can c^{<t>} end up around 2 (doubled) if \Gamma_u = 1 and \Gamma_f = 1?
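If I transcribe the cell state update from the lecture notation (so treat the exact symbols here as my own transcription), it’s roughly:

```latex
% LSTM cell state update, course-style notation (transcribed for reference)
\tilde{c}^{<t>} = \tanh\left(W_c\,[a^{<t-1>}, x^{<t>}] + b_c\right)
\Gamma_u = \sigma\left(W_u\,[a^{<t-1>}, x^{<t>}] + b_u\right)
\Gamma_f = \sigma\left(W_f\,[a^{<t-1>}, x^{<t>}] + b_f\right)
c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}
```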
A couple of points here:
Both \Gamma_u and \Gamma_f are outputs of sigmoid, so they literally can’t be 1, right? Although they could be close to it … And of course those values will be learned. Why would the values learned for the “forget gate” be similar to those of the “update gate”? I guess anything is technically possible, but what would drive things in that direction?
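Just as a quick numerical illustration (my own little sketch, not something from the course): the sigmoid only approaches 1 asymptotically, even for fairly large pre-activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoid saturates toward 1 but never actually reaches it
for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"sigmoid({z:>4}) = {sigmoid(z):.10f}")
# sigmoid( 0.0) = 0.5000000000
# sigmoid( 2.0) = 0.8807970780
# sigmoid( 5.0) = 0.9933071491
# sigmoid(10.0) = 0.9999546021
# sigmoid(20.0) = 0.9999999979
```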
Also note that \tilde{c}^{<t>} is not the same thing as c^{<t-1>} or c^{<t>}, right?
So c^{<t>} can be around 2, but that rarely happens in practice, because \Gamma_u and \Gamma_f are regulated by the learned weights, right?
You’re right that the “gate” values are controlled by learned weights. But also note that \tilde{c}^{<t>} is the output of tanh, right? So given that its range is between -1 and 1, what does that do to your reasoning? Why are you fixated on 2? Was that discussed in the lectures somewhere?
I’m talking about the two terms in that formula: \Gamma_u * \tilde{c}^{<t>} and \Gamma_f * c^{<t-1>}. As I understand it, both can be around 1, so 1 + 1 = 2.
Yes, that could mathematically happen, but it probably doesn’t. Did you read what I said about tanh? So what happens if \Gamma_u is close to 1 and \tilde{c}^{<t>} is close to -1?
So mathematically that result can be all over the place. It will be determined in reality by coefficients that are learned for the “forget” and “update” gates, right? So why would the “forget” and “update” gates end up learning the same result for a given sample? Yes, they each could individually be close to 1 in some cases, but why would they do that on the same sample?
So I still really don’t understand your point and why you think this is a big deal. Yes, it could mathematically happen that c^{<t>} might end up being close to 2 in some cases. Or maybe the thing you can definitely say is that you can’t prove that that can’t happen. So what? Does that cause something bad to happen? It could end up being close to -1 as well. So what?
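To make the “all over the place” point concrete, here is a tiny check I wrote (just a sketch, using the course-style update c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}, and assuming for illustration that the previous cell state is itself roughly in [-1, 1]):

```python
import itertools

# Corner cases of the elementwise update c_t = gamma_u * c_tilde + gamma_f * c_prev,
# with the gates near 0 or 1, c_tilde near the ends of tanh's range, and c_prev near +/-1.
cases = itertools.product([0.0, 1.0], [0.0, 1.0], [-1.0, 1.0], [-1.0, 1.0])
for gamma_u, gamma_f, c_tilde, c_prev in cases:
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    print(f"gamma_u={gamma_u} gamma_f={gamma_f} c_tilde={c_tilde:+} c_prev={c_prev:+} -> c_t={c_t:+}")
# c_t spans roughly [-2, 2]: getting close to +2 needs both gates near 1 AND
# c_tilde and c_prev both near +1 on the same sample; the symmetric case gives ~ -2.
```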
So, as I understand it, we can get doubled hidden state values. Is that okay?
But they are just numbers, right? The output of a linear expression. That you feed into other linear expressions or into activation functions. Eventually you feed them to softmax. So what’s the big deal? The domain of the softmax function is (-\infty, +\infty), right?
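For what it’s worth, softmax doesn’t care how large (or how negative) those numbers get; a numerically stable implementation handles any real-valued inputs. A quick sketch:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shifting by the max doesn't change the result
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

# Works fine whether the inputs are small, "doubled", or wildly large/negative
print(softmax(np.array([0.5, 1.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 4.0])))       # the "doubled" version of the above
print(softmax(np.array([-300.0, 0.0, 500.0])))  # still a valid probability distribution
```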
So do you mean that the hidden state will be transformed into the needed values anyway once the weights are applied, even if it is doubled?
I repeat what I just said: they are just numbers. They can be any value between -\infty and +\infty. The weights are learned so that the combination of the weights and hidden state values give good results. What values the individual states take as you go through forward propagation is basically transparent to us. All we care about is that the outputs of the functions (the various \hat{y} values) are useful to us. That’s what the training accomplishes: it learns weights that give useful results.
If you are curious, you could take one of the RNNs that we built in one of the exercises and add instrumentation to the code to analyze the hidden state values.
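Something like this minimal numpy sketch of a single LSTM step, for example (this is not the actual assignment code; the shapes, dictionary keys, and variable names are just placeholders I made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, W, b):
    """One LSTM cell step with simple instrumentation of the cell state."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    gamma_f = sigmoid(W["f"] @ concat + b["f"])   # forget gate
    gamma_u = sigmoid(W["u"] @ concat + b["u"])   # update gate
    gamma_o = sigmoid(W["o"] @ concat + b["o"])   # output gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate cell state
    c_t = gamma_u * c_tilde + gamma_f * c_prev    # cell state update
    a_t = gamma_o * np.tanh(c_t)                  # hidden state
    # Instrumentation: see how large the cell state actually gets in practice
    print(f"c_t min={c_t.min():+.3f} max={c_t.max():+.3f} mean={c_t.mean():+.3f}")
    return a_t, c_t

# Toy run with random weights, just to exercise the instrumentation
rng = np.random.default_rng(0)
n_a, n_x, T = 8, 4, 5
W = {k: rng.normal(scale=0.5, size=(n_a, n_a + n_x)) for k in "fuoc"}
b = {k: np.zeros(n_a) for k in "fuoc"}
a, c = np.zeros(n_a), np.zeros(n_a)
for t in range(T):
    a, c = lstm_step(rng.normal(size=n_x), a, c, W, b)
```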
But I confess that I’m still not understanding why you care about this or why you think it’s a big deal. Explain to me why you care. Now that I’ve explained it, how has that changed your view of how you will design your next RNN?
I just wanted to understand the part of this formula that seemed a bit strange to me. We could skip the math entirely and still successfully design our new RNN.
But we understand the math at the level of the principles of forward propagation and backward propagation based on a cost function, right? We can’t predict the exact way the internal hidden states of the network will behave, but we understand the mechanisms by which they are trained.
Of course it is never guaranteed that a given design of a network of any type (FC, CNN, RNN …) will necessarily give a good solution. It depends on lots of things, not limited to the architecture you choose and your data.