Is c^{<t>} doubling?

There is a bit I can’t understand about the LSTM: can c^{<t>} be ×2 if \Gamma_u = 1 and \Gamma_f = 1?

A couple of points here:

Both \Gamma_u and \Gamma_f are outputs of sigmoid, so they literally can’t be 1, right? Although they could be close to it … And of course those values will be learned. Why would the values learned for the “forget gate” be similar to those of the “update gate”? I guess anything is technically possible, but what would drive things in that direction?

Also note that \tilde{c}^{<t>} is not the same thing as c^{<t-1>} or c^{<t>}, right?
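
In case it helps to see the arithmetic, here is a minimal NumPy sketch of the cell state update from the lectures, c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}. The pre-activation values and the previous cell state are made-up numbers for illustration, not learned ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up pre-activations; in the real cell these come from learned
# weights applied to [a^{<t-1>}, x^{<t>}] plus a bias.
z_u, z_f, z_c = 4.0, 4.0, 3.0

gamma_u = sigmoid(z_u)   # update gate: strictly inside (0, 1)
gamma_f = sigmoid(z_f)   # forget gate: strictly inside (0, 1)
c_tilde = np.tanh(z_c)   # candidate value: strictly inside (-1, 1)

c_prev = 0.9             # made-up previous cell state
c_next = gamma_u * c_tilde + gamma_f * c_prev

print(f"gamma_u={gamma_u:.3f}, gamma_f={gamma_f:.3f}, "
      f"c_tilde={c_tilde:.3f}, c_next={c_next:.3f}")
# The gates approach 1 but never reach it, so c_next stays
# strictly below c_tilde + c_prev.
```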

So, c^{<t>} can be around ×2, but it rarely is in practice, because \Gamma_u and \Gamma_f are regulated by the learned weights, right?

You’re right that the “gate” values are controlled by learned weights. But also note that \tilde{c}^{<t>} is the output of tanh, right? So given that its range is between -1 and 1, what does that do to your reasoning? Why are you fixated on 2? Was that discussed in the lectures somewhere?

I’m talking about these two terms. As I understand it, they can both be around 1. So, 1 + 1 = 2
[image: the cell state update formula, c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}]

Yes, that could mathematically happen, but it probably doesn’t. Did you read what I said about tanh? So what happens if \Gamma_u is close to 1 and \tilde{c}^{<t>} is close to -1?

So mathematically that result can be all over the place. It will be determined in reality by coefficients that are learned for the “forget” and “update” gates, right? So why would the “forget” and “update” gates end up learning the same result for a given sample? Yes, they each could individually be close to 1 in some cases, but why would they do that on the same sample?
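
To put a bound on “all over the place”: both gates are sigmoid outputs in (0, 1) and \tilde{c}^{<t>} is a tanh output in (-1, 1), so for any single step

|c^{<t>}| = |\Gamma_u \tilde{c}^{<t>} + \Gamma_f c^{<t-1>}| < 1 \cdot 1 + 1 \cdot |c^{<t-1>}| = |c^{<t-1>}| + 1

So the magnitude grows by less than 1 per timestep. There is nothing special about 2; it is just that single-step bound in the case where |c^{<t-1>}| happens to be close to 1.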

So I still really don’t understand your point and why you think this is a big deal. Yes, it could mathematically happen that c^{<t>} might end up being close to 2 in some cases. Or maybe the thing you can definitely say is that you can’t prove that that can’t happen. So what? Does that cause something bad to happen? It could end up being close to -1 as well. So what?

So, as I understand it, we can end up with doubled hidden state values. Is that OK?

But they are just numbers, right? The output of a linear expression, which you feed into other linear expressions or into activation functions. Eventually you feed them to softmax. So what’s the big deal? The domain of the softmax function is (-\infty, +\infty), right?
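
As a quick sanity check (the logits here are toy values, made up for illustration), softmax happily normalizes any real-valued inputs:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; valid for any real inputs.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# A "doubled" value like 2.0 is nothing special as far as softmax cares,
# and even much larger magnitudes still produce a valid distribution.
print(softmax(np.array([2.0, -1.0, 0.5])))
print(softmax(np.array([200.0, -100.0, 50.0])))
```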

So, do you mean that this hidden state will be transformed into the needed values anyway once the weights are applied, even if it is doubled?

I repeat what I just said: they are just numbers. They can be any value between -\infty and +\infty. The weights are learned so that the combination of the weights and hidden state values give good results. What values the individual states take as you go through forward propagation is basically transparent to us. All we care about is that the outputs of the functions (the various \hat{y} values) are useful to us. That’s what the training accomplishes: it learns weights that give useful results.

If you are curious, you could take one of the RNNs that we built in one of the exercises and add instrumentation to the code to analyze the hidden state values.
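
For example, here is a minimal sketch of that idea (not the exercise code: a single-unit cell state with random pre-activations standing in for the learned gate and candidate computations), just to trace where c^{<t>} actually goes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-unit recurrence: random pre-activations stand in for the
# learned W [a^{<t-1>}, x^{<t>}] + b terms, just to watch c^{<t>} over time.
c, trace = 0.0, []
for _ in range(50):
    z_u, z_f, z_c = rng.normal(size=3)
    c = sigmoid(z_u) * np.tanh(z_c) + sigmoid(z_f) * c
    trace.append(c)

print(f"c^<t> over 50 steps: min={min(trace):.3f}, max={max(trace):.3f}")
```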


But I confess that I’m still not understanding why you care about this or why you think it’s a big deal. Explain to me why you care. Now that I’ve explained it, how has that changed your view of how you will design your next RNN?

I just wanted to understand the part of this formula that seemed a bit strange to me. We don’t really have to learn the math at all and can still successfully design our new RNN :grinning:

But we understand the math at the level of the principles of forward propagation and backward propagation based on a cost function, right? We can’t predict the exact way the internal hidden states of the network will behave, but we understand the mechanisms by which they are trained.

Of course it is never guaranteed that a given design of a network of any type (FC, CNN, RNN …) will give a good solution. It depends on lots of things, including but not limited to the architecture you choose and your data.