Sequence Models Week 1 Quiz

Hi there! Is there anyone who can fully explain the logic of this question? I did not fully understand.

Hello,

The gated recurrent unit (GRU) is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem.

The c̃^t (candidate) equation in the image is the activation function applied to a parameter matrix times the activation from the previous time step and the current input, plus a bias: c̃^t = tanh(W_c[c^(t-1), x^t] + b_c).

The GRU unit is going to have a new variable called c, which stands for the memory cell. What the memory cell does is provide a bit of memory.

So at time t the memory cell will have some value c^t, and the GRU unit will actually output an activation value a^t that is equal to c^t.

So the equation mentioned in the image governs the computation of the GRU unit.

The gamma_u here acts as a gate for this memory cell sequence. Think of this gate value gamma_u as being always either 0 or 1.

Although in practice, gamma_u is computed with a sigmoid function applied to this quantity. Remember what the sigmoid function looks like: its value is always between 0 and 1, and for most of the possible range of the input, the sigmoid is either very, very close to 0 or very, very close to 1.

For intuition, think of Gamma as being either 0 or 1 most of the time.

The job of the gate, that is gamma_u, is to decide when to update this value.

So basically, gamma_u lets you decide when to update the memory cell, based on whether it is 0 or 1. When gamma_u is equal to 0, it is telling the memory cell not to update and to remember the initial value.

If gamma_u is equal to zero, it is just setting c^t equal to the old value, even as you scan through the sequence.
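Here is a minimal NumPy sketch of the simplified GRU step from the lecture, just to make the equations concrete (the function and weight names Wc, Wu, etc. are my own, not from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t, Wc, bc, Wu, bu):
    """One step of the simplified GRU (no relevance gate gamma_r)."""
    concat = np.concatenate([c_prev, x_t])              # [c^(t-1), x^t]
    c_tilde = np.tanh(Wc @ concat + bc)                 # candidate memory c~^t
    gamma_u = sigmoid(Wu @ concat + bu)                 # update gate, between 0 and 1
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # c^t
    return c_t                                          # a^t = c^t in this simplified view
```

When gamma_u is (close to) 0, the last line gives c_t ≈ c_prev, so the old value is simply copied forward; when gamma_u is (close to) 1, c_t is replaced by the candidate c_tilde.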

So in this question, when Alice proposes to simplify the GRU by removing gamma_u, i.e. setting gamma_u equal to 0 for a timestep, the gradient back-propagates through that timestep with decay, as the cell has been told not to update and will remember its original value, i.e. maintain the initial value.

Whereas Betty proposes to keep gamma_u equal to 1, so that the gradient can back-propagate through that timestep without much decay, as the memory cell c^t will update at every time step.

So, based on this, which answer would you choose for this question?

So could this be the answer?
Betty’s model (removing gate r), because if gate u is 1 for a timestep, the gradient can propagate back through that timestep without much decay.

The question asks which of the following models is more likely to work without vanishing gradient problems, even when training on very long input sequences.

By choosing gamma_u equal to 1 always, do you think you are going to make the model remember its initial value at every timestep without decay?

To get this question's answer, you also need to consider the question statement. Tell me first: which two options can clearly be eliminated just by reading the question?

Ah sorry, of course gamma_u must always be 0 for the initial value to be remembered, i.e. for c^t to be equal to c^(t-1). Sorry for the typo.
Betty’s model (removing gate r), because if gate u is 0 for a timestep, the gradient can propagate back through that timestep without much decay.
This is what I actually wanted to write.

This answer will not hold true, as the question mentions Betty's proposal was for removing gate r if gate u is 1. So this is again wrong. Remember, I told you the question also holds some part of the answer.

This part you got right. I think now you will choose the right answer :slight_smile:

Don’t be behind Betty :joy: :wink:

Regards
DP

Actually, I should have written 0 instead of 1 in my first answer for gamma_u. It was my fault. So I think the answer is the third option. Am I thinking wrong?

You are not reading my response properly. By the third option you mean Betty's proposed model of removing gamma_r where gate u = 0, but based on the question, Betty's proposed model of removing gamma_r means that gamma_r is equal to 1. So the third option is incorrect.

Read my previous comment again.

Tell me which will be the right answer.

First, remove the two options based on what the question says about both of their proposed models. Then the two options left should instantly give you the right answer.

Alice's proposal was to simplify the GRU by removing gamma_u, i.e. setting gamma_u = 0 always, whereas

Betty's proposal was to simplify the GRU by removing gamma_r, i.e. setting gamma_r = 1 always.

So based on these two sentences, your second and third options are incorrect, as they contradict their proposed models.

You are left with the first and fourth options; now tell me which is the answer.

Sorry :slight_smile: Gamma_u must be 0, so Alice chooses a more suitable model. My understanding is that setting the reset gate to 1 will preserve the previous value. Did I get it right? So the second option?

No, setting the gate to 0 will preserve the previous value. Remember Prof Andrew Ng's video: when gamma_u equals 0, it is telling the cell at each timestep not to update the value.

I will share the particular video where Prof Ng explains this; listen to it again.

It is not only about Alice or Betty; it is about matching their proposed models with the options given.

What is the correct answer? Only the first and fourth options hold true statements according to the question given. Now choose the right answer.

Regards
DP

Gate u controls the update vector, and if set to 0, past gradients can be propagated back with less distortion.
If gate r is always set to 0, only the current input is used, regardless of past information. When r is 0, it can help avoid the vanishing gradient problem even when training on long input sequences. So I think the answer is the first option.
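For reference, in the full GRU from the lectures, gamma_r only enters the candidate. Here is a minimal sketch (again with my own variable names) of where the two gates sit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_full(c_prev, x_t, Wc, bc, Wu, bu, Wr, br):
    """One step of the full GRU: gamma_r gates how much past memory the candidate sees."""
    concat = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)        # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate
    # gamma_r scales c^(t-1) inside the candidate only:
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t
```

Betty's simplification, gamma_r = 1 always, leaves this identical to the simplified GRU sketched earlier; gamma_r = 0 would make the candidate ignore c^(t-1) and look only at the current input. Either way, how much the memory cell (and hence the gradient) is carried forward is governed by gamma_u in the last line, which is why the discussion keeps coming back to gamma_u.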

Yes, at last you got it, though it is more about decay than distortion.

The main difference between gamma_u and gamma_r is their significance in the activation of the sequence at each timestep.
Gamma_u basically tells the model when to update, and if you choose 0, you are telling the model not to update while the gradient propagates without much decay. Whereas gamma_r is the relevance of choosing 1 always (based on the sentence in the question), updating the sequence as if the current value is in line with the previous sequence memory, and that is why Betty's model is not accepted, because the model needs to work without vanishing gradient problems.
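A rough way to see the "without much decay" part (this is my own simplification: it only follows the direct path through the memory cell and treats the gate as a constant):

```latex
c^{\langle t \rangle} = \Gamma_u \,\tilde{c}^{\langle t \rangle} + (1-\Gamma_u)\, c^{\langle t-1 \rangle}
\;\;\Rightarrow\;\;
\frac{\partial c^{\langle t \rangle}}{\partial c^{\langle t-1 \rangle}} \approx 1-\Gamma_u
\;\;\Rightarrow\;\;
\frac{\partial c^{\langle T \rangle}}{\partial c^{\langle 1 \rangle}} \approx \prod_{t=2}^{T} \bigl(1-\Gamma_u^{\langle t \rangle}\bigr)
```

When Gamma_u ≈ 0 each factor is ≈ 1, so the product stays close to 1 no matter how long the sequence is; when Gamma_u ≈ 1 the gradient instead has to pass through the tanh and the weight matrix at every step, which is where it tends to shrink.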

Regards
DP

Thank you for being patient with me :smiley:

Happy to Help!!!

Keep learning!!!

Regards
DP

I still don’t understand. This question makes no sense to me. I did solve it correctly, but only by eliminating all the wrong answers. I still have no idea how the correct answer makes any sense at all.

My solution

I remembered from the lectures that the "r" gate is the optional gate that makes things better. Also, just looking at the formulas, setting the "u" gate to 0 makes no sense, since according to the formulas c^t would always be exactly c^(t-1), and the network would never actually do anything besides always outputting the first ever c value it saw (the initialization value).

So the answer is either option 3 or option 4. Now, looking at the question, it talks about the "u" gate being set to either 0 or 1. This seems incredibly random. I figured out that if the "u" gate is 1, then the activation a is overwritten, so probably the "u" gate needs to be 0 in order for the activation not to get overwritten and therefore for the gradient to propagate.
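A tiny self-contained check of both degenerate cases (toy sizes and random weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_x, T = 4, 3, 50
Wc = rng.normal(size=(n_c, n_c + n_x))   # fixed candidate weights (toy values)
bc = np.zeros(n_c)
c0 = rng.normal(size=n_c)                # initial memory cell c^<0>

for gamma_u in (0.0, 1.0):               # force the update gate to a constant
    c_t = c0.copy()
    for _ in range(T):
        x_t = rng.normal(size=n_x)
        c_tilde = np.tanh(Wc @ np.concatenate([c_t, x_t]) + bc)
        c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_t
    print(f"gamma_u={gamma_u}: unchanged from c^<0>? {np.allclose(c_t, c0)}")

# gamma_u = 0: c^<t> never moves from its initial value (the cell is frozen);
# gamma_u = 1: c^<t> is overwritten by the candidate at every single step.
```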

The Problem

But jeez, the answers to the question were confusing. The question talks about setting the "u" gate to 0, and the corresponding answers talk about the gradient being able to propagate because of the "r" gate being a certain value for a time step? What? No! The reason setting the "u" gate to 0 is a bad idea is that it messes up the whole formula. It won't even work on short input sequences, never mind the longer ones. Setting the "r" gate to 1 at least keeps the formula mostly intact.

And I understand you maybe wanted to test two things at once:

  1. whether we know which gate is more important, “u” or “r”
  2. whether we understand when the activation would be overwritten

But I feel like it was a very confusing thing to ask.