Lecture Notebook: Vanishing Gradients; Confusion

Hey Guys,
In this lecture notebook, I have a small confusion. When we generate the toy values for h and x, why do both of them have the same number of time-steps? Isn’t h supposed to be indexed from time-step 0 to T, and x from time-step 1 to T?

import numpy as np

np.random.seed(12345)
t = 20
h = np.random.randn(5, t)  # toy hidden states, shape (5, t)
x = np.random.randn(5, t)  # toy inputs, shape (5, t)
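
To make my point concrete, here is a sketch of what I would have expected instead (purely illustrative, keeping the same toy setup), with h carrying one extra column for the initial state h_0:

np.random.seed(12345)
t = 20
h = np.random.randn(5, t + 1)  # h_0, h_1, ..., h_T -> t + 1 hidden states
x = np.random.randn(5, t)      # x_1, ..., x_T      -> t inputs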

P.S. - I believe that if my statement above is correct, this notebook could be modified to correspond better to what we have learnt in the previous weeks and to the mathematical notation. I would be more than happy to suggest the changes, if so.

Cheers,
Elemento

Hey @Elemento

Actually no. When calculating the gradient, you need 20 hidden states as well as 20 inputs.

Cheers

P.S. I changed my response since I realized I was wrong.

Hey @arvyzukai,
I am a little unsure about this. Let’s say the number of time-steps is T = 2. In this case, we feed h_0, h_1, h_2 as the hidden states. Now, when we calculate the gradients for updating W_{hh}, we would need \frac{\partial{h_2}}{\partial{h_1}} and \frac{\partial{h_1}}{\partial{h_0}}.

Now, the first expression needs h_1 and the second expression needs h_0, and depending on which activation function we use to get \hat{y_2} from h_2, I believe we could need h_2 as well. Can you please tell me what I am missing here?
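
To make this concrete, writing the two Jacobians out with the usual update h_i = \sigma(W_{hh} h_{i-1} + W_{hx} x_i + b_h):

\frac{\partial h_2}{\partial h_1} = W_{hh}^T \text{diag} (\sigma'(W_{hh} h_1 + W_{hx} x_2 + b_h)), \quad \frac{\partial h_1}{\partial h_0} = W_{hh}^T \text{diag} (\sigma'(W_{hh} h_0 + W_{hx} x_1 + b_h))

so these two factors only involve h_0 and h_1, while h_2 would only show up through \frac{\partial \hat{y_2}}{\partial h_2}.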

In the notebook, we have taken the proportionality and not the actual formulation. Since we are simply dropping all the derivatives other than those involving the h_i's, I believe that in that case we would need T hidden states, and otherwise T+1. What do you think?

Cheers,
Elemento

Hey @Elemento

I think you are correct on that - one hidden state (the h_0) is missing from the calculations. The way I understand the implementation, the contribution to the gradient is calculated for each step, and the way it is implemented is different from the formula:

\prod_{t\ge i > k} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{t\ge i > k} W_{hh}^T \text{diag} (\sigma'(W_{hh} h_{i-1} + W_{hx} x_i + b_h))

where in the formula h_{i-1} is replaced with h_i in the code. So I think the course creators deliberately chose not to include \frac{\partial h_1}{\partial h_{0}}, as it does not make much difference. Does that make sense?
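
Concretely, the notebook's loop computes (using its variable names)

p *= W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h))      # uses h_i

whereas a literal reading of the formula would need the previous state, something like

p *= W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i-1]+ W_hx@x[:,i] + b_h))    # uses h_{i-1}

which would also require h to hold t + 1 states (h_0 through h_T). Since the toy values are random anyway, this does not change the point the notebook makes.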

I’m not sure about their implementation anyway (to quote the lab: “You don’t have to worry about the calculus”). But for the sake of my point:
For example, why do we need \text{diag} instead of just replacing b_h with b_h^T (that way the input to sigmoid_gradient, i.e. \sigma', would stay a vector)? I mean, in the code, \text{diag} (\sigma'(W_{hh} h_{i-1} + W_{hx} x_i + b_h)) gives exactly the same result as \sigma'(W_{hh} h_{i-1} + W_{hx} x_i + b_h^T). Is there an explanation why it is done this way?
in the code:
p *= W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h))
is exactly the same as:
p *= W_hh.T@np.squeeze(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h.T))
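
A quick way to check this numerically (assuming the notebook's W_hh, W_hx, b_h, h, x and sigmoid_gradient are already defined, and picking any valid time-step i):

i = 0
with_diag = W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h))
with_squeeze = W_hh.T@np.squeeze(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h.T))
print(np.allclose(with_diag, with_squeeze))  # prints True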

Cheers

P.S. in the explanation it says:
“The gradient of the activation function is a vector of size equal to the hidden state size, and the \text{diag} converts that vector into a diagonal matrix.”

while in the code they use np.diag which is the opposite.

Hey @arvyzukai,

I am a little confused as to how it is the opposite. numpy.diag does convert a vector into a diagonal matrix, doesn’t it?

Extract a diagonal or construct a diagonal array (from the documentation).

I am a little confused about this as well. diag creates a diagonal matrix, but if we replace b_h with b_h^T, won’t broadcasting produce a completely different matrix, which may or may not be diagonal? So how come both result in the same output?

Cheers,
Elemento

Hey @Elemento

No, the way it is used in the code is the opposite. It is the “Extract a diagonal” part of

Extract a diagonal or construct a diagonal array.

Here is the example from the documentation:

x = np.arange(9).reshape((3,3))

x
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
np.diag(x)
array([0, 4, 8])

Hey @arvyzukai,
But it can do both, right?

np.diag(np.diag(x))
array([[0, 0, 0],
       [0, 4, 0],
       [0, 0, 8]])

Cheers,
Elemento

Yes, it can do both. But that is not the point :slight_smile:

What I mean is that the way it is used is different from the explanation, and I don’t see the point of it:

  • W_hh@h[:,i] is (5,)
  • and W_hx@x[:,i] is (5,)
  • but b_h is of shape (5, 1)
  • and when we add these three terms, (W_hh@h[:,i]+ W_hx@x[:,i] + b_h).shape is (5, 5) because of broadcasting.

Then, after we apply sigmoid_gradient (which is element-wise and does not change the dimensionality), we take the diagonal of that result… which is exactly the same as if we had just transposed b_h. As I said, you can try it in the code:

p *= W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h))
is exactly the same as:

p *= W_hh.T@np.squeeze(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h.T))
or
p *= W_hh.T@sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + np.squeeze(b_h))

and my second and third variations do not involve a (5, 5) shape anywhere - it is vectors all the way, with the same result.

Not only is the result the same, but the explanation does not match the code. Maybe there is some mathematical reason behind it (I’m not very deep into the gradient calculus), but my personal opinion is that the explanation does not make sense as written.
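
For what it’s worth, the mechanism behind this is just NumPy broadcasting (a minimal illustration, nothing specific to the notebook, assuming numpy is imported as np): adding a (5,) vector to a (5, 1) column produces a (5, 5) matrix, and the diagonal of that matrix is exactly the element-wise sum you get from the all-vector version.

v = np.arange(5.0)                 # stands in for W_hh@h[:,i] + W_hx@x[:,i], shape (5,)
b = np.arange(5.0).reshape(5, 1)   # stands in for b_h, shape (5, 1)
m = v + b                          # broadcasting -> shape (5, 5), m[r, c] = v[c] + b[r, 0]
print(np.allclose(np.diag(m), v + b.squeeze()))  # prints True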

Cheers

Hey @arvyzukai,

Please bear with me. I just wanted to know whether you are confused about this.

Hey @Elemento

I’m not. Add these lines to the prod function:

def prod(k):
    p = 1 
    for i in range(t-1, k-2, -1):
        print((W_hh@h[:,i]).shape, (W_hx@x[:,i]).shape, b_h.shape)
        print((W_hh@h[:,i]+ W_hx@x[:,i] + b_h).shape)
        p *= W_hh.T@np.diag(sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h))
    return p

What you get is:

(5,) (5,) (5, 1)
(5, 5)

In contrast try this:

b_h = np.random.randn(5)  # two cells above (not (5, 1))
and:

def prod(k):
    p = 1 
    for i in range(t-1, k-2, -1):
        print((W_hh@h[:,i]).shape, (W_hx@x[:,i]).shape, b_h.shape)
        print((W_hh@h[:,i]+ W_hx@x[:,i] + b_h).shape)
        p *= W_hh.T@sigmoid_gradient(W_hh@h[:,i]+ W_hx@x[:,i] + b_h)  # no `np.diag`
    return p

What you get is:

(5,) (5,) (5,)
(5,)

and the resulting product is exactly the same:

array([1.30439039e-14, 5.51993629e-14, 1.59411481e-04, 1.56147172e-04,
       1.98738351e-04, 1.83329914e-04, 1.71458518e-04, 2.37044012e-12,
       6.46685803e-11, 2.94824675e-03, 4.85764834e-03, 1.17684166e-02,
       1.52422977e-02, 1.72018898e-02, 2.75198971e-08, 3.16257422e-08,
       1.32908920e-01, 2.93197471e-01, 1.03630704e+00, 1.20721291e+00])

Hey @arvyzukai,
The only thing I am able to acquire is more confusion, unfortunately: what is the issue that you are raising, what is the issue that I am raising, and should we do something about it or not :joy:

Cheers,
Elemento

@Elemento
OK, to summarize:

What I’m saying is that this Lecture Notebook has some strange implementation details and explanations. The issue I’ve raised might be due to some mathematical reason (the Jacobian matrix), and maybe it makes sense with a batch dimension, but the way it is explained and coded now, it does not make sense.

Also, it is unusual (at least to me) that the partial derivatives are calculated with respect to the previous hidden state and not with respect to the weights (W_hh, W_hx, b_h and b_x, as usual).

And to get back to your question (or confusion) about why we are missing one hidden state: I think the course creators deliberately chose not to include the initial hidden state h_0. There should be 20 values of \frac{\partial{h_i}}{\partial h_{i-1}}, and the way these values are calculated uses only the h_i part (which I think would correspond to t+1, but since the values are random it does not make any difference).
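
If one wanted the toy setup to match the notation exactly, a sketch could look something like this (my own suggestion, not the notebook's code; the weight shapes and sigmoid_gradient below are stand-ins, and the loop does the actual matrix product from the formula earlier in the thread):

import numpy as np

np.random.seed(12345)
t = 20
hidden_size = 5

W_hh = np.random.randn(hidden_size, hidden_size)
W_hx = np.random.randn(hidden_size, hidden_size)
b_h = np.random.randn(hidden_size)

h = np.random.randn(hidden_size, t + 1)   # h[:, 0] is h_0, ..., h[:, t] is h_T
x = np.random.randn(hidden_size, t)       # x[:, 0] is x_1, ..., x[:, t - 1] is x_T

def sigmoid_gradient(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def jacobian_product(k):
    # prod_{t >= i > k} dh_i/dh_{i-1}, accumulated as a real matrix product
    p = np.eye(hidden_size)
    for i in range(t, k, -1):
        d = sigmoid_gradient(W_hh @ h[:, i - 1] + W_hx @ x[:, i - 1] + b_h)  # uses h_{i-1} and x_i
        p = p @ (W_hh.T @ np.diag(d))    # here np.diag *constructs* a diagonal matrix
    return p

With t + 1 hidden states, the factor \frac{\partial h_1}{\partial h_0} is included as well.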

Cheers.
