Vanishing / Exploding Gradients

Hi community, I need your help with this topic.

I was originally going to ask a question, but while writing it out I ended up working through an explanation myself. Please read through it; I would appreciate your input.

Values greater than 1 will grow exponentially with the number of layers during forward propagation. Also, as in the slide (attached below), we aren't using any activation (linear), so at each layer the output is A[L] = g(Z[L]) with g linear (or absent). A[L] therefore keeps increasing, since Z[L] is built from the previous layer's outputs with no activation to squash them.

Z1 = np.dot(W1,X) + b1
A1 = Z1                  (linear activation)
Z2 = np.dot(W2,A1) + b2
A2 = Z2                  (big value A2)
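To make the "values > 1 grow exponentially" point concrete, here is a minimal NumPy sketch. The depth, width, and the 1.5·I weights are my own illustrative choices (the 1.5·I matches the slide's example); it is not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 5, 50                      # layer width and depth (hypothetical values)
X = rng.standard_normal((n, 1))

# every weight matrix is 1.5 * I, as in the slide; biases are zero
W = 1.5 * np.eye(n)

A = X
for layer in range(L):
    A = W @ A                     # Z = W @ A_prev, and A = Z (linear activation)

# the norm of the activations grows like 1.5**L relative to the input
print(np.linalg.norm(X), np.linalg.norm(A))
```

With 50 layers the ratio is 1.5**50, roughly 6e8, so the activations blow up long before the network is even particularly deep.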

Now the final output A2 is large.

Calculating the gradients:

dZ2 = A2 - Y                                      (dZ2 will be large because A2 is large)
dW2 = (1/m) * np.dot(dZ2,A1.T)                    (A1 itself is not huge, since it is only the first layer's output and the accumulation has barely started, but dZ2 is large, so dW2 will still be large)
db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True)    (db2 will be large because of dZ2)
dZ1 = np.dot(W2.T,dZ2)                            (with a linear activation g'(Z1) = 1, so there is no elementwise factor here; dZ2 is large, so dZ1 is large)
dW1 = (1/m) * np.dot(dZ1,X.T)                     (dW1 large because dZ1 is large)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  (db1 large because dZ1 is large)

Hence all the gradients above will be large (exploding gradients),
so during gradient descent the updates will be huge (overshooting the minimum of the cost function?):
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
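The whole argument above can be run end to end. This is my own sketch, with hypothetical layer sizes and deliberately large random initial weights (scaled by 10) to provoke the effect; it is not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4                                        # number of training examples (hypothetical)
X = rng.standard_normal((3, m))              # 3 input features
Y = rng.standard_normal((1, m))

# deliberately large initial weights to provoke large activations
W1 = 10 * rng.standard_normal((5, 3)); b1 = np.zeros((5, 1))
W2 = 10 * rng.standard_normal((1, 5)); b2 = np.zeros((1, 1))

# forward prop with linear activations, so A = Z
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# backprop; with linear activations g'(Z) = 1, so no elementwise factor
dZ2 = A2 - Y
dW2 = (1 / m) * dZ2 @ A1.T
dZ1 = W2.T @ dZ2
dW1 = (1 / m) * dZ1 @ X.T

# both gradients come out far larger than the data scale
print(np.abs(dW1).max(), np.abs(dW2).max())
```

Even with only two layers, the large weights make A2, then dZ2, then every gradient downstream of it large, exactly as argued above.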

Now suppose we use sigmoid in every layer during forward prop. Values less than 1 will decrease exponentially as the number of layers increases, and since the sigmoid output is between 0 and 1, the values get smaller layer by layer:

Z1 = np.dot(W1,X) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(W2,A1) + b2
A2 = sigmoid(Z2)

(Note: sigmoid here is a helper function we define ourselves; there is no np.sigmoid in NumPy.)

A2 will be small.

dZ2 = A2 - Y                                      (dZ2 will be small as A2 was small)
dW2 = (1/m) * np.dot(dZ2,A1.T)                    (A1 may also be small, since the shrinking has barely started, but dZ2 is already small, so dW2 will be small)
db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True)    (db2 will be small because dZ2 is small)
dZ1 = np.dot(W2.T,dZ2) * A1 * (1 - A1)            (the sigmoid derivative A1 * (1 - A1) is at most 0.25, and dZ2 is small, so dZ1 is even smaller)
dW1 = (1/m) * np.dot(dZ1,X.T)                     (dW1 small because dZ1 is small)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  (db1 small because dZ1 is small)

So during gradient descent the updates will be small (slow convergence):
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
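One detail that makes the vanishing case very concrete: the sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) is at most 0.25, so even in the best case each layer multiplies the backprop signal by at most 0.25. A toy sketch (the depth is my own hypothetical choice, and it ignores the weight factors that real backprop also multiplies in):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

L = 20                                   # depth (hypothetical)
grad = 1.0
for layer in range(L):
    z = 0.0                              # z = 0 is where sigmoid'(z) is largest
    d = sigmoid(z) * (1 - sigmoid(z))    # sigmoid derivative; equals 0.25 at z = 0
    grad *= d                            # one multiplicative factor per layer

print(grad)                              # 0.25**20, about 9.1e-13
```

So a 20-layer sigmoid network shrinks the gradient signal by at least a factor of 4 per layer, down to ~1e-12 overall, even before the weights get involved.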

What do you all think? Is this the correct way of understanding vanishing/exploding gradients?

One more question → this is Andrew Ng's slide; can anyone explain why he uses an identity matrix for the weight matrix initialisation?

Do you have any questions?

Just updated, please have a look.

It’s much simpler than your presentation.

Exploding gradients happen when the learning rate is too large, and the changes in the weights drive the solution farther away from the values that minimize the cost.

This problem only gets worse if the features are not normalized.
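A one-variable sketch of that overshoot effect, using the toy cost J(w) = w**2 (my own example; the learning rates that converge or diverge are specific to this cost, not universal thresholds):

```python
# Gradient descent on J(w) = w**2, whose minimum is at w = 0.
# For this cost the update is w -= lr * 2*w, i.e. w *= (1 - 2*lr),
# so any lr > 1 makes |1 - 2*lr| > 1 and the iterates diverge.
def step(w, lr):
    grad = 2 * w                     # dJ/dw
    return w - lr * grad

w_good, w_bad = 1.0, 1.0
for _ in range(10):
    w_good = step(w_good, lr=0.1)    # shrinks toward 0
    w_bad = step(w_bad, lr=1.1)      # overshoots farther every step

print(w_good, w_bad)                 # ~0.107 vs ~6.19
```

With lr=0.1 each step multiplies w by 0.8 and the iterate converges; with lr=1.1 each step multiplies it by -1.2, so every update lands farther from the minimum than the last.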

Great work, @faber_soaks! I think you are extending the lecture to also cover backprop, which makes a lot of sense because the topic is about the gradients.

The key point of the lecture is how the weights play a role in the vanishing/exploding gradient problem (note that weights are not the only player).

The similarity between your analysis and the lecture's is that both use the cases of “values > 1” and “values < 1” as examples. The difference is that the lecture uses weight initialization to explain both, while in your analysis you didn't explain the “> 1” case but explained the “< 1” case with sigmoid. I hope you have realized that you (and the lecture) don't need a sigmoid to have vanishing gradients.

The overall logic of your two examples is to demonstrate how multiplying values > 1 results in values >> 1, and how multiplying values < 1 results in values << 1, which makes sense to me :wink:

Not exactly the identity matrix, right? He used 1.5I and 0.5I, not I.

The reason for kI is, perhaps, because it is easy to calculate :smile:. E.g. (1.5I)^{20} = 1.5^{20} I \approx 3325I .
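A quick numeric check of that approximation (the matrix size here is arbitrary, my own choice):

```python
import numpy as np

# (1.5 I)^20 = 1.5**20 * I, so every diagonal entry is the same scalar power
M = np.linalg.matrix_power(1.5 * np.eye(3), 20)

print(1.5 ** 20)      # ≈ 3325.26
print(M[0, 0])        # the diagonal of the matrix power carries the same value
```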

Please also note that the lecture emphasizes weight initialization because it is how we get “values > 1” and “values < 1”. Again, the initial weights are not the only source of the two gradient problems, but they are the source being illustrated in the lecture.

Cheers,
Raymond

Wow, thanks for your detailed and great response @rmwkwok