Hi community, I need your help with this topic.
I started out writing a question, but as I worked through it, it turned into my own explanation. Please read through it and I would appreciate your inputs.
Suppose the weight values are greater than 1: then the activations will grow exponentially with the number of layers as they accumulate during forward prop. Also, as in the slide (attached below), we aren't using any activation (it's linear), so as we go forward layer by layer the output A[L] = g(Z[L]) (where g is the identity / no activation) keeps increasing, since Z[L] is built from the previous layer's output with nothing to squash it:
Z1 = np.dot(W1,X) + b1
A1 = Z1 (no activation / linear, so A1 is just Z1)
Z2 = np.dot(W2,A1) + b2
A2 = Z2 (again linear, so A2 = Z2, and it comes out big)
Now the final output A2 is large.
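To see how this compounds with depth, here is a minimal sketch (my own toy setup, not from the slide; the width, the depth, and the 1.5 factor are all made-up numbers): a deep network with linear activations and every weight matrix set to 1.5 times the identity.

import numpy as np

np.random.seed(0)
n_units, n_layers = 4, 30                 # hypothetical width and depth
A = np.random.randn(n_units, 1)           # toy input
for l in range(1, n_layers + 1):
    W = 1.5 * np.eye(n_units)             # weights slightly "bigger" than identity
    b = np.zeros((n_units, 1))
    Z = np.dot(W, A) + b
    A = Z                                 # linear activation, so A[l] = Z[l]
    print(l, float(np.linalg.norm(A)))    # the norm grows roughly like 1.5**l

Replacing 1.5 with 0.5 in the same loop makes the norms collapse towards zero instead, which is the mirror-image (vanishing activations) case.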
Calculating the gradients:
dZ2 = A2 - Y (dZ2 will be large, since A2 was large above)
dW2 = (1/m) * np.dot(dZ2,A1.T) (A1 itself is not that large, since it is only the first layer's output and the accumulation has barely started, but dZ2 is large, so dW2 will still be large)
db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True) (db2 will be large because of dZ2)
dZ1 = np.dot(W2.T,dZ2) (with a linear g, g'(Z1) = 1, so there is no extra factor here; dZ2 was large above, so dZ1 is large too)
dW1 = (1/m) * np.dot(dZ1,X.T) (dW1 is large because dZ1 is large above)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True) (db1 is large because dZ1 is large above)
Hence all the gradients above will be large (exploding gradients); a numeric check of this is sketched after the update equations below.
So during gradient descent these updates will be huge (overshooting the minimum of the cost function?):
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
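To check the exploding-gradient side numerically, here is a sketch of the same two-layer linear network with deliberately oversized weights (the factor 100 is just a stand-in for what many layers of values greater than 1 would accumulate; all shapes and numbers are made up):

import numpy as np

np.random.seed(1)
m = 10                                      # toy number of examples
X = np.random.randn(3, m)
Y = np.random.randn(1, m)
W1 = 100 * np.random.randn(4, 3)            # exaggeratedly large weights
b1 = np.zeros((4, 1))
W2 = 100 * np.random.randn(1, 4)
b2 = np.zeros((1, 1))
learning_rate = 0.01

# forward prop with linear activations
Z1 = np.dot(W1, X) + b1
A1 = Z1
Z2 = np.dot(W2, A1) + b2
A2 = Z2

# backprop (g'(Z) = 1 everywhere, since g is linear)
dZ2 = A2 - Y
dW2 = (1/m) * np.dot(dZ2, A1.T)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = np.dot(W2.T, dZ2)
dW1 = (1/m) * np.dot(dZ1, X.T)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)

print(np.linalg.norm(W1))                   # scale of the weights themselves
print(np.linalg.norm(dW1))                  # gradient is orders of magnitude larger
print(np.linalg.norm(learning_rate * dW1))  # even a small learning rate gives a huge step

The last line is the point: the update W1 - learning_rate * dW1 moves W1 far beyond its own scale, which is what overshooting looks like.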
Now suppose instead that we use a sigmoid in every layer during forward prop. Values less than 1 shrink exponentially with the number of layers, and since the sigmoid output always lies between 0 and 1, the values get smaller layer by layer:
Z1 = np.dot(W1,X) + b1
A1 = sigmoid(Z1) (NumPy has no np.sigmoid, so assume a helper sigmoid function; see the sketch below)
Z2 = np.dot(W2,A1) + b2
A2 = sigmoid(Z2)
A2 will be small.
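As a side note, NumPy has no np.sigmoid, so here is a minimal helper of my own (matching what the lines above assume), plus a quick check of why the sigmoid keeps shrinking things:

import numpy as np

def sigmoid(z):
    # squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 5)
print(sigmoid(z))                        # every output lies strictly between 0 and 1
print(sigmoid(z) * (1 - sigmoid(z)))     # the derivative sigmoid'(z), never larger than 0.25

That 0.25 bound on the derivative is what matters for the backward pass below.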
dZ2 = A2 - Y (dZ2 will be small, as A2 was small)
dW2 = (1/m) * np.dot(dZ2,A1.T) (A1 here is only the first layer's output, so the shrinking hasn't accumulated much yet, but dZ2 is already small, so dW2 will be small)
db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True) (db2 will be small because dZ2 is small)
dZ1 = np.dot(W2.T,dZ2) * A1 * (1 - A1) (for a sigmoid, g'(Z1) = A1 * (1 - A1), which is never more than 0.25, so the already small dZ2 gets multiplied down again; dZ1 is even smaller)
dW1 = (1/m) * np.dot(dZ1,X.T) (dW1 is small because dZ1 is small above)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True) (db1 is small because dZ1 is small above)
So during gradient descent these updates will be very small (slow convergence; a sketch of how this shrinking compounds over many layers follows the updates below):
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
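To see that shrinking compound over depth, here is a sketch (again with made-up sizes, biases omitted for brevity) that pushes an upstream gradient backwards through many sigmoid layers; every layer multiplies it by W.T and by the sigmoid derivative A * (1 - A) <= 0.25:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(2)
n_units, n_layers = 4, 20                 # hypothetical width and depth
A = np.random.randn(n_units, 1)           # toy input
caches = []
for l in range(n_layers):                 # forward pass, sigmoid in every layer
    W = 0.5 * np.random.randn(n_units, n_units)
    A = sigmoid(np.dot(W, A))
    caches.append((W, A))

dA = np.ones((n_units, 1))                # some upstream gradient at the output
for W, A in reversed(caches):             # backward pass through the layers
    dZ = dA * A * (1 - A)                 # sigmoid derivative factor, at most 0.25
    dA = np.dot(W.T, dZ)                  # gradient w.r.t. the previous layer's output
    print(float(np.linalg.norm(dA)))      # shrinks layer after layer

With these sizes each backward step multiplies the gradient norm by at most 0.25 times the norm of W, so the printed values shrink rapidly towards zero, which is the slow-convergence picture above.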
What do you all think? Is this the correct way of understanding vanishing / exploding gradients?
One more question → this is Andrew Ng's slide; can anyone explain why he uses an identity matrix for the weight matrix initialisation?