Hi community, I need your help with this topic.

I was originally going to ask a question, but it turned into an explanation as I worked through it. Please read through — sorry, please read it over; I would appreciate your inputs.

If the weight values are greater than 1, activations will grow exponentially with the number of layers during forward propagation, accumulating layer by layer. Also, as in the slide **(attached below)**, we aren't using any activation (linear), so as we go forward, each layer's output A[L] = g(Z[L]) (where g is the identity) keeps increasing, since Z[L] is built from the previous layer's un-squashed output.

```
Z1 = np.dot(W1, X) + b1
A1 = Z1                      # linear activation: A1 = g(Z1) = Z1
Z2 = np.dot(W2, A1) + b2
A2 = Z2                      # A2 is already a big value
```
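A quick numerical sanity check of the growth idea (my own toy sketch, not from the slide — the layer count `L`, size `n`, and the choice of W as 1.5 times the identity are made up for illustration):

```python
# Toy example: forward prop through many linear layers whose weights
# are slightly above 1 -- the activations grow exponentially with depth.
import numpy as np

n, L = 4, 50                 # units per layer, number of layers (assumed)
W = 1.5 * np.eye(n)          # same weight matrix every layer, entries > 1
A = np.ones((n, 1))          # input activations

for _ in range(L):
    A = np.dot(W, A)         # linear activation: A[l] = Z[l] = W @ A[l-1]

print(A[0, 0])               # grows like 1.5**50, roughly 6.4e8
```

Swap 1.5 for 0.5 and the same loop shows the activations collapsing toward zero instead.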

Now the final output A2 is large.

Calculating the gradients:

dZ2 = A2 - Y **(hence dZ2 will be large as A2 was large above)**

dW2 = (1/m) * np.dot(dZ2,A1.T) **(here A1 is not that large, since it is only the first layer's output and the accumulation hasn't built up yet — but dZ2 is still large, so dW2 will be large anyway)**

db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True) **(db2 will be large because of dZ2)**

dZ1 = np.dot(W2.T,dZ2) (no extra factor here: dZ1 = np.dot(W2.T,dZ2) * g'(Z1), and with a linear g, g'(Z1) = 1)

**(dZ2 was large above, so dZ1 is large too)**

dW1 = (1/m) *(np.dot(dZ1,X.T))

**(dW1 is large because dZ1 is large above)**

db1 = (1/m) *(np.sum(dZ1, axis=1, keepdims=True))

**(db1 is large because dZ1 is large above)**

**Hence all the gradients above will be large** (exploding gradients).
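The backprop chain itself can be sketched numerically (an assumed toy setup, not from the course: identical weight matrices with entries above 1, and a dZ of order 1 at the output layer — with linear layers there is no g'(Z) factor to damp anything):

```python
# Toy example: repeatedly applying dZ[l] = W.T @ dZ[l+1] with ||W|| > 1
# makes the gradient magnitude explode as it flows back through layers.
import numpy as np

n, L = 4, 50                 # units per layer, number of layers (assumed)
W = 1.5 * np.eye(n)          # same weight matrix every layer, entries > 1
dZ = np.ones((n, 1))         # pretend dZ at the last layer is O(1)

for _ in range(L):
    dZ = np.dot(W.T, dZ)     # linear layers: g'(Z) = 1, so no damping

print(np.linalg.norm(dZ))    # grows like 1.5**50 -- exploded
```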

**So during gradient descent these updates will be huge** (overshooting the minimum of the cost function?)

W1 = W1 - learning_rate * dW1

b1 = b1 - learning_rate * db1

W2 = W2 - learning_rate * dW2

b2 = b2 - learning_rate * db2
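To see what "overshooting" looks like, here is a minimal 1-D sketch (my own hypothetical example, not from the course: cost f(w) = w², gradient 2w, and an effective step that is too large relative to the curvature):

```python
# Toy example: gradient descent on f(w) = w**2 with too large a step.
# Each update w -> w - 1.5 * 2w = -2w doubles |w|, so the cost grows
# instead of shrinking -- the update keeps jumping past the minimum.
w, lr = 1.0, 1.5             # hypothetical starting point and learning rate
history = []
for _ in range(5):
    grad = 2 * w             # df/dw for f(w) = w**2
    w = w - lr * grad        # standard gradient descent update
    history.append(w)

print(history)               # [-2.0, 4.0, -8.0, 16.0, -32.0]
```

Exploding gradients have the same effect even with a sensible learning rate, because the gradient itself supplies the huge step.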

**Now, if we use sigmoid in every layer during forward prop:**

Values less than 1 will decrease exponentially as the number of layers increases during forward propagation, and since the sigmoid output is always less than one, the values get smaller layer by layer.

```
def sigmoid(Z):              # NumPy has no np.sigmoid; define it
    return 1 / (1 + np.exp(-Z))

Z1 = np.dot(W1, X) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)
```

**A2 will be small.**
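A quick check of why (my own sketch): sigmoid squashes any input into (0, 1), and its derivative A · (1 − A) never exceeds 0.25, which is the factor that shrinks gradients layer by layer during backprop.

```python
# Sigmoid is bounded in (0, 1) and its derivative A*(1-A) peaks at 0.25.
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

Z = np.linspace(-10, 10, 1001)   # a grid of inputs, including Z = 0
A = sigmoid(Z)
dA = A * (1 - A)                 # sigmoid'(Z) = A * (1 - A)

print(A.min(), A.max())          # always strictly inside (0, 1)
print(dA.max())                  # 0.25, reached at Z = 0
```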

dZ2 = A2 - Y **(hence dZ2 will be small as A2 was small )**

dW2 = (1/m) * np.dot(dZ2,A1.T) **(here A1 may not be that small yet, since it is only the first layer's output and the accumulation hasn't built up — but dZ2 is still small, so dW2 will be small anyway)**

db2 = (1/m) * np.sum(dZ2,axis=1,keepdims=True) **(db2 will be small because dZ2 is small)**

dZ1 = np.dot(W2.T,dZ2) * A1 * (1 - A1) (the sigmoid derivative is g'(Z1) = A1 * (1 - A1), which is at most 0.25)

**(dZ2 was small above, and the A1 * (1 - A1) factor is at most 0.25, so dZ1 is even smaller)**

dW1 = (1/m) *(np.dot(dZ1,X.T))

**(dW1 is small because dZ1 is small above)**

db1 = (1/m) *(np.sum(dZ1, axis=1, keepdims=True))

**(db1 is small because dZ1 is small above)**

**so during gradient descent these updates will be small** (Slow convergence)

W1 = W1 - learning_rate * dW1

b1 = b1 - learning_rate * db1

W2 = W2 - learning_rate * dW2

b2 = b2 - learning_rate * db2
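The full vanishing chain can be sketched numerically (an assumed toy setup, not from the course: identity weight matrices and sigmoid activations sitting at 0.5, where the derivative is at its maximum of 0.25 — even then each backprop step shrinks dZ by a factor of 4):

```python
# Toy example: with sigmoid layers, each backprop step multiplies by
# g'(Z) = A*(1-A) <= 0.25, so the gradient shrinks exponentially with depth.
import numpy as np

n, L = 4, 50                 # units per layer, number of layers (assumed)
W = np.eye(n)                # even with perfectly well-behaved weights...
dZ = np.ones((n, 1))         # pretend dZ at the last layer is O(1)

for _ in range(L):
    A = np.full((n, 1), 0.5)           # sigmoid output at its best case
    dZ = np.dot(W.T, dZ) * A * (1 - A) # dZ[l] = W.T @ dZ[l+1] * g'(Z[l])

print(np.linalg.norm(dZ))    # about 2 * 0.25**50, roughly 1.6e-30 -- vanished
```

So dW1 and db1, which are built from dZ1, end up far smaller than dW2 and db2, and the early layers barely learn.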

What do you all think? Is this the correct way of understanding vanishing/exploding gradients?

**One more question → This is Andrew Ng's slide; can anyone explain why he uses an identity matrix for the weight-matrix initialisation?**