W4_A1_Video Lecture on Forward & Backward functions

Hi, I have a question regarding the fifth video: Building Blocks of Deep Neural Networks.

Referring to the above screenshot, Professor Andrew Ng explains that we use da[l] to do backpropagation between the different layers. I didn’t really get why we need da[l] in this case. Based on the previous videos, it seems like we only need dz[l], dw[l] and db[l] to do backpropagation.

Is there something I am missing?

Thanks in advance!

Have you seen the first assignment for week 4 yet?

Hi Balaji:

Not yet, but I think I figured it out by watching the following videos for week 4. Thanks!

Hi @Marcia_Ma

In backpropagation, the error at each layer is propagated back so that the weights and biases of the earlier layers can be updated. The error at the output layer (dz[L]) is calculated from the loss function, while the error at each hidden layer (da[l]) is calculated from the error of the next layer (dz[l+1]) and the weights of the next layer (w[l+1]). This allows the error to be propagated backwards through the network, so the weights and biases can be updated in a way that minimizes the overall error.

da[l] = w[l+1].T · dz[l+1]   (equivalently, da[l-1] = w[l].T · dz[l])

da[l] is used to calculate dz[l], which is used to calculate da[l-1], which is used to calculate dz[l-1], and so on.

da[l] is the gradient of the loss function with respect to the activations of the l-th layer.

dz[l] is the gradient of the loss function with respect to the pre-activation of the l-th layer.

dw[l] and db[l] are the gradients of the loss function with respect to the weights and biases of the l-th layer.

dz[l], dw[l], and db[l] are used to update the weights and biases of the l-th layer, but da[l] is necessary in order to calculate them. In other words, da[l] is an intermediate step in the backpropagation process, and it is used to calculate the other gradients that are needed to update the weights and biases.

To sum up, da[l] is used to calculate the error at the current layer, dz[l], which is the gradient of the loss function with respect to the pre-activation of the current layer. This in turn is used to calculate dw[l] and db[l], which are the gradients of the loss function with respect to the weights and biases of the current layer, respectively. So in order to update the weights and biases of the current layer, we need all three gradients dz[l], dw[l] and db[l], which are calculated using da[l].
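To make this concrete, here is a minimal numpy sketch of one backward step through a hidden layer (the function name, argument names, and the choice of ReLU are only for illustration, not the exact helper functions from the assignment):

```python
import numpy as np

def one_layer_backward(dA_l, Z_l, A_prev, W_l):
    """One backward step for layer l, assuming a ReLU activation.
    dA_l   : gradient of the loss w.r.t. the activations of layer l
    Z_l    : pre-activations of layer l (cached during forward prop)
    A_prev : activations of layer l-1 (cached during forward prop)
    W_l    : weight matrix of layer l
    """
    m = A_prev.shape[1]                                      # number of examples
    dZ_l = dA_l * (Z_l > 0)                                  # dz[l] = da[l] * g'(z[l]) for ReLU
    dW_l = (1.0 / m) * dZ_l @ A_prev.T                       # gradient w.r.t. w[l]
    db_l = (1.0 / m) * np.sum(dZ_l, axis=1, keepdims=True)   # gradient w.r.t. b[l]
    dA_prev = W_l.T @ dZ_l                                   # da[l-1], handed to the previous layer
    return dZ_l, dW_l, db_l, dA_prev
```

Notice that the step both consumes da[l] and produces da[l-1]: the returned dA_prev is exactly the input that the previous layer’s backward step needs, which is why da has to be carried through every layer.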

Hope you got it.

Regards
Muhammad John Abbas

Hi @Marcia_Ma @Muhammad_John_Abbas @balaji.ambresh

Welcome @Marcia_Ma to the Community!

dA^{[l]} has to be found in order to calculate dZ^{[l]}, dW^{[l]}, and db^{[l]}, unless we use the brief way, i.e. substitute the chain rule so that these variables are calculated directly.

The equations in this image

are only for the last layer, to calculate dZ^{[l]}, dW^{[l]}, and db^{[l]}, and the activation function there is softmax. If it isn’t softmax, these equations will change, because they are the brief (already simplified) form.

The original equations for calculating dZ^{[l]}, dW^{[l]}, and db^{[l]}, and where they come from, are:

dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}}
dW^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}
db^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial b^{[l]}}

That’s called the chain rule of derivatives; by substituting in this way we remove the separate step of calculating dA^{[l]}.
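In the same informal notation, the chain rule also gives the quantity that is passed back to the previous layer, i.e. the dA^{[l-1]} discussed earlier in this thread (it follows from Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}):

dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]}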

But if the last layer isn’t softmax, we should change these equations according to the activation function. For example, if the last-layer activation function is sigmoid, we first calculate

dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} = \left[ -\frac{y^{(1)}}{a^{(1)}} + \frac{1-y^{(1)}}{1-a^{(1)}}, \; \dots, \; -\frac{y^{(m)}}{a^{(m)}} + \frac{1-y^{(m)}}{1-a^{(m)}} \right]

and then calculate dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} according to dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}}. For the sigmoid function the second factor is the sigmoid derivative \frac{\partial A^{[L]}}{\partial Z^{[L]}} = A^{[L]}(1-A^{[L]}) (elementwise for every example), so dZ^{[L]} = dA^{[L]} \cdot A^{[L]}(1-A^{[L]}).
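Putting those two pieces together for a sigmoid output with the binary cross-entropy loss, the expression simplifies neatly, which is exactly the kind of shortcut ("brief way") mentioned above:

dZ^{[L]} = dA^{[L]} \cdot A^{[L]}\left(1-A^{[L]}\right) = \left(-\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}\right) \cdot A^{[L]}\left(1-A^{[L]}\right) = A^{[L]} - Y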

So, concretely, we must have dA^{[l]} for each layer (except layer 0, the input) to get all of dZ^{[l]}, dW^{[l]}, and db^{[l]} by the chain rule, unless we fold that step into the equations as shown above.

Cheers,
Abdelrahman