W4_A1_Video Lecture on Forward & Backward functions

Hi, I have a question regarding the fifth video: Building Blocks of Deep Neural Networks.

Referring to the above screenshot, Professor Andrew Ng explains that we use da[l] to do backpropagation between the different layers. I didn’t really get why we need da[l] in this case. Based on the previous videos, it seems like we only need dz[l], dw[l] and db[l] to do backpropagation.

Is there something I am missing?

Thanks in advance!

Have you seen the first assignment for week 4 yet?

Hi Balaji:

Not yet, but I think I figured it out by watching the following videos for week 4. Thanks!

Hi @Marcia_Ma

In backpropagation, the error at each layer is propagated back so that the weights and biases of the earlier layers can be updated. The error at the output layer (dz[L]) is calculated from the loss function, while the error at each hidden layer (da[l]) is calculated from the error of the next layer (dz[l+1]) and the weights of the next layer (w[l+1]). This allows the error to be propagated backwards through the network, so the weights and biases can be updated in a way that minimizes the overall error.

da[l] = w[l+1].T · dz[l+1]   (equivalently, da[l-1] = w[l].T · dz[l])

da[l] is used to calculate dz[l], which is used to calculate da[l-1], which is used to calculate dz[l-1], and so on.

da[l] is the gradient of the loss function with respect to the activations of the l-th layer.

dz[l] is the gradient of the loss function with respect to the pre-activation of the l-th layer.

dw[l] and db[l] are the gradients of the loss function with respect to the weights and biases of the l-th layer.

dz[l], dw[l], and db[l] are used to update the weights and biases of the l-th layer, but da[l] is necessary in order to calculate them. In other words, da[l] is an intermediate step in the backpropagation process, and it is used to calculate the other gradients that are needed to update the weights and biases.

To sum up, da[l] is used to calculate the error at the current layer, dz[l], which is the gradient of the loss function with respect to the pre-activation of the current layer. This in turn is used to calculate dw[l] and db[l], which are the gradients of the loss function with respect to the weights and biases of the current layer, respectively. So in order to update the weights and biases of the current layer, we need all three gradients dz[l], dw[l] and db[l], which are calculated using da[l].
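To make this concrete, here is a minimal numpy sketch of one backward step through a hidden layer (the function name, argument names, and the choice of ReLU are only for illustration, not the exact helper functions from the assignment):

```python
import numpy as np

def one_layer_backward(dA_l, Z_l, A_prev, W_l):
    """One backward step for layer l, assuming a ReLU activation.
    dA_l   : gradient of the loss w.r.t. the activations of layer l
    Z_l    : pre-activations of layer l (cached during forward prop)
    A_prev : activations of layer l-1 (cached during forward prop)
    W_l    : weight matrix of layer l
    """
    m = A_prev.shape[1]                                      # number of examples
    dZ_l = dA_l * (Z_l > 0)                                  # dz[l] = da[l] * g'(z[l]) for ReLU
    dW_l = (1.0 / m) * dZ_l @ A_prev.T                       # gradient w.r.t. w[l]
    db_l = (1.0 / m) * np.sum(dZ_l, axis=1, keepdims=True)   # gradient w.r.t. b[l]
    dA_prev = W_l.T @ dZ_l                                   # da[l-1], handed to the previous layer
    return dZ_l, dW_l, db_l, dA_prev
```

Notice that the step both consumes da[l] and produces da[l-1]: the returned dA_prev is exactly the input that the previous layer’s backward step needs, which is why da has to be carried through every layer.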

Hope you got it.

Regards
Muhammad John Abbas

Hi @Marcia_Ma @Muhammad_John_Abbas @balaji.ambresh

Welcome @Marcia_Ma to the Community!

dA^{[l]} has to be found in order to calculate dZ^{[l]}, dW^{[l]}, and db^{[l]}, unless we use the brief way, i.e. substitute the chain rule so that these variables are calculated directly.

The equations in this image

are only for the last layer, to calculate dZ^{[l]}, dW^{[l]}, and db^{[l]}, and the activation function there is softmax. If it isn’t softmax, these equations will change, because they are the brief (already simplified) form.

The original equations for calculating dZ^{[l]}, dW^{[l]}, and db^{[l]}, and where they come from, are:

dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}}
dW^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}
db^{[l]} = \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial b^{[l]}}

That’s called the chain rule of derivatives; by substituting in this way we remove the separate step of calculating dA^{[l]}.
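In the same informal notation, the chain rule also gives the quantity that is passed back to the previous layer, i.e. the dA^{[l-1]} discussed earlier in this thread (it follows from Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}):

dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]}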

But if the last layer isn’t softmax, we should change these equations according to the activation function. For example, if the last-layer activation function is sigmoid, we first calculate

dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} = \left[ -\frac{y^{(1)}}{a^{(1)}} + \frac{1-y^{(1)}}{1-a^{(1)}}, \; \dots, \; -\frac{y^{(m)}}{a^{(m)}} + \frac{1-y^{(m)}}{1-a^{(m)}} \right]

and then calculate dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} according to dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}}. For the sigmoid function the second factor is the sigmoid derivative \frac{\partial A^{[L]}}{\partial Z^{[L]}} = A^{[L]}(1-A^{[L]}) (elementwise for every example), so dZ^{[L]} = dA^{[L]} \cdot A^{[L]}(1-A^{[L]}).
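Putting those two pieces together for a sigmoid output with the binary cross-entropy loss, the expression simplifies neatly, which is exactly the kind of shortcut ("brief way") mentioned above:

dZ^{[L]} = dA^{[L]} \cdot A^{[L]}\left(1-A^{[L]}\right) = \left(-\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}\right) \cdot A^{[L]}\left(1-A^{[L]}\right) = A^{[L]} - Y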

So, concretely, we must have dA^{[l]} for each layer (except layer 0, the input) to get all of dZ^{[l]}, dW^{[l]}, and db^{[l]} by the chain rule, unless we fold that step into the equations as shown above.

Cheers,
Abdelrahman