C1W4 (deep learning spec) Why does dA[l] sum all the da[l](i)s in the last layer, rather than stacking them as a vector?


I am referring to the equation at the 8-minute mark of the video (Forward and Backward Propagation): in the vectorized implementation, why do we sum all the da^[l](i) terms into a scalar when computing dA^[l] in the last layer?
Here i ranges over all training examples.

I have not thought about this deeply, so forgive me if I am wrong.

Thank you!

Hi @whitecode

Good question! In the vectorized implementation, dA^{[L]} is not actually collapsed to a scalar: the per-example terms da^{[L](i)} are stacked as the columns of the matrix dA^{[L]}, one column per training example. The summation over examples that you see appears later, when computing dW^{[l]} = (1/m) dZ^{[l]} A^{[l-1]T} and db^{[l]} = (1/m) Σᵢ dz^{[l](i)}. Because the cost J is the average of the per-example losses, its gradient is the average of the per-example gradients: the matrix product dZ^{[l]} A^{[l-1]T} sums the contribution from every example in the batch, and the 1/m factor turns that sum into the batch average used in batch gradient descent.
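A small NumPy sketch may make this concrete. The shapes and variable names below are hypothetical placeholders (not from the video), but the computation mirrors the course's convention of one column per training example: the vectorized dW equals the explicit average of the per-example outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_prev, n_l = 5, 3, 2  # hypothetical sizes: batch, units in layer l-1, units in layer l

A_prev = rng.standard_normal((n_prev, m))  # A^[l-1]: one column per example
dZ = rng.standard_normal((n_l, m))         # dZ^[l]: one column per example (not summed!)

# Vectorized gradient: the matrix product sums over the example axis,
# and the 1/m factor turns that sum into the batch average.
dW_vec = (1.0 / m) * dZ @ A_prev.T

# Same quantity, computed one example at a time and averaged explicitly.
dW_loop = np.mean(
    [np.outer(dZ[:, i], A_prev[:, i]) for i in range(m)], axis=0
)

print(np.allclose(dW_vec, dW_loop))  # the two agree
```

So the per-example gradients stay separate (as columns) all the way through backprop; only dW and db combine them, which is exactly the batch averaging described above.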

Hope this helps, feel free to ask if you need further assistance!
