C1W4 (deep learning spec) Why does dA[l] sum all the da[l](i) in the last layer instead of stacking them as a vector?


I am referring to the equation at the 8-minute mark of the video (forward and backward propagation). Why, in the vectorized implementation, are all the da[l](i) summed into a scalar when computing dA[l] in the last layer?
Here i indexes the training examples.

Hi @whitecode

The da^{[l]}(i) terms from all training examples are summed in the last layer because batch gradient descent averages the gradients over the entire batch of training examples, so every example contributes to the parameter update.
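
To make the shapes concrete, here is a minimal NumPy sketch. It assumes a single sigmoid output unit with the binary cross-entropy loss; the toy data and the variable names (A_prev, AL, dAL and so on) are illustrative and not taken from the video. In this sketch the per-example values da(i) sit side by side as the columns of dAL, and the sum over examples together with the 1/m factor only shows up when dW and db are computed.

```python
import numpy as np

# Hypothetical toy data: m = 4 examples, 3 activations feeding the output unit.
rng = np.random.default_rng(0)
m = 4
A_prev = rng.standard_normal((3, m))      # activations of layer L-1, shape (3, m)
Y = np.array([[1.0, 0.0, 1.0, 0.0]])      # labels, shape (1, m)
W = rng.standard_normal((1, 3)) * 0.1     # output-layer weights, shape (1, 3)
b = np.zeros((1, 1))                      # output-layer bias

# Forward pass through the sigmoid output layer.
Z = W @ A_prev + b
AL = 1.0 / (1.0 + np.exp(-Z))             # shape (1, m)

# Per-example derivative of the loss w.r.t. the activation:
# one column per training example, nothing is summed here.
dAL = -(Y / AL) + (1.0 - Y) / (1.0 - AL)  # shape (1, m)

# Sigmoid backward step, still one column per example (equals AL - Y).
dZ = dAL * AL * (1.0 - AL)

# The sum over examples and the 1/m average appear in the parameter gradients.
dW = (dZ @ A_prev.T) / m                  # shape (1, 3)
db = np.sum(dZ, axis=1, keepdims=True) / m  # shape (1, 1)

print(dAL.shape, dZ.shape, dW.shape, db.shape)
```

Under these assumptions dAL and dZ keep one column per example, while dW and db have the same shapes as W and b because the matrix product and np.sum collapse the example axis.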

