Shape of the bias b and db

In the video it is said that for m examples the shape of the bias b is n[l] × 1, but because of broadcasting it effectively becomes n[l] × m.
The shapes of b and db are supposed to be the same, yet when we compute db it comes out as n[l] × m and is then averaged down to n[l] × 1. So in the case of b we copy the same bias value for every example by broadcasting, but in db we take an average. Why the difference?
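Here is a small numpy sketch of what I mean (the layer sizes and names are made up, just for illustration):

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5          # layer size, previous layer size, number of examples
W = np.random.randn(n_l, n_prev)  # weights: (n_l, n_prev)
b = np.random.randn(n_l, 1)       # bias: (n_l, 1)
A_prev = np.random.randn(n_prev, m)

Z = W @ A_prev + b                # b is broadcast across the m columns
print(Z.shape)                    # (4, 5) -> as if b were copied into every column
```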

Hi @Rupesh. Thank you for this great question. b is the vector of biases in a layer. You can also think of it as an extra column of W that multiplies an additional element of the vector x whose value is 1. db (and dW) are the updates to the bias vector and the weight matrix. For a batch of samples we have one update, which is the average of the updates computed for each sample in the batch. This is why db and dW are reduced along the sample dimension. Once db is computed, you add it to b, which is the same b that was broadcast to every example in the batch (same for dW and W).
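If it helps, here is a minimal numpy sketch of both points (the shapes and names dZ, A_prev are just placeholders for this example):

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5
W = np.random.randn(n_l, n_prev)
b = np.random.randn(n_l, 1)
A_prev = np.random.randn(n_prev, m)

# View of b as an extra column of W multiplying a constant 1 appended to x:
W_aug = np.hstack([W, b])                      # (n_l, n_prev + 1)
A_aug = np.vstack([A_prev, np.ones((1, m))])   # (n_prev + 1, m)
assert np.allclose(W_aug @ A_aug, W @ A_prev + b)

# One update per batch: average the per-example gradients over the sample axis
dZ = np.random.randn(n_l, m)                   # pretend upstream gradient
db = dZ.sum(axis=1, keepdims=True) / m         # (n_l, 1), same shape as b
dW = dZ @ A_prev.T / m                         # (n_l, n_prev), same shape as W
```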
I hope my explanation helped you understand. Please comment with more questions if something is not clear.


I get that during forward propagation we are simply broadcasting b into m identical columns, so every column is the same. But during back propagation, instead of taking a single column we take the average, because for every example the derivative of the loss with respect to the bias is different. Am I right?

I am not sure I understood your description of the process. With Wx+b you get a tensor of size (m, h, w, c). Since b is a vector, internally it is duplicated to match the size of Wx in the summation.
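For example (made-up sizes, just to show the broadcasting):

```python
import numpy as np

m, h, w, c = 2, 4, 4, 3
WX = np.random.randn(m, h, w, c)   # output of the Wx part
b = np.random.randn(c)             # one bias per channel

Z = WX + b                         # b is duplicated across the m, h and w dimensions
print(Z.shape)                     # (2, 4, 4, 3)
```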

When calculating gradients for back propagation, you reduce by averaging along the sample dimension. The reason is that you want a single update for a batch of samples. This also means you get more smoothing when you increase the batch size, and less when you decrease it. I am not certain this is what you meant in your description, but please correct me if I am wrong.
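A quick illustration of that smoothing effect with hypothetical numbers: averaging per-example gradients over a bigger batch gives a less noisy update.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
# Per-example gradients = true gradient + noise (purely synthetic data)
per_example = true_grad + rng.normal(scale=2.0, size=10_000)

for batch_size in (8, 64, 512):
    usable = per_example[: (len(per_example) // batch_size) * batch_size]
    batch_grads = usable.reshape(-1, batch_size).mean(axis=1)  # one averaged update per batch
    print(batch_size, batch_grads.std())   # the spread of updates shrinks as batch size grows
```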

Thank you, I got it.