In the video it is said that for m examples the shape of the bias is (n_l, 1), but because of broadcasting it becomes (n_l, m).

The shapes of `b` and `db` should be the same, but when we compute `db` its shape is (n_l, m), and then an average is taken to convert it to (n_l, 1). So in the case of `b` we copy the value of the bias for every example by broadcasting, but in `db` we take the average?

Hi @Rupesh. Thank you for this great question. `b` is a vector of biases in a layer. You can also consider it as another column in `W` that multiplies an additional element in the vector `x` with a value of 1. `db` (and `dW`) are the updates to the bias vector and weight matrix. For a batch of samples we have one update, which is the average of the updates computed from each sample in the batch. This is why `db` and `dW` are reduced along the direction of samples. Once `db` is calculated, the same value is applied to the `b` that is used for every sample in the batch (same for `W`).
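Here is a small numpy sketch (the variable names are illustrative, not taken from the course notebooks) of the "extra column" view mentioned above: appending `b` as one more column of `W` and a constant 1 to `x` gives the same result as `Wx + b`.

```python
import numpy as np

# Minimal sketch: b behaves like one more column of W that multiplies
# an extra input element fixed at 1.
n_l, n_prev = 4, 3                         # units in this layer / previous layer
rng = np.random.default_rng(0)
W = rng.standard_normal((n_l, n_prev))
b = rng.standard_normal((n_l, 1))
x = rng.standard_normal((n_prev, 1))

z = W @ x + b                              # usual affine step
W_aug = np.hstack([W, b])                  # [W | b], shape (n_l, n_prev + 1)
x_aug = np.vstack([x, np.ones((1, 1))])    # x with a constant 1 appended
z_aug = W_aug @ x_aug

print(np.allclose(z, z_aug))               # True
```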

I hope my explanation helped you understand. Please comment with more questions if something is not clear.

I got that during forward propagation we are simply broadcasting it into m column vectors, so every column is the same. But when we back propagate, instead of taking a single column we take the average, because for every example the derivative of the loss function with respect to the bias would be different. Am I right?

I am not sure I understood your description of the process. With `Wx+b` you get a tensor of size (m, h, w, c). Since `b` is a vector, internally it is duplicated to match the size of `Wx` in the summation.

When calculating gradients for back propagation you reduce by averaging along the direction of samples. The reason is that you want a single update for a batch of samples. This gives greater smoothing when you increase the batch size, and vice versa when you decrease it. I am not certain this is what you meant in your description, but please correct me if I am wrong.
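A short numpy sketch of the shapes involved may help (this uses a plain fully connected layer, not the exact notation from the course): in the forward pass `b` of shape (n_l, 1) is broadcast across the m columns of `W @ X`, and in the backward pass the per-example gradients `dZ` of shape (n_l, m) are averaged over the sample axis to give a single `db` of shape (n_l, 1), the one update for the whole batch.

```python
import numpy as np

n_l, n_prev, m = 4, 3, 5
rng = np.random.default_rng(1)
W = rng.standard_normal((n_l, n_prev))
b = rng.standard_normal((n_l, 1))
X = rng.standard_normal((n_prev, m))

Z = W @ X + b                              # b is broadcast to shape (n_l, m)
dZ = rng.standard_normal(Z.shape)          # stand-in for the upstream gradient

db = np.mean(dZ, axis=1, keepdims=True)    # average over the m examples -> (n_l, 1)
dW = (dZ @ X.T) / m                        # likewise averaged over the batch

print(Z.shape, db.shape, dW.shape)         # (4, 5) (4, 1) (4, 3)
```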

Thank you, I got that.