Can I ask why the vectorized implementation of dw = dz * a.T has a factor of 1/m? The formula on the left-hand side does not have 1/m, but the vectorized version does. Is there anything I missed? The two sides do not seem equivalent.

Both are correct. The 1/m is just a normalization: the gradient summed over all samples is divided by the total number of samples.
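To make the relationship concrete, here is a minimal sketch. It assumes the usual setup where the cost is the average of the per-example losses, so the left-hand formula is read as the gradient for a single example and the vectorized one as the average over all m examples. The layer sizes, the random data, and the names A, dZ, dW_loop, and dW_vec are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_out, m = 3, 2, 5             # hypothetical layer sizes and number of samples
A = rng.standard_normal((n_prev, m))   # activations, one column per example
dZ = rng.standard_normal((n_out, m))   # dz for each example, one column per example

# Per-example formula: dw_i = dz_i * a_i.T has no 1/m.
# Averaging the m per-example gradients is what introduces the 1/m.
dW_loop = np.zeros((n_out, n_prev))
for i in range(m):
    dW_loop += np.outer(dZ[:, i], A[:, i])
dW_loop /= m

# Vectorized formula: the matrix product already sums over the m columns,
# so dividing by m turns that sum into the same average.
dW_vec = (dZ @ A.T) / m

print(np.allclose(dW_loop, dW_vec))    # True
```

Under that assumption the two formulas describe different things (one example versus the whole batch), which is why they do not look identical.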

Thanks for the reply. So does it mean they are both "usable" as derivatives for backpropagation, even though they are not exactly the same?

Yes.

But the main point of using 1/m is to keep the numbers small, which makes the later calculations easier. You can try without the averaging term, but large numbers → large calculations → difficult to do → more time, resources, and memory required.
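One way to check what dropping the averaging term does is the sketch below. The data, the learning rate lr, and the variable names are assumptions chosen for illustration. It shows that the summed and averaged gradients differ only by the factor m, so leaving out 1/m behaves like multiplying the learning rate by m.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 1000))     # hypothetical activations, 1000 examples
dZ = rng.standard_normal((2, 1000))    # hypothetical dz values
m = A.shape[1]

dW_sum = dZ @ A.T                      # no averaging term
dW_mean = dW_sum / m                   # with the 1/m factor

lr = 0.01                              # assumed learning rate
step_mean = lr * dW_mean               # update using the averaged gradient
step_sum = (lr / m) * dW_sum           # same update from the summed gradient, lr scaled by 1/m

print(np.allclose(step_mean, step_sum))  # True
```

With the 1/m in place, the size of the update does not grow with the number of samples, so the same learning rate can be reused as the dataset size changes.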
