Hello everyone,
I would like to ask: in this lecture at 05:05, shouldn't the dw
computation be
dw = \frac{1}{m} dz^T X to be in line with the computation on the next line? Or should it be dw = \frac{1}{m} X^T dz^T?
If either of the above is correct, could you please explain the computation?
To understand this, you need to carefully track the shapes of the various objects, which are shown by the way the matrices and vectors are drawn out on that slide. The first thing to be careful about is that the left side of the whiteboard handles one sample at a time (it uses lower case z), while the right side handles all the samples at once (vectorized, which is the whole point here) and uses capital Z.
Here are the dimensions:
X is n_x x m, where n_x is the number of features and m is the number of samples.
dZ is the gradient of the cost with respect to Z, so it is 1 x m.
dw is the gradient of the cost with respect to w, so it has the same dimensions as w, which are n_x x 1.
The gradient formula as Prof Ng gives it is this:
dw = \displaystyle \frac {1}{m} X \cdot dZ^T
Note that I added the “dot product” operator there just to be clear. The notation Prof Ng uses is that he just writes the operands adjacent with no explicit operator when he means “dot product” style multiply. When he wants to write “elementwise” multiply, he uses “*” to indicate that.
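If it helps, here is a quick numpy sketch of the two kinds of multiplication. The array values and shapes are just made up for illustration; they are not from the lecture:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # shape (2, 3)
B = np.array([[1.], [2.], [3.]])  # shape (3, 1)

# "Adjacent operands" in the lecture notation = matrix (dot) product
dot_result = np.dot(A, B)         # shape (2, 1)

# "*" in the lecture notation = elementwise product (shapes must match or broadcast)
elementwise = A * A               # shape (2, 3), each entry squared

print(dot_result.shape, elementwise.shape)
```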
So dZ^T will be m x 1, and the dimensional analysis of Prof Ng's formula is:
n_x x m dotted with m x 1 gives an n_x x 1 result, which is the correct shape for dw.
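Here is a small numpy sketch of that dimensional analysis, using made-up sizes (n_x = 4, m = 10) rather than anything from the lecture:

```python
import numpy as np

n_x, m = 4, 10                 # toy sizes, just to check the shapes

X = np.random.randn(n_x, m)    # features stacked column-wise: (n_x, m)
dZ = np.random.randn(1, m)     # gradient of the cost w.r.t. Z: (1, m)

dw = (1 / m) * np.dot(X, dZ.T)  # (n_x, m) . (m, 1) -> (n_x, 1)
print(dw.shape)                 # (4, 1), the same shape as w
```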
If you try your versions, the dimensions do not match for a dot product:
dZ^T \cdot X would be m x 1 dotted with n_x x m. That doesn’t work.
X^T \cdot dZ^T would be m x n_x dotted with m x 1 and that doesn’t work either.
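And a sketch showing that numpy rejects both of the proposed orderings with the same toy shapes:

```python
import numpy as np

n_x, m = 4, 10
X = np.random.randn(n_x, m)
dZ = np.random.randn(1, m)

# dZ.T is (m, 1) and X is (n_x, m): inner dimensions 1 and n_x don't match
try:
    np.dot(dZ.T, X)
except ValueError as e:
    print("dZ^T . X fails:", e)

# X.T is (m, n_x) and dZ.T is (m, 1): inner dimensions n_x and m don't match
try:
    np.dot(X.T, dZ.T)
except ValueError as e:
    print("X^T . dZ^T fails:", e)
```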
Thank you for your reply, sir. I agree with everything you have written. I worked it out by hand and reached the same conclusion.