Prof Ng mentioned that dw is an n by 1 vector. However, he arranged it such that it’s 1/m [x[1]dz[1]…x[m]dz[m]]. Isn’t this a 1 by m vector?

The notation may be a bit confusing there, but look again at his explanation and then the formula with this in mind:

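The gradient with respect to w (that is, dw) is the average of the gradients on each of the m samples. That’s what the factor of 1/m and the sum are about there. The bracketed expression is a sum, x[1]dz[1] + … + x[m]dz[m], not a list of m entries side by side. Each term is the n x 1 vector x[i] scaled by the scalar dz[i], so each of those per-sample gradients is an n x 1 vector. If you add up a set of vectors of that shape, the result is also a vector of the same shape.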
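
If it helps to see the shapes concretely, here is a small NumPy sketch. The sizes n = 3 and m = 4 and the random values are made up for illustration, and the names X, dZ, dw_loop and dw_vec are my own, assuming the usual course layout where X is n x m (one sample per column) and dZ is 1 x m.

```python
import numpy as np

n, m = 3, 4  # illustrative sizes: n features, m samples
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))   # column i is the sample x[i], an n x 1 vector
dZ = rng.standard_normal((1, m))  # entry i is the scalar dz[i]

# Add up the per-sample gradients x[i] * dz[i]; every term has shape (n, 1).
dw_loop = np.zeros((n, 1))
for i in range(m):
    dw_loop += X[:, i:i + 1] * dZ[0, i]
dw_loop /= m  # average over the m samples

# Same quantity written as a matrix product: (n, m) @ (m, 1) -> (n, 1)
dw_vec = X @ dZ.T / m

print(dw_loop.shape, dw_vec.shape)   # (3, 1) (3, 1)
print(np.allclose(dw_loop, dw_vec))  # True
```

The loop and the matrix product give the same n x 1 result, which is why the sum inside the brackets does not turn dw into a 1 x m row.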