The dimensions of dW

I have a question. In Week 2, Prof. Andrew used the following formula for dw:
dw = X * dZ.T (which has dimension n × 1), i.e. dim(dw) = transpose(dim(w)),
while in Week 3, for a layer with multiple neurons, he used
dW = dZ * X.T (which has dimension L × n), i.e. dim(dW) = dim(W).
Why is that the case?
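For reference, here is a quick numpy shape check of the two formulas (made-up sizes, not the course code):

```python
# Quick shape check of the two formulas (made-up sizes: n = 4 features,
# m = 10 samples, 3 units in the Week 3 layer). Purely illustrative.
import numpy as np

X  = np.random.randn(4, 10)
dZ = np.random.randn(1, 10)           # Week 2: a single output unit
dw = np.dot(X, dZ.T)                  # (4, 10) . (10, 1) -> (4, 1), like w

dZ_layer = np.random.randn(3, 10)     # Week 3: 3 units in the layer
dW = np.dot(dZ_layer, X.T)            # (3, 10) . (10, 4) -> (3, 4), like W

print(dw.shape, dW.shape)             # (4, 1) (3, 4)
```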


The shape of the gradient of an object is always the same as the shape of the base object. In the case of Logistic Regression, the weight vector w is formatted as a column vector. That is just a choice that Prof Ng made, and the result is that both w and dw have dimensions (n, 1). When we get to full Neural Networks, Prof Ng chooses to orient the weight matrices W^{[l]} so that the transpose is not required in the forward propagation formula. That is explained in more detail in this thread.
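Just to spell out the shape arithmetic that those formulas imply (with m training samples): for Logistic Regression, dw = X \cdot dZ^T has shape (n, m) \times (m, 1) = (n, 1), exactly the shape of w. For the first hidden layer of a network, dW^{[1]} = dZ^{[1]} \cdot X^T has shape (n^{[1]}, m) \times (m, n_x) = (n^{[1]}, n_x), exactly the shape of W^{[1]}.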


Thanks very much, I got it now.
I tried to write out the shapes by hand and noticed that in logistic regression Prof Andrew used z = w.T * X + b for forward propagation but used w itself when updating the weights:
w = w - alpha * dw
while in the neural network case the shape of W is already defined as (size of neuron layer, number of features), i.e. effectively the transpose, so the forward propagation formula is Z = W * X + b,
and the update is W = W - alpha * dW. That's why he used the formula
dW = dZ * X.T
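A minimal numpy sketch of that observation (arbitrary sizes, not the assignment code): the gradient always has the same shape as the parameter, so the update step looks identical in both cases.

```python
# Sketch: forward prop needs w.T in logistic regression but not for a layer,
# while the update step works the same way in both (made-up sizes).
import numpy as np

n_x, n_1, m, alpha = 4, 3, 10, 0.01
X = np.random.randn(n_x, m)

# Logistic regression: forward prop uses w.T, but dw is (n_x, 1) just like w
w  = np.zeros((n_x, 1))
z  = np.dot(w.T, X)                   # (1, m)
dz = np.random.randn(1, m)            # stand-in for the real dz
dw = np.dot(X, dz.T) / m              # (n_x, 1)
w  = w - alpha * dw                   # shapes match

# Hidden layer: W is already (n_1, n_x), so no transpose anywhere
W  = np.zeros((n_1, n_x))
Z  = np.dot(W, X)                     # (n_1, m)
dZ = np.random.randn(n_1, m)          # stand-in for the real dZ
dW = np.dot(dZ, X.T) / m              # (n_1, n_x)
W  = W - alpha * dW                   # shapes match
```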

Great observation!

I found this tripped me up in the graded assignment of Week 3, and how.

However, I have found that the convention does not actually change through the course; it's just that I didn't pay enough attention.

So there are two ways of writing the matrix of weights W^{[l]}:

One we could call COLUMNWISE. It demands transposition of W^{[l]} before matrix-multiplication on the left with a^{[l-1]}: z^{[l]} = W^{[l]T} \cdot a^{[l-1]} + b^{[l]}.

One we could call ROWWISE. It uses W^{[l]} directly in matrix-multiplication on the left with a^{[l-1]}: z^{[l]} = W^{[l]} \cdot a^{[l-1]} + b^{[l]}.

ROWWISE is the convention used in the course.

Checking through the lectures, unless I am mistaken, I have only found the ROWWISE convention. The only use of transposition is for the single "column weight vector" of the output layer, i.e. we see w^{[l]T} \cdot a^{[l-1]} = z^{[l]} only, which is correct.
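If it helps, here is a tiny numpy check that the two layouts give the same z^{[l]} when one is the transpose of the other (random values, purely illustrative):

```python
# Sketch: COLUMNWISE and ROWWISE storage of the weights give the same z^{[l]}.
import numpy as np

n_prev, n_l = 4, 3
a_prev = np.random.randn(n_prev, 1)
b = np.random.randn(n_l, 1)

W_col = np.random.randn(n_prev, n_l)   # COLUMNWISE: one column per unit
W_row = W_col.T                        # ROWWISE: one row per unit (course convention)

z_col = np.dot(W_col.T, a_prev) + b    # needs the transpose
z_row = np.dot(W_row, a_prev) + b      # used directly

print(np.allclose(z_col, z_row))       # True
```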

Maybe the convention should be highlighted when the W matrix is introduced, which seems to be at "Week 3: Shallow Neural Networks / Neural Networks Overview".


You are right that the convention used everywhere here for real Neural Networks is that the W^{[l]} matrices are arranged "row-wise" in your terminology. The one exception is in the case of Logistic Regression: in that case the weights w are a single vector and Prof Ng chooses to use the convention that any standalone vector will be formatted as a column vector. That was a completely arbitrary choice in this case, but it means that for LR the "linear" activation is:

Z = w^T \cdot X + b

In every other case involving a layer of a real NN, it is:

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}

Of course the convention is that A^{[0]} = X, the input sample matrix, which is arranged with the samples as the columns of X.
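As a minimal sketch of that last formula with A^{[0]} = X (made-up layer sizes, samples as the columns of X):

```python
# Sketch of Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} for the first layer, with A^{[0]} = X.
import numpy as np

n_x, n_1, m = 4, 3, 10
A0 = np.random.randn(n_x, m)              # A^{[0]} = X, one sample per column
W1 = np.random.randn(n_1, n_x)            # row-wise layout, no transpose needed
b1 = np.random.randn(n_1, 1)              # broadcasts across the m columns

Z1 = np.dot(W1, A0) + b1
print(Z1.shape)                           # (3, 10): n^{[1]} rows, one column per sample
```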
