Week 2 has z = w^T x + b (see the notes on Logistic Regression Cost Function). This is also the case in the Week 3 Neural Networks Overview.

However, Week 4 shows z = Wx + b (see Vectorized Implementation and other places in Week 4).

Why the difference?


The parameters of one neuron in one of the layers are (w_i^T, b).

He generalized the function by vectorizing all of the w_i^T parameters into a single matrix W.

The matrix W contains all of the parameters w_i^T of a single layer.

W = (w_1^T; w_2^T; w_3^T; …; w_n^T), with each w_i^T as a row. Vectorizing b is unnecessary because B will be equivalent to:

B = (b, b, …, b); you add the same b to every neuron's w_i^T x + b.

So in general, W x + b generalizes all of the equations w_i^T x + b.
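As a sketch (with made-up sizes, not from the course notebooks), stacking the per-neuron row vectors w_i^T into W means each row of W holds one neuron's weights, so a single matrix product computes every neuron's output at once:

```python
import numpy as np

n_x = 4                        # hypothetical number of input features
w1 = np.random.randn(n_x, 1)   # column weight vector of neuron 1
w2 = np.random.randn(n_x, 1)   # neuron 2
w3 = np.random.randn(n_x, 1)   # neuron 3

# Stack the transposed (row) vectors: W has one row per neuron.
W = np.vstack([w1.T, w2.T, w3.T])   # shape (3, n_x)

x = np.random.randn(n_x, 1)         # one input sample
b = 0.5                             # same scalar bias for each neuron

# W @ x + b computes every neuron's w_i^T x + b in one shot.
z = W @ x + b                       # shape (3, 1)

# Same result as computing neuron 1 separately:
z1 = w1.T @ x + b
assert np.allclose(z[0], z1)
```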

This is a common convention in linear algebra.

I understand the vectorization of the equation. What I do not understand is why we use w^T in the calculation for the single sample and W in the vectorized version. I would have expected w^T or W to be used for both. My question is about the move from w^T to W.

The definitions of the weights are different in the two cases:

For Logistic Regression, the weights are a vector w with the same dimension as each input sample. It is a choice, but Prof Ng chooses to define all vectors as column vectors. So if we have a vector w of dimension n_x x 1 and a vector x of dimension n_x x 1 and we want to compute:

z = \displaystyle \sum_{i = 1}^{n_x} w_i * x_i + b

as a vector computation, it requires a transpose to get the dot product to work:

z = w^T \cdot x + b

Dotting 1 x n_x with n_x x 1 gives a 1 x 1 or scalar output, which is what we want (for a single sample input).
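A quick NumPy sketch of the single-sample logistic regression case (illustrative sizes only): the transpose turns the column vector w into a 1 x n_x row so the dot product with the n_x x 1 sample x yields a 1 x 1 result.

```python
import numpy as np

n_x = 4                        # hypothetical number of input features
w = np.random.randn(n_x, 1)    # weights as a column vector, per Prof Ng's convention
x = np.random.randn(n_x, 1)    # one input sample, also a column vector
b = 0.5                        # scalar bias

# (1, n_x) dot (n_x, 1) -> (1, 1), effectively a scalar
z = np.dot(w.T, x) + b
print(z.shape)  # (1, 1)
```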

Once we graduate to real neural networks in Week 3, he gets to redefine things. The weights are now a matrix, because we have a separate weight vector for each output neuron of the layer. He could have chosen to define the W matrix such that a transpose is required, but why make things more messy? Here's a thread which discusses the portion of the lecture that explains the structure of the W matrix in Week 3. With Prof Ng's new definition of the weight matrix, the linear activation becomes:

z = W \cdot x + b

for a single input sample x. Note that b, the bias term, is now a vector, not a scalar, with one value per output neuron of the layer.
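In NumPy terms (a sketch with hypothetical layer sizes), the Week 4 layer computation looks like this; because each neuron's weights are already a row of W, no transpose is needed:

```python
import numpy as np

n_x = 4        # hypothetical input features
n_units = 3    # hypothetical neurons in this layer

W = np.random.randn(n_units, n_x)   # one row of weights per output neuron
x = np.random.randn(n_x, 1)         # a single input sample (column vector)
b = np.random.randn(n_units, 1)     # bias is now a vector: one entry per neuron

# (n_units, n_x) dot (n_x, 1) -> (n_units, 1); b adds elementwise
z = np.dot(W, x) + b
print(z.shape)  # (3, 1)
```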

It sounds a bit arbitrary. I prefer the latter, but why confuse the matter by being inconsistent? Why not use the latter from the start?

Chuck Walsh

I explained the rationale, but you are of course entitled to your own opinion on the subject. You're right that it's arbitrary, but notation is always arbitrary. Sorry, but Prof Ng is the teacher here, so he gets to be the arbiter and we just have to read what he says and "deal with it".