Week 2 has z = w^T x + b (see the notes on Logistic Regression Cost Function). This is also the case in the Week 3 Neural Networks Overview.

However, Week 4 shows z = Wx + b (see Vectorized Implementation and other places in Week 4).

Why the difference?


The parameters of one neuron in one of the layers are (w_i^T, b).

He generalized the function by vectorizing all of the w_i^T parameters into a single matrix W.

The matrix W contains all of the parameters w_i^T of a single layer.

W = (w_1^T; w_2^T; w_3^T; …; w_n^T), with each w_i^T as a row. Vectorizing b is unnecessary because B will be equivalent to:

B = (b, b, …, b); you add the same b to every neuron's w_i^T x + b.

So in general, W x + b generalizes all of the equations w_i^T x + b.
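As a sketch (with made-up sizes, not from the course notebooks), stacking the per-neuron row vectors w_i^T into W means each row of W holds one neuron's weights, so a single matrix product computes every neuron's output at once:

```python
import numpy as np

n_x = 4                        # hypothetical number of input features
w1 = np.random.randn(n_x, 1)   # column weight vector of neuron 1
w2 = np.random.randn(n_x, 1)   # neuron 2
w3 = np.random.randn(n_x, 1)   # neuron 3

# Stack the transposed (row) vectors: W has one row per neuron.
W = np.vstack([w1.T, w2.T, w3.T])   # shape (3, n_x)

x = np.random.randn(n_x, 1)         # one input sample
b = 0.5                             # same scalar bias for each neuron

# W @ x + b computes every neuron's w_i^T x + b in one shot.
z = W @ x + b                       # shape (3, 1)

# Same result as computing neuron 1 separately:
z1 = w1.T @ x + b
assert np.allclose(z[0], z1)
```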

This is a common convention in linear algebra.

I understand the vectorization of the equation. What I do not understand is why we use w^T in the calculation for the single sample and W in the vectorized version. I would have expected w^T or W to be used for both. My question is about the move from w^T to W.

The definitions of the weights are different in the two cases:

For Logistic Regression, the weights are a vector w with the same dimension as each input sample. It is a choice, but Prof Ng chooses to define all vectors as column vectors. So if we have a vector w of dimension n_x x 1 and a vector x of dimension n_x x 1 and we want to compute:

z = \displaystyle \sum_{i = 1}^{n_x} w_i * x_i + b

as a vector computation, it requires a transpose to get the dot product to work:

z = w^T \cdot x + b

Dotting 1 x n_x with n_x x 1 gives a 1 x 1 or scalar output, which is what we want (for a single sample input).
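A quick NumPy sketch of the single-sample logistic regression case (illustrative sizes only): the transpose turns the column vector w into a 1 x n_x row so the dot product with the n_x x 1 sample x yields a 1 x 1 result.

```python
import numpy as np

n_x = 4                        # hypothetical number of input features
w = np.random.randn(n_x, 1)    # weights as a column vector, per Prof Ng's convention
x = np.random.randn(n_x, 1)    # one input sample, also a column vector
b = 0.5                        # scalar bias

# (1, n_x) dot (n_x, 1) -> (1, 1), effectively a scalar
z = np.dot(w.T, x) + b
print(z.shape)  # (1, 1)
```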

Once we graduate to real neural networks in Week 3, he gets to redefine things. The weights are now a matrix, because we have a separate weight vector for each output neuron of the layer. He could have chosen to define the W matrix such that a transpose is required, but why make things more messy? Here's a thread which discusses the portion of the lecture that explains the structure of the W matrix in Week 3. With Prof Ng's new definition of the weight matrix, the linear activation becomes:

z = W \cdot x + b

for a single input sample x. Note that b, the bias term, is now a vector, not a scalar, with one value per output neuron of the layer.
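In NumPy terms (a sketch with hypothetical layer sizes), the Week 4 layer computation looks like this; because each neuron's weights are already a row of W, no transpose is needed:

```python
import numpy as np

n_x = 4        # hypothetical input features
n_units = 3    # hypothetical neurons in this layer

W = np.random.randn(n_units, n_x)   # one row of weights per output neuron
x = np.random.randn(n_x, 1)         # a single input sample (column vector)
b = np.random.randn(n_units, 1)     # bias is now a vector: one entry per neuron

# (n_units, n_x) dot (n_x, 1) -> (n_units, 1); b adds elementwise
z = np.dot(W, x) + b
print(z.shape)  # (3, 1)
```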

It sounds a bit arbitrary. I prefer the latter, but why confuse the matter by being inconsistent? Why not use the latter from the start?

Chuck Walsh

I explained the rationale, but you are of course entitled to your own opinion on the subject. You're right that it's arbitrary, but notation is always arbitrary. Sorry, but Prof Ng is the teacher here, so he gets to be the arbiter and we just have to read what he says and "deal with it".