I need some help understanding the model \hat{y} = \sigma(w^T x + b). What does w transpose mean? I understand that x is a feature vector, but why do we use w transpose?

Thanks!

The transpose is needed because Prof Ng chooses the convention that all standalone vectors are formatted as column vectors. So both the weights vector w and the input vector x are n_x x 1 column vectors, where n_x is the number of features (elements) in each input vector.

Then the key is that we need to implement the following mathematical formulas in two steps. First there is the linear combination of w, x and the bias b:

z = \displaystyle \sum_{i = 1}^{n_x} (w_i * x_i) + b

Then we apply the non-linear sigmoid activation function to get the final output of logistic regression:

\hat{y} = \sigma(z)

So then the question is how to express that first linear combination (really “affine” transformation) to compute z using vector operations for efficiency. The easiest way is to write that sum of the products formula as a dot product:

z = w^T \cdot x + b

The way dot products work is that the inner dimensions need to agree. Both w and x are n_x x 1, so if we transpose w, then w^T is a 1 x n_x row vector. If you then dot 1 x n_x with n_x x 1, you end up with a 1 x 1 or scalar result, which is what we want. If you think about what “dot product” means, it is exactly the sum of the products w_i * x_i over each pair of elements in the two vectors, as shown in the math formula above. But we need the transpose in order for the operation to work when the vectors have those dimensions.
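
Here is a quick numpy check of those shapes. It is only a sketch: the values of w, x and b are made up for illustration, with n_x = 3.

```python
import numpy as np

n_x = 3  # number of features (illustrative value)
w = np.array([[0.5], [-1.0], [2.0]])  # weights, n_x x 1 column vector
x = np.array([[1.0], [2.0], [3.0]])   # one input sample, n_x x 1 column vector
b = 0.25                              # scalar bias

# (1 x n_x) dot (n_x x 1) gives a 1 x 1 array
z = np.dot(w.T, x) + b

# The same value written as the explicit sum of products
z_sum = sum(w[i, 0] * x[i, 0] for i in range(n_x)) + b

print(w.T.shape, x.shape, z.shape)
print(z, z_sum)
```

Without the transpose, np.dot(w, x) would raise a shape error, because n_x x 1 and n_x x 1 do not have matching inner dimensions.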

But notice that then we can take one more step in vectorizing by concatenating m input x vectors to make an input matrix X which is now n_x x m (one column for each sample). Now you can compute all the individual \hat{y} values at once by doing this:

Z = w^T \cdot X + b

So we have 1 x n_x dot n_x x m, which gives us a 1 x m output. Then we get:

\hat{Y} = \sigma(Z)
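
The vectorized version can be sketched in numpy as follows. The sizes, the random values, and the helper name sigmoid are assumptions chosen for illustration, not anything fixed by the course.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 4                      # illustrative sizes
rng = np.random.default_rng(0)
w = rng.standard_normal((n_x, 1))  # n_x x 1 weights
X = rng.standard_normal((n_x, m))  # n_x x m, one column per sample
b = 0.1                            # scalar bias

Z = np.dot(w.T, X) + b             # 1 x m
Y_hat = sigmoid(Z)                 # 1 x m, each entry strictly between 0 and 1

# Column j of the vectorized result matches the single-sample computation
j = 2
z_j = np.dot(w.T, X[:, j:j+1]) + b
print(Z.shape, Y_hat.shape)
print(np.allclose(Z[0, j], z_j[0, 0]))
```

The last check confirms that stacking the samples as columns of X gives the same z for each sample as computing them one at a time.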

There is one other thing worth saying here: Prof Ng first shows us Logistic Regression with the idea that you can consider it to be a “trivial” Neural Network that only has an output layer. Next week, he’ll show us how to add more layers to get a real Neural Network. In that case, the weights become matrices W^{[l]} with dimensions n^{[l]} x n^{[l-1]}, where n^{[l]} is the number of output neurons in layer l of the network. There he gets to define the format of the W matrices and, for simplicity, chooses to orient them such that the transpose is no longer required.
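
To see why that orientation removes the transpose, here is a shape-only sketch. The layer sizes and the name A_prev (the previous layer's output, n^{[l-1]} x m) are assumptions for illustration, and the bias is left out to keep the focus on the dimensions.

```python
import numpy as np

n_prev, n_l, m = 3, 2, 4  # illustrative layer sizes and sample count
rng = np.random.default_rng(1)
W = rng.standard_normal((n_l, n_prev))     # n^{[l]} x n^{[l-1]}
A_prev = rng.standard_normal((n_prev, m))  # n^{[l-1]} x m

# (n_l x n_prev) dot (n_prev x m) gives n_l x m, with no transpose needed
Z = np.dot(W, A_prev)
print(Z.shape)

# Each row of W plays the role of one w^T from logistic regression
w0 = W[0, :].reshape(n_prev, 1)
print(np.allclose(Z[0:1, :], np.dot(w0.T, A_prev)))
```

In other words, you can think of each row of W as an already transposed weight vector for one neuron.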

Thank you! So is b in Z (capitalized) formula a 1xm row vector?

No, b (the bias term) is always a scalar in Logistic Regression. That will no longer be true once we get to real Neural Networks in Week 3. Adding a scalar to a 1 x m row vector simply adds the same value to each element of the vector. This is a trivial example of what is called “broadcasting” in numpy. Here’s a thread which gives examples of that.

If a scalar is added to a matrix calculation in an equation, does the scalar need to be added m times?

Adding a scalar to a matrix or vector (or subtracting, multiplying, or dividing by one) means that you perform the operation “elementwise”. The result is a matrix or vector of the same shape with the scalar value added (or whatever the operation is) to each element of the original matrix or vector.

Python is an interactive language. You don’t have to wonder what something does: you can try it and watch what happens.

```
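import numpy as np

np.random.seed(42)  # fixed seed so the output is reproducible
A = np.random.rand(3,4)
print("A = " + str(A))
b = 1.
print("b = " + str(b))
C = A + b  # the scalar is added to every element of A
print("C = " + str(C))
b = -2.
print("b = " + str(b))
D = A * b  # every element of A is multiplied by the scalar
print("D = " + str(D))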
```

Running that gives this result:

```
A = [[0.37454012 0.95071431 0.73199394 0.59865848]
 [0.15601864 0.15599452 0.05808361 0.86617615]
 [0.60111501 0.70807258 0.02058449 0.96990985]]
b = 1.0
C = [[1.37454012 1.95071431 1.73199394 1.59865848]
 [1.15601864 1.15599452 1.05808361 1.86617615]
 [1.60111501 1.70807258 1.02058449 1.96990985]]
b = -2.0
D = [[-0.74908024 -1.90142861 -1.46398788 -1.19731697]
 [-0.31203728 -0.31198904 -0.11616722 -1.73235229]
 [-1.20223002 -1.41614516 -0.04116899 -1.9398197 ]]
```

Hi Paul, thank you for the example. It’s very helpful!