# When to use np.dot

I was able to trial-and-error my way through Exercise 5 in the first week's programming assignment, "Propagate", but I don't understand the notation. I know that np.dot refers to matrix multiplication and the * sign refers to element-wise multiplication.

When calculating the activation, there is sigmoid(w.T*X + b). Here I used np.dot to get the correct results.

For the cost function, there is y*log(a) + (1-y)*log(1-a). Here all the * signs are actual element-wise multiplications; np.dot doesn't work.

How can I know when to use np.dot and when to use *? Isn't there a cleaner notation for this?
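To make the contrast concrete, here is a minimal NumPy sketch of the two operations from my question (the shapes are my assumptions: n features, m samples):

```python
import numpy as np

n, m = 3, 5                      # n features, m samples (assumed shapes)
w = np.random.rand(n, 1)         # weights as a column vector
X = np.random.rand(n, m)         # one sample per column
b = 0.1

# Matrix multiplication: np.dot contracts the shared dimension n
Z = np.dot(w.T, X) + b           # shape (1, m)

# Element-wise multiplication: * needs shapes that broadcast together
y = np.random.rand(1, m)
a = 1 / (1 + np.exp(-Z))         # sigmoid, shape (1, m)
losses = y * np.log(a) + (1 - y) * np.log(1 - a)   # still shape (1, m)

print(Z.shape, losses.shape)
```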

Where in the notebook or the lectures does it ever say sigmoid(w.T*X+b)? According to the notational conventions that Prof Ng uses, that is clearly wrong. As you say, “*” always indicates an element-wise multiply, but a dot product is required there.

I marked some occurrences in red and blue in the image. All use the ‘nothing’ notation, yet as far as I understand they mean different things. I got 100% on the exercise, but I am not sure I am right. It is even more confusing because the site says: “compute cost by using np.dot to perform multiplication” (marked in green), and I only used * for the cost function, not np.dot, so I am basically 100% lost…

Well, I claim those cases are different. The first, w^T X + b, uses the “no operator means dot product” convention. The second is a fundamentally different case: that’s just a math formula, and the quantities there are all scalars, right? So there’s no ambiguity about what it means. It’s pretty clear it’s math, because what does \displaystyle \sum_{i = 1}^{m} mean in python, right?

I grant you that it requires careful attention and he’s kind of “mixing metaphors” a bit here, but still I think it’s pretty clear.

Now the question is how can you express this formula in python:

\displaystyle \sum_{i = 1}^{m} y^{(i)} \log(a^{(i)})

if the inputs are given as 1 x m row vectors? I can think of at least two ways:

1. Elementwise multiply of the two vectors followed by a sum.
2. A dot product, but you have to transpose the second one in order for the dimensions to work and to give you a scalar output.

1 x m dotted with m x 1 gives you a 1 x 1 (scalar) output. If you do the transpose the other way, you get a completely different result, which is just a multiplication table. If you add that up, it’s got no relationship to the correct answer. Here’s a thread with some concrete examples that are relevant.
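The two approaches, and the pitfall with the wrong transpose, can be sketched like this (the 1 x m shapes match the convention above; the data is made up):

```python
import numpy as np

m = 4
y = np.random.rand(1, m)          # labels as a 1 x m row vector
loga = np.random.rand(1, m)       # stands in for log(a), also 1 x m

# Way 1: element-wise multiply, then sum
s1 = np.sum(y * loga)

# Way 2: dot product, transposing the second operand
s2 = np.dot(y, loga.T)            # (1, m) dot (m, 1) -> (1, 1), effectively a scalar

print(np.allclose(s1, s2))        # the two ways agree

# Transposing the wrong way gives an m x m "multiplication table"
table = np.dot(y.T, loga)         # (m, 1) dot (1, m) -> (m, m)
print(table.shape)
```

Summing `table` adds up every pairwise product, which has no relationship to the sum we want.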

I saw an implementation of the logistic regression loss function on Kaggle, shown below. It doesn’t use the np.sum() function.

```python
def loss(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
```


Why does our implementation in this course have to use np.sum()? I don’t see why.

It’s a good point that the cost is defined as the mean of the loss values across the samples in the batch, so you could use the mean function. My guess is that the reason Prof Ng did it with the factor of 1/m instead is that there’s another way to implement this that is way more efficient: use np.dot instead. That way you only have one vectorized operation instead of two: no need for separate steps of multiply and then either sum or mean. The only thing that the mean function saves you is having to write the code to multiply by the scalar value 1/m. It doesn’t save you any computation, since the division by m is still happening in the mean code.
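As a sketch of the comparison (variable names and the 1 x m shapes are assumptions matching the assignment's convention), the dot-product version and the mean version compute the same cost:

```python
import numpy as np

m = 5
y = np.random.rand(1, m)                    # labels as a 1 x m row vector
a = np.random.rand(1, m) * 0.8 + 0.1        # activations kept away from 0 and 1

# One vectorized op per term: dot products replace multiply-then-sum
cost_dot = -(np.dot(y, np.log(a).T) + np.dot(1 - y, np.log(1 - a).T)).item() / m

# Two vectorized ops per term: element-wise multiply, then mean
cost_mean = (-y * np.log(h := a) - (1 - y) * np.log(1 - h)).mean()

print(abs(cost_dot - cost_mean) < 1e-9)
```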

You could also argue from the point of view of code clarity. Using the mean function makes the intent more clear, so the question is whether you care more about that than the performance difference of two versus one vectorized operation.

I understand now. The np.mean() function implicitly performs the np.sum() computation and then divides by the number of elements, so the two approaches are essentially the same.
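A one-line check of that equivalence:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0])
# np.mean is np.sum followed by division by the element count
print(np.mean(v), np.sum(v) / v.size)   # both give 2.5
```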

Another question, regarding forward propagation:

Z = W.T X + b

But in Week 4, in the lecture notes, WX + b is used when computing Z, instead of W.T X + b. Please see the attachment. Shouldn’t the transpose of W be used consistently?

You’re right that it might be better if things were consistent, but that is Prof Ng’s choice. He uses the convention that standalone vectors are column vectors. So w in the Logistic Regression case is a column vector and that requires the transpose with the way he has defined the sample matrix X. But when it comes time to define the weights for a real NN, he defines them such that a transpose is no longer required. In other words, it’s the LR case that is the outlier here. If you wanted to make them consistent, the answer is not to add the transpose in the NN case: it’s to remove it in the LR case by making w a row vector.
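The two conventions side by side, in a sketch with assumed shapes (n_x features, m samples, n_units units in the NN layer):

```python
import numpy as np

n_x, m = 3, 5
X = np.random.rand(n_x, m)        # samples stacked as columns

# Logistic Regression: w is a column vector, so a transpose is needed
w = np.random.rand(n_x, 1)
Z_lr = np.dot(w.T, X)             # (1, n_x) dot (n_x, m) -> (1, m)

# NN layer: W already stores the transposed weight vectors as its rows
n_units = 4
W = np.random.rand(n_units, n_x)
Z_nn = np.dot(W, X)               # (n_units, n_x) dot (n_x, m) -> (n_units, m)

print(Z_lr.shape, Z_nn.shape)
```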

Here’s another thread which discusses this in more detail.


So in the NN case, each training example x^(i) is still a column vector, as it is in LR, but each weight vector (each row of W) is a row vector, whereas the weight vector in LR is a column vector. Is that right?

Yes, that’s right. Prof Ng chooses to stack the transposed w vectors as the rows of the W weight matrices for NNs. This is explained in the lectures and in this other thread, which I think I already linked above.
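A sketch of that stacking, with made-up per-unit weight vectors w1 and w2:

```python
import numpy as np

n_x = 3
w1 = np.random.rand(n_x, 1)       # each unit's weights as a column vector
w2 = np.random.rand(n_x, 1)

# Stack the transposed column vectors as the rows of W
W = np.vstack([w1.T, w2.T])       # shape (2, n_x)

x = np.random.rand(n_x, 1)        # one training sample (column vector)
z = np.dot(W, x)                  # shape (2, 1): no transpose needed on W

# Row i of W reproduces the per-unit dot product w_i^T x
print(np.allclose(z[0], np.dot(w1.T, x)))
```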