In Q3 the solution is provided as shown in the image. Can anyone explain why `dw` is calculated this way?

Hey @szelesaron,

The explanation can be given in 2 different ways: one from the point of view of the dimensions of the different variables, and one that really explains it in depth.

Starting with the first explanation: `dw` is supposed to have dimensions `(n, 1)`. `X.T` has dimensions `(n, m)` and `err` has dimensions `(m, 1)`. Thus, the dot product, i.e., `np.dot(X.T, err)`, will have dimensions `(n, 1)`, and voilà, the explanation is done. However, there are lots and lots of combinations of these variables that would also give us dimensions of `(n, 1)`, so this argument alone is not of much use.
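For instance, here is a minimal sketch of that shape check, using made-up sizes (`m = 5` and `n = 3` are purely hypothetical values for illustration):

```python
import numpy as np

# Hypothetical sizes, purely to confirm the shapes line up
m, n = 5, 3                 # 5 examples, 3 features
X = np.random.rand(m, n)    # design matrix, shape (m, n)
err = np.random.rand(m, 1)  # prediction errors, shape (m, 1)

dw = np.dot(X.T, err)       # (n, m) @ (m, 1) -> (n, 1)
print(dw.shape)             # prints (3, 1), i.e., (n, 1)
```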

Now, let’s come to the second explanation. I am assuming that `f_w` and `err` make complete sense to you, since they have already been discussed in great depth in the lecture videos. `err`, as I just mentioned, has dimensions `(m, 1)`: it contains the error in prediction for each of the examples, and since we have `m` examples, it trivially holds `m` values, hence the dimensions `(m, 1)`. Also, by looking at the equations from which `err` is calculated, you can make complete sense of its dimensions.
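As a concrete illustration, here is a minimal sketch of how `err` could be formed, assuming a linear model `f_w(x) = x @ w` (both the model choice and the data here are hypothetical):

```python
import numpy as np

# Hypothetical data, assuming a linear model f_w(x) = x @ w
m, n = 5, 3
X = np.random.rand(m, n)  # (m, n) design matrix
y = np.random.rand(m, 1)  # (m, 1) targets
w = np.random.rand(n, 1)  # (n, 1) weights

f_wb = np.dot(X, w)       # predictions for all m examples, shape (m, 1)
err = f_wb - y            # one error per example, hence shape (m, 1)
```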

Now, unlike the double-for-loop solution, i.e., using **nested for loops**, which you are required to implement in the assignment, and which is also easier to implement, this solution exploits vectorization at 2 different levels: **first across the different weights, and second across the different examples**. Let’s understand it step by step. We have the below equation for calculating the gradient for a single weight (or coefficient):

$$\frac{\partial J(\mathbf{w})}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Now, if we first vectorize it across all the `n` weights (since we have `n` different features), we get the equation:

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right) \mathbf{x}^{(i)}$$

Here, note that the **gradient is with respect to the entire weight vector**, instead of a single weight corresponding to a single feature. In other words, the former equation gives you `dj_dw_i` while the latter gives you `dj_dw`. Here, I am using the variables as defined in Exercise 3 of the assignment.

Now, we know that the error for the $i^{th}$ example is:

$$err^{(i)} = f_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)}$$

So, the above gradient equation reduces to:

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{m} \sum_{i=1}^{m} err^{(i)} \, \mathbf{x}^{(i)} = \frac{1}{m} \, X^T \, \mathbf{err}$$

which is nothing but the product of the matrix `X.T` and the vector `err`, multiplied by a scalar, i.e.,

```python
dj_dw = (1 / m) * np.dot(X.T, err)
```
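To see the two levels of vectorization concretely, here is a sketch (again with hypothetical data and a linear model, which is an assumption on my part) that computes the same gradient twice, once with nested `for` loops and once with the one-liner above, and checks that they agree:

```python
import numpy as np

# Hypothetical data, assuming a linear model f_w(x) = x @ w
m, n = 5, 3
X = np.random.rand(m, n)
y = np.random.rand(m, 1)
w = np.random.rand(n, 1)
err = np.dot(X, w) - y                      # (m, 1)

# Nested-for-loop version: outer loop over weights, inner loop over examples
dj_dw_loop = np.zeros((n, 1))
for j in range(n):                          # level 1: across the n weights
    for i in range(m):                      # level 2: across the m examples
        dj_dw_loop[j] += err[i, 0] * X[i, j]
dj_dw_loop /= m

# Vectorized version: both loops collapse into a single matrix product
dj_dw_vec = (1 / m) * np.dot(X.T, err)

print(np.allclose(dj_dw_loop, dj_dw_vec))   # True
```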

I hope this makes sense to you. But if it doesn’t, you will be glad to know that the assignment wants you to stick with the nested `for` loop approach for now, since this solution involves vectorization at multiple levels, which may appear unfriendly to a lot of learners. Moreover, the solution file is not intended to be provided to the learners in the first place; that’s a bug the team is currently working on.

Regards,

Elemento