W3E3 question solution unclear

In q3 the solution is provided as shown in the image. Can anyone explain why dw is calculated this way?


Hey @szelesaron,
The explanation can be given in 2 different ways, one from the point of view of the dimensions of the different variables, and one to really understand it in depth.

Starting with the first explanation: dw is supposed to have dimensions (n, 1). X.T has dimensions (n, m) and err has dimensions (m, 1), so the product np.dot(X.T, err) has dimensions (n, 1), which matches. However, many other combinations of these variables also produce an (n, 1) result, so the shape check alone does not tell us why this particular expression is the right one.
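To make the shape argument concrete, here is a minimal sketch. The sizes (m = 5, n = 3) and the random arrays are made up for illustration and are not the assignment data:

```python
import numpy as np

m, n = 5, 3                          # hypothetical: 5 examples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))          # placeholder feature matrix, shape (m, n)
err = rng.normal(size=(m, 1))        # placeholder error vector, shape (m, 1)

dw = np.dot(X.T, err)                # (n, m) times (m, 1) gives (n, 1)
print(X.T.shape, err.shape, dw.shape)  # (3, 5) (5, 1) (3, 1)

# A different, incorrect expression that has exactly the same shape:
other = np.dot(X.T, err ** 2)
print(other.shape)                   # (3, 1)
```

The second expression has the right shape but is not the gradient, which is why the shape check is only a sanity test.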

Now, let’s come to the second explanation. I am assuming that f_w and err make sense to you, since they have already been discussed in depth in the lecture videos. err, as I just mentioned, has dimensions (m, 1). It contains the prediction error for each example, and since we have m examples, it has m values, hence the dimensions (m, 1). You can also confirm these dimensions by looking at the equations from which err is calculated.
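As a rough illustration of where err comes from, here is a small sketch. It assumes a logistic regression model with a sigmoid activation; the `sigmoid` helper, the toy data, and the zero-initialised `w` and `b` are my own placeholders, not the assignment's code:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (hypothetical): m = 4 examples, n = 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])  # (m, n)
y = np.array([[0.0], [0.0], [1.0], [1.0]])                       # (m, 1)
w = np.zeros((2, 1))                                             # (n, 1)
b = 0.0

f_w = sigmoid(np.dot(X, w) + b)   # predictions, one per example: (m, 1)
err = f_w - y                     # prediction errors, one per example: (m, 1)
print(err.shape)                  # (4, 1)
```

With all-zero weights every prediction is 0.5, so err is simply 0.5 - y for each example.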

Now, unlike the double-for-loop solution, i.e., using nested for loops, which you are required to implement in the assignment, and which is also easier to implement, this solution exploits vectorization at 2 different levels, first across the different weights, and second across the different examples. Let’s understand it step-by-step. We have the below equation for calculating the gradient for a single weight (or coefficient):

\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)})x_{j}^{(i)} \tag{3}

Now, if we first vectorize it across all the n weights, since we have n different features, we will get the equation as:

\frac{\partial J(\mathbf{w},b)}{\partial \mathbf{w}} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)})\mathbf{x}^{(i)}
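This intermediate step, vectorized across the weights but still looping over the examples, could be sketched as follows. The function name `gradient_one_loop` and the toy numbers are hypothetical, and 1-D arrays are used here for brevity:

```python
import numpy as np

def gradient_one_loop(X, y, f_w):
    """Loop over examples only; each step updates all n weights at once."""
    m, n = X.shape
    dj_dw = np.zeros(n)
    for i in range(m):
        err_i = f_w[i] - y[i]      # scalar error for example i
        dj_dw += err_i * X[i]      # X[i] is the whole feature vector x^(i)
    return dj_dw / m

# Toy values (made up): 2 examples, 2 features, predictions fixed at 0.5
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 1.0])
f_w = np.array([0.5, 0.5])
print(gradient_one_loop(X, y, f_w))  # [-0.5 -0.5]
```

The inner loop over j from the nested-loop version has disappeared because `err_i * X[i]` handles every feature in one operation.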

Here, note that the gradient is with respect to the entire weight vector, instead of a single weight, corresponding to a single feature. In other words, the former equation gives you dj_dw_i while the latter gives you dj_dw. Here, I am using the variables as defined in Exercise 3 of the assignment.

Now, we know that the error for the i-th example is:

err^{(i)} = f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)}

So, the above gradient equation reduces to:

\frac{\partial J(\mathbf{w},b)}{\partial \mathbf{w}} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} err^{(i)} \mathbf{x}^{(i)}

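which is nothing but the matrix-vector product of X.T and err, multiplied by the scalar 1/m. Column i of X.T is the feature vector of the i-th example, so the product adds up those columns, each weighted by the error of its example, i.e.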

dj_dw = (1/m) * np.dot(X.T, err)
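If you want to convince yourself that the two approaches agree, a quick check like the one below may help. It is only a sketch: `gradient_loops` and `gradient_vectorized` are names I made up, and the inputs are random placeholders rather than assignment data:

```python
import numpy as np

def gradient_loops(X, err):
    """Nested-loop version: one (example, feature) pair at a time."""
    m, n = X.shape
    dj_dw = np.zeros((n, 1))
    for i in range(m):
        for j in range(n):
            dj_dw[j, 0] += err[i, 0] * X[i, j]
    return dj_dw / m

def gradient_vectorized(X, err):
    """Fully vectorized version, as in the one-line solution."""
    m = X.shape[0]
    return (1 / m) * np.dot(X.T, err)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))      # hypothetical: 6 examples, 3 features
err = rng.normal(size=(6, 1))    # placeholder errors

print(np.allclose(gradient_loops(X, err), gradient_vectorized(X, err)))  # True
```

Both functions compute the same sum; the vectorized one just lets NumPy do the looping over i and j internally.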

I hope this makes sense to you. But if it doesn’t, you will be glad to know that the assignment wants you to stick with the nested for loop approach for now, since this solution involves vectorization at multiple levels, which may appear unfriendly to a lot of learners. Moreover, the solution file is not intended to be provided to learners in the first place. It’s a bug that the team is currently working on.