I’ve been running through each lab and re-typing everything in a new Jupyter notebook to see if I can understand the rationale of what I am doing. I was understanding what was going on in the code for every single one, until I got to the Multiple Linear Regression one. Specifically, my brain kind of froze when I tried to process how the compute_gradient function was creating the values for dj_dw. From there until the end of the lab, I wasn’t able to really have a solid grasp of the logic of what each line was precisely doing. My question is, are we expected to understand the logic of it as we go through these optional labs? If not, will we eventually get to a point where the understanding will come naturally? Thanks.

Hello Andres, in my experience, there were many times when something I could not understand at first became clear after a while. It could be the very next morning, when I was completely recharged, or after a year of experience during which I had read and thought about the relevant topics many times. I certainly believe that even if you do not get the idea today, you will get it soon enough. So, relax!

As for `dj_dw` specifically, to be frank, you will meet it time and time again in this course because it is part of gradient descent, which is the core of training a neural network. However, there are three pieces of good news: (1) in practice, when you build your ML model using TensorFlow, you don’t actually need to code `dj_dw` yourself; (2) we are here to help, so you can search this community for relevant threads and, if you have no luck, open your own and ask your specific question like you did; and (3) most people do understand it in the end.

With all that said, your own expectations matter the most. Since you have asked such a specific question, I believe you are not going to rely on TensorFlow alone, and understanding this yourself is worthwhile because it will help you train your future NN models better. Being able to translate concepts into code is a critical skill as well.

Now, your question. I am going to walk through the text description here once. I think this approach is good because it is the exact process you will need to repeat in every other optional lab and assignment, so let’s get started.
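For reference, the update rule discussed next can be written out as below. This is a reconstruction from the wording of this post in the course's usual notation, with $\alpha$ as the learning rate and $n$ as the number of features; it is not a copy of the lab's text:

$$
\begin{aligned}
&\text{repeat until convergence:} \\
&\quad w_j = w_j - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial w_j} \quad \text{for } j = 0, \dots, n-1 \\
&\quad b = b - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial b}
\end{aligned}
$$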

These maths formulae describe the gradient descent algorithm. The idea is to update $w_j$ and $b$ repeatedly. For example, we replace the original value of $w_j$ with the original value of $w_j$ minus $\alpha$ times $\frac{\partial J}{\partial w_j}$. If you can follow me, that last sentence is exactly what we need to implement to update $w_j$, but before we can do this, the one missing piece of information is how to get $\frac{\partial J}{\partial w_j}$, which is just `dj_dw`. Luckily, the text description also covers that:

From the first formula, it is nothing more than calculating the **(A) error** (which is $f(x^{(i)}) - y^{(i)}$) and multiplying it by $x_j^{(i)}$ (feature $j$ of sample $i$) for every sample, from the first ($i=0$) to the last ($i=m-1$), then **(B) summing the results up**, and lastly **(C) dividing** the sum by $m$ to get the **average** value over all samples. If you can follow this too, the add/subtract/multiply/divide algebra I am describing here is exactly what we are going to implement! Let’s look at the code:

First, line 18 is a for-loop that goes over all samples, because `i` runs from `0` to `m-1`. Whatever happens inside the for-loop (as indicated by the indentation level) is a per-sample action.

In line 19, we calculate the error first, echoing my (A) above. Here the code uses `np.dot` to save a for-loop over the features, although it still uses a for-loop over the features in line 20 to demonstrate how we go feature by feature to **accumulate** `dj_dw`. This way of accumulating numbers is our usual trick when we want to sum up a series of results (echoing my (B)). Outside the loop, we set the accumulator variables to zero (lines 15 and 16); then, inside the loop, we keep adding new values to the accumulators. By doing this we are essentially implementing, in code, the summation sign in the maths formula. Lastly, for my (C), the code equivalent is line 23, whose result is returned in line 26.

The same logic works for `dj_db`, so please try to think it through yourself.
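To make steps (A), (B) and (C) concrete, here is a minimal sketch of a loop-based gradient function. It is written from the description in this post and is not necessarily identical to the lab's `compute_gradient`: the function name `compute_gradient_loops`, the return order and the tiny data set are assumptions for illustration, and the line numbers quoted in the walkthrough refer to the lab's own listing, not to this sketch.

```python
import numpy as np

def compute_gradient_loops(X, y, w, b):
    """Loop-based gradient for linear regression (illustrative sketch)."""
    m, n = X.shape              # m samples, n features
    dj_dw = np.zeros(n)         # one accumulator per feature
    dj_db = 0.0                 # single accumulator for the bias
    for i in range(m):          # go over every sample
        err = (np.dot(X[i], w) + b) - y[i]   # (A) error for sample i
        for j in range(n):      # go over every feature
            dj_dw[j] += err * X[i, j]        # (B) accumulate the sum for w_j
        dj_db += err                         # (B) accumulate the sum for b
    dj_dw = dj_dw / m           # (C) average over all samples
    dj_db = dj_db / m
    return dj_db, dj_dw

# Made-up data: 2 samples, 2 features, parameters starting at zero.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.zeros(2)
b = 0.0

dj_db, dj_dw = compute_gradient_loops(X, y, w, b)
print(dj_dw)   # one partial derivative per feature
print(dj_db)
```

With the parameters at zero, the errors are -1 and -2, so the entries of `dj_dw` can be checked by hand: (-1·1 + -2·3)/2 and (-1·2 + -2·4)/2.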

Lastly, using `np.dot` is what we call a vectorized way of doing what a for-loop can do. Although the vectorized approach is more efficient, a for-loop is more explicit and easier for learners who are not familiar with vector algebra. Please don’t worry about `np.dot` for now, but you can always watch the vectorization videos in C1 W2 again to refresh your memory.
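For anyone curious how the vectorized form might look, here is a rough sketch that replaces both for-loops with matrix operations and then applies a single gradient descent update. The function name `compute_gradient_vectorized`, the data and the learning rate `alpha = 0.1` are assumptions chosen for illustration; this is not the lab's code.

```python
import numpy as np

def compute_gradient_vectorized(X, y, w, b):
    """Vectorized gradient for linear regression (illustrative sketch)."""
    m = X.shape[0]
    err = X @ w + b - y       # errors for all m samples at once, shape (m,)
    dj_dw = X.T @ err / m     # sum of err * x_j over samples, averaged, shape (n,)
    dj_db = err.sum() / m     # average error
    return dj_db, dj_dw

# Made-up data: 2 samples, 2 features, parameters starting at zero.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.zeros(2)
b = 0.0

dj_db, dj_dw = compute_gradient_vectorized(X, y, w, b)

# One gradient descent step: new value = old value - alpha * derivative.
alpha = 0.1                   # hypothetical learning rate
w_new = w - alpha * dj_dw
b_new = b - alpha * dj_db
print(w_new, b_new)
```

The expression `X.T @ err` performs the per-feature accumulation in one call: entry `j` of the result is the sum over all samples of the error times feature `j`.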

Good luck, and keep it up Andres.

Raymond

Hey Raymond, I appreciate the detailed answer.

Indeed, after I woke up today, I felt much more energized, and processing the information feels much easier.

I guess I should be more specific when I ask my question.

My question is why dj_dw is an array containing n “spots”?

Since we are adding the error*X[i,j] to dj_dw[j] in every j iteration, dj_dw will contain perhaps 4 different values.

When I first coded my univariate linear regression model, dj_dw was a single value that accumulated all the error*x terms over a for-loop of m iterations.

To be honest, as I type this answer, I am starting to realize that the gradient descent algorithm for multiple linear regression uses **wj**, which explains why there would be n spots in dj_dw: one spot for each iteration j.

Each w value (ranging from 0-3, given 4 different features) will have its own partial derivative calculated, and then stored in dj_dw separately, totaling 4.

After we have updated all the values m times, we divide each value in the array by m.
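The difference between the two cases can be seen in a small sketch. The data below is made up, with 4 features so that `dj_dw` has 4 spots, and the first column of `X` deliberately reuses the univariate `x` so the two results can be compared; the variable names are assumptions for illustration.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0])
b = 0.0

# Univariate case: one w, so dj_dw is a single accumulated number.
x = np.array([1.0, 2.0, 3.0])
w = 0.0
dj_dw_scalar = 0.0
for i in range(len(x)):
    dj_dw_scalar += (w * x[i] + b - y[i]) * x[i]
dj_dw_scalar /= len(x)

# Multiple-feature case: n weights, so dj_dw needs n spots.
X = np.array([[1.0, 0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0, 2.0],
              [3.0, 0.0, 1.0, 2.0]])
m, n = X.shape
w_vec = np.zeros(n)
dj_dw = np.zeros(n)
for i in range(m):
    err = np.dot(X[i], w_vec) + b - y[i]
    for j in range(n):
        dj_dw[j] += err * X[i, j]   # spot j collects the terms for w_j only
dj_dw /= m                          # divide every spot by m

print(dj_dw.shape)                  # one spot per feature
print(dj_dw[0], dj_dw_scalar)       # same calculation, applied to column 0
```

Each spot of `dj_dw` is the univariate accumulation repeated for one column of `X`, which is why the first spot matches the univariate result when the weights start at zero.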

Thanks for helping me work through it. What seemed impossible yesterday seems clear today.

Please let me know if my rationale is incorrect though, haha.

It is perfect, Andres. A good start for the day, isn’t it?

Have a great rest of the day!

Raymond