Regression with a perceptron - gradient descent

We covered gradient descent, and how derivatives and cost functions make it work.

I’m on the video (title) and don’t really understand why he picked those particular derivatives.

I understand that y and yHat are part of L(y, yHat), but I have no idea why he picked out those different parts.

Sorry, I do not understand your question clearly.

I’m also not sure what you are really asking here, but if the question is why he is computing the partial derivatives of L w.r.t. w_1, w_2 and b, it is because those are the “variables” here. Those are the parameters of the regression, meaning that those are what we need to compute the optimal values for, right? Normally when you are doing calculus, you are taking derivatives w.r.t. the input variable. So if you have:

y = f(x)

then you care about

f'(x) = \displaystyle \frac {dy}{dx}

But here we are looking at things in a different way. Our goal is really to define the function, not the inputs. The inputs are either our training data or the actual real world data that we want to use in order to make predictions. So we treat those as essentially constants and are really trying to use gradient descent (using the partial derivatives) to compute the optimal values for the parameters w_1, w_2 and b. In other words, the values of those parameters which will minimize the loss L. The point of L is that it measures the quality of our predictions.
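To make that concrete, here is a sketch of where those partial derivatives come from. I am assuming the usual setup for this kind of regression, a linear prediction and a squared-error loss; the video may use a different scaling or a different loss, so treat the exact formulas as an illustration. Each derivative is just the chain rule applied through \hat{y}, with x_1, x_2 and y held constant:

```latex
% Assumed model and loss (not confirmed by the thread):
\hat{y} = w_1 x_1 + w_2 x_2 + b, \qquad L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2

% Chain rule: dL/dw = (dL/dyhat) * (dyhat/dw), inputs treated as constants
\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y})

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1} = -(y - \hat{y})\, x_1

\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2} = -(y - \hat{y})\, x_2

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} = -(y - \hat{y})
```

So the “different parts” are the factors of the chain rule: how L changes with the prediction, and how the prediction changes with each parameter.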

Think of it this way: what \displaystyle \frac {\partial L}{\partial w_1} tells us is what effect a small change in w_1 will have on the value of L for a given sample in our input dataset. Then we average those gradients across all the training samples, and that average is what drives gradient descent for w_1. And similarly for w_2 and b.
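If it helps to see the averaging in action, here is a minimal Python sketch. It assumes a squared-error loss L = 0.5 * (y - y_hat)**2 and a linear prediction y_hat = w1*x1 + w2*x2 + b. The toy data, the function names, the learning rate and the step count are all made up for illustration and are not from the video.

```python
# Hypothetical toy data generated from y = 2*x1 + 3*x2 + 1 (illustration only).
samples = [
    ((1.0, 2.0), 9.0),
    ((2.0, 0.0), 5.0),
    ((0.0, 1.0), 4.0),
    ((3.0, 1.0), 10.0),
]


def predict(w1, w2, b, x1, x2):
    """Linear prediction y_hat for one sample."""
    return w1 * x1 + w2 * x2 + b


def average_gradients(w1, w2, b, data):
    """Average dL/dw1, dL/dw2 and dL/db over all samples.

    Assumes L = 0.5 * (y - y_hat)**2, so dL/dy_hat = -(y - y_hat).
    The inputs x1, x2 and the target y are treated as constants.
    """
    g_w1 = g_w2 = g_b = 0.0
    for (x1, x2), y in data:
        error = predict(w1, w2, b, x1, x2) - y  # equals -(y - y_hat)
        g_w1 += error * x1  # dL/dw1 for this sample
        g_w2 += error * x2  # dL/dw2 for this sample
        g_b += error        # dL/db for this sample
    n = len(data)
    return g_w1 / n, g_w2 / n, g_b / n


def gradient_descent(data, learning_rate=0.05, steps=5000):
    """Repeatedly move each parameter against its averaged gradient."""
    w1 = w2 = b = 0.0
    for _ in range(steps):
        g_w1, g_w2, g_b = average_gradients(w1, w2, b, data)
        w1 -= learning_rate * g_w1
        w2 -= learning_rate * g_w2
        b -= learning_rate * g_b
    return w1, w2, b


w1, w2, b = gradient_descent(samples)
print(round(w1, 3), round(w2, 3), round(b, 3))
```

On this toy data the parameters should end up very close to 2, 3 and 1, the values used to generate it. Notice that x1, x2 and y never change inside the loop; only w1, w2 and b are updated.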


Wonderful, thank you.
