Hi everyone,
I’m working through the derivation of the cost function in linear regression, specifically where W and X are vectors in the equation WX + B. I understand how differentiation works with scalars, but I’m having trouble grasping how it applies to vectors.
Could someone explain how vector calculus is applied here? Any resources or explanations would be greatly appreciated! Image attached below.
Thanks!
In calculus, you are familiar with finding the derivative of a function with respect to a variable. When dealing with vectors, the principles are similar, but you must consider how each element in the vector affects the function.
1. Gradient with Respect to w:
The gradient with respect to the weight vector w is computed as:
\frac{\partial J(w, b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( w \cdot x^{(i)} + b - y^{(i)} \right) \cdot x^{(i)}
Here’s how it works:
- Step 1: Start from the cost function J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( w \cdot x^{(i)} + b - y^{(i)} \right)^2, which is a sum of squared errors over the m training examples.
- Step 2: Apply the chain rule. Differentiating the squared term \left( w \cdot x^{(i)} + b - y^{(i)} \right)^2 gives 2 \left( w \cdot x^{(i)} + b - y^{(i)} \right); this factor of 2 cancels the \frac{1}{2} in \frac{1}{2m}, which is why no 2 appears in the final gradient.
- Step 3: The derivative of the inner linear term w \cdot x^{(i)} + b - y^{(i)} with respect to w is x^{(i)}, since x^{(i)} is data and is therefore constant with respect to w.
Multiplying these pieces and averaging over the m examples gives the gradient above: a vector whose j-th component is the partial derivative of the cost with respect to w_j, as in the sketch below.
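To make that concrete, here is a minimal NumPy sketch of how this gradient could be computed. The variable names (X, y, w, b, grad_w) and the toy numbers are my own for illustration, not from the course material:

```python
import numpy as np

def grad_w(w, b, X, y):
    """Gradient of J(w, b) with respect to the weight vector w.

    X has shape (m, n): one row per training example x^(i).
    Each error (w . x^(i) + b - y^(i)) scales its own x^(i),
    and the results are averaged over the m examples.
    """
    m = X.shape[0]
    errors = X @ w + b - y      # shape (m,): the error for every example
    return (X.T @ errors) / m   # shape (n,): one partial derivative per w_j

# Toy data (made-up numbers, only to show the shapes)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # m = 3 examples, n = 2 features
y = np.array([5.0, 11.0, 17.0])
w = np.array([0.5, 1.5])
b = 0.0

print(grad_w(w, b, X, y))       # a length-2 vector, matching the shape of w
```

Note that the result has the same shape as w, which is exactly the "one partial derivative per component" idea.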
2. Gradient with Respect to b:
Similarly, the gradient with respect to the bias term b is computed as:
\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( w \cdot x^{(i)} + b - y^{(i)} \right)
The key difference is that differentiating the term b with respect to b yields 1 rather than x^{(i)}, so no feature vector multiplies the error and the result is a single scalar.
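As a companion to the previous sketch, the bias gradient could look like this in NumPy (again, the names and toy numbers are mine):

```python
import numpy as np

def grad_b(w, b, X, y):
    """Gradient of J(w, b) with respect to b: simply the mean error.

    Because the derivative of b with respect to b is 1, no x^(i)
    factor multiplies the error, and the result is a single scalar.
    """
    errors = X @ w + b - y
    return errors.mean()

# Same toy data as in the previous sketch
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([5.0, 11.0, 17.0])
w = np.array([0.5, 1.5])
b = 0.0

print(grad_b(w, b, X, y))   # a single number, matching the scalar b
```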
- Vector Calculus Application: When you differentiate a scalar function with respect to a vector (like w), you obtain a vector where each component is the partial derivative of the function with respect to one of the components of w.
- Chain Rule in Vector Form: The chain rule applies just as in scalar calculus, but it acts component by component across the vector. This is why the x^{(i)} term multiplies the error (w \cdot x^{(i)} + b - y^{(i)}) in the gradient with respect to w; a small numerical check of this is sketched after this list.
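To see the "vector of partial derivatives" idea numerically, here is a small finite-difference check, a sketch using my own made-up data and helper names: each component of the analytic gradient should match the numerical partial derivative obtained by nudging one component of w at a time.

```python
import numpy as np

def cost(w, b, X, y):
    """J(w, b) = (1/(2m)) * sum_i (w . x^(i) + b - y^(i))^2"""
    errors = X @ w + b - y
    return (errors ** 2).sum() / (2 * X.shape[0])

def grad_w(w, b, X, y):
    """Analytic gradient of J with respect to w (the formula above)."""
    errors = X @ w + b - y
    return (X.T @ errors) / X.shape[0]

# Made-up data, just for the check
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
b = 0.1

analytic = grad_w(w, b, X, y)

# Numerical partials: nudge one component of w at a time
eps = 1e-6
numerical = np.zeros_like(w)
for j in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numerical[j] = (cost(w_plus, b, X, y) - cost(w_minus, b, X, y)) / (2 * eps)

print(np.allclose(analytic, numerical))  # should print True
```

If the two agree, you have verified that the analytic vector gradient really is just the collection of scalar partial derivatives you already know how to compute.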
To deepen your understanding, check out "The Matrix Calculus You Need For Deep Learning" by Terence Parr and Jeremy Howard, which is an excellent resource for how derivatives work with matrices and vectors, especially in machine learning.
Thank you so much, this was very helpful.