So I know that bias is needed to shift the line of best fit along an axis (in some cases it’s the x axis and in other cases the y axis), but why? Why can’t we just shift the origin such that the line always passes through the origin?

What does it mean to “shift the origin” from a mathematical p.o.v.? I have not taken this course, so I don’t know specifically what problem is being solved here, but if it is using Machine Learning, then bear in mind that you don’t know what the solution is *a priori*, right? So how do you figure out the right amount by which to “shift the origin” to eliminate the bias? How does that simplify things relative to just including the bias term and letting the algorithm learn its value along with all the other weights (parameters) that are being learned?

Mathematically, that’s the same thing as adding a bias.
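To make that concrete, here is a minimal sketch of why the two are the same thing. The slope `w` and the shift `x0` are made-up values for illustration only: moving the origin by `x0` gives `y = w * (x - x0)`, which expands to `w * x + b` with `b = -w * x0`.

```python
# Hypothetical values, chosen only for illustration.
w = 2.0      # slope of the line
x0 = -2.5    # how far we shift the origin along the x axis

def shifted_origin(x):
    """Line through the origin of the shifted coordinate x' = x - x0."""
    return w * (x - x0)

# Expanding w * (x - x0) gives w * x + (-w * x0), so the shift is a bias.
b = -w * x0

def with_bias(x):
    """The same line written with an explicit bias term."""
    return w * x + b

samples = [-3.0, 0.0, 1.5, 10.0]
same = all(abs(shifted_origin(x) - with_bias(x)) < 1e-12 for x in samples)
print(b, same)
```

So choosing the shift `x0` and choosing the bias `b` are the same decision in different notation, and either way the value still has to be found somehow.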

Yes, I have read this on a blog, but couldn’t comprehend it. Would you like to help me?

I know why they are variables: it is just a means to change them based on learning.

Sure, but I think Tom and I answered the question, didn’t we? If you don’t include bias, then you are putting a very severe artificial restriction on the solutions that you can find. It’s exactly analogous to saying you will only accept lines through the origin. Yes, you’re right that this is logically equivalent to applying two transformations in order: shift the origin by the bias amount and then do the “rotate and scale” linear transformation. But the problem is that the bias is learned, right? You don’t know what it is until after you run the training. So Prof Ng has given you a way to learn both transformations in one step. You want to make it two steps, so it’s your job to show us how you could do that and why that is somehow better than Prof Ng’s method.
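As a rough illustration of how severe that restriction is, here is a small sketch on made-up data generated from y = 2x + 5. It uses the closed-form least-squares formulas rather than gradient descent, so it is not the course's method, just a quick way to compare the two model families. All names are hypothetical.

```python
# Hypothetical toy data from y = 2x + 5, with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 5.0 for x in xs]

def fit_through_origin(xs, ys):
    """Least-squares slope when the line is forced through (0, 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_with_bias(xs, ys):
    """Least-squares slope and bias, both learned together."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

def mse(xs, ys, w, b=0.0):
    """Mean squared error of the line w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w0 = fit_through_origin(xs, ys)
w1, b1 = fit_with_bias(xs, ys)
print("no bias:  ", w0, mse(xs, ys, w0))
print("with bias:", w1, b1, mse(xs, ys, w1, b1))
```

The model with a bias recovers the line used to generate the data, while the best line through the origin is left with a clearly nonzero error no matter which slope it picks.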

Not sure if this will help.

In an earlier version of ML courses, Andrew taught that the ‘bias’ value was simply considered to be another weight.

So a straight line would have two weights - one for the slope and one for the y-offset. This vector of weights was called ‘theta’.

Mathematically this is implemented using feature values of [1, x] for every example. So the predictions in that form are h = X * theta.

This lets you use a single dot product to compute the entire model, instead of what these DLAI courses do (f_wb = w*x + b), which is a dot product for the weights followed by adding the bias separately.

Mathematically the two methods are identical.
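Here is a small sketch of that equivalence for the single-feature case. The numbers are made up, and I am assuming theta is ordered as [bias, slope] to match the [1, x] feature layout described above.

```python
def predict_theta(x, theta):
    """Older formulation: prepend a constant 1 and take one dot product."""
    features = [1.0, x]
    return sum(f * t for f, t in zip(features, theta))

def predict_wb(x, w, b):
    """DLAI formulation: weight times input, then add the bias separately."""
    return w * x + b

# Hypothetical parameters: bias 5, slope 2.
b, w = 5.0, 2.0
theta = [b, w]  # the bias is just the weight on the constant feature 1

for x in [-1.0, 0.0, 3.5]:
    print(x, predict_theta(x, theta), predict_wb(x, w, b))
```

Both functions print the same prediction for every x, because the weight attached to the constant 1 plays exactly the role of b.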

Transformations! Yes, it is the operation that changes one basis to another.

Matmul is nothing but a transformation. Finally I got the answer. And @TMosh, this is why we use the bias as a weight for a constant input (1). Otherwise it would look like we are changing the origin, and that would violate the rule for what we call **Linear Transformation**.

Somehow it all makes sense now.

The mathematical term for what we are doing here is an Affine Transformation, which is a linear transformation plus the bias.
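A quick way to see the difference, as a sketch with made-up numbers: a linear map sends 0 to 0 and satisfies f(a + c) = f(a) + f(c), and adding a nonzero bias breaks both properties.

```python
def linear(x, w=3.0):
    """A linear map: scaling only, so the origin stays fixed."""
    return w * x

def affine(x, w=3.0, b=2.0):
    """An affine map: the same scaling plus a constant offset (the bias)."""
    return w * x + b

a, c = 1.0, 4.0  # hypothetical inputs

# Additivity holds for the linear map but fails once the bias is added,
# because the bias would be counted twice on the right-hand side.
linear_is_additive = linear(a + c) == linear(a) + linear(c)
affine_is_additive = affine(a + c) == affine(a) + affine(c)

print(linear(0.0), affine(0.0))
print(linear_is_additive, affine_is_additive)
```

That is why w*x + b is strictly speaking affine rather than linear, even though in ML we casually call it a "linear" model.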

Ah, this term came up again.

I first heard it while learning the difference between a vector space and an affine space. What I know is that affine is related to *some origin stuff*. So your mentioning it now makes more sense to me.

Transformation, because it is changing something, and linear because the max power of the variable is 1.

Found something interesting and worth sharing

https://personal.math.ubc.ca/~cass/courses/m309-03a/a1/olafson/affine_fuctions.htm