Week 1 bias in feature vector

Greetings,

I got stumped when I first saw that the feature vector for a tweet contains this bias value of 1 (x_i = [1, sum_pos, sum_neg]). Not only was the bias not explained, but it was counterintuitive to introduce an extra dimension right after we went through the process of reducing the dimensionality of a sparse vector.

So I tried to work out in my head why the value “1” is needed there. Below is my very loose attempt at explaining it. Hopefully the instructors can fill this in in the course content - it really is a problem when things like this show up with no explanation.

Logistic regression is one of the simplest classifiers you can think of. If you plot points on a plane and draw a line, some points will be above it and some below. Logistic regression tries to find the line that misclassifies the fewest points. If you remember from early algebra, a line has a slope and a constant term you add to “raise” or “lower” it. A logistic regression model likewise has “slopes”, which are called weights, and a constant term that “raises” or “lowers” the classification boundary; in logistic regression that constant is called the bias.

To make a prediction, the model is combined with the input (the feature vector for the tweet): each feature of the input is multiplied by its corresponding weight, the products are added up, and then the bias is added. The multiply-and-add part can be cleverly represented as a dot product of the input vector and the model’s weight vector - except for the bias term. You could just keep remembering to add it separately, but then all the optimizations and theorems we have from the matrix world cannot be used. So to express the whole logistic regression formula as a single dot product, we make the first entry of the input vector always equal to “1”, and in the model we put the bias in the first position of the weight vector.
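
To make this concrete, here is a minimal NumPy sketch. The numbers and names (sum_pos, sum_neg, b, w1, w2) are made up for illustration, not taken from the assignment; the point is just that the leading 1 lets the bias ride along in a single dot product.

```python
import numpy as np

sum_pos, sum_neg = 3.2, 1.7              # pretend sentiment sums for one tweet
x = np.array([1.0, sum_pos, sum_neg])    # feature vector with the leading "1"

b = -0.5                                 # the bias ("constant term")
w1, w2 = 0.8, -0.6                       # the weights ("slopes")
theta = np.array([b, w1, w2])            # bias folded into position 0

# Without the trick: multiply, add, then remember to add the bias.
z_manual = w1 * sum_pos + w2 * sum_neg + b

# With the trick: one dot product does it all, because x[0] * theta[0] = 1 * b.
z_dot = np.dot(theta, x)

print(np.isclose(z_manual, z_dot))       # True
prob = 1 / (1 + np.exp(-z_dot))          # sigmoid turns z into a probability
```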

Hopefully, this helps someone move on to the rest of the week’s material…

Hi pslusarz,

Your point and explanation make sense to me. Maybe this can be picked up in an upcoming review of the course.

Thanks!

I had this same question. I appreciate that you came up with your best guess as to why it’s there, but I have my doubts on that.

Typically in ML we have bias terms when computing the weighted sums, but I’ve never seen one added as a feature to the inputs before. Especially since this term is 1) not learned or modified at all by backprop and 2) the same for every tweet, how can it add any value?

I am still watching this week’s lectures but I aim to try the programming assignment later with these bias terms and then again without them. I’ll report back if it makes any difference. My theory is it doesn’t hurt, but doesn’t help (a neural network should learn to ignore any input feature that seems to have no correlation with the target values).

OK, I think I see what they’re doing. They are basically pushing the bias term into the weight matrix itself. When you use a framework like Keras, it automatically stores those bias terms for you somewhere. But when doing this manually, you need something to be multiplied against the bias weight in your weights matrix (theta0), and that is what these 1s added to the X features are for.

It is correct that those 1s do not get modified by gradient descent; only theta0 does.
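
For anyone who wants to see that concretely, here is a rough sketch of one vectorized gradient-descent step. The shapes and names (m, alpha, theta) are my own choices, not necessarily the assignment’s; it assumes X already has the column of 1s prepended. theta0 gets updated like any other weight, while the 1s in X are never touched.

```python
import numpy as np

np.random.seed(0)
m = 5                                                     # number of tweets
X = np.column_stack([np.ones(m),                          # the 1s column
                     np.random.rand(m),                   # stand-in for sum_pos
                     np.random.rand(m)])                  # stand-in for sum_neg
y = np.random.randint(0, 2, size=(m, 1)).astype(float)    # labels
theta = np.zeros((3, 1))                                  # [theta0, theta1, theta2]
alpha = 0.1                                               # learning rate

h = 1 / (1 + np.exp(-X @ theta))     # predictions
grad = X.T @ (h - y) / m             # gradient includes the theta0 component
theta = theta - alpha * grad         # theta0 moves; the 1s in X never change

print(theta.ravel())                 # theta0 is the first entry, now nonzero
print(X[:, 0])                       # still all 1s
```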