I don’t understand the use for a bias in the final assignment.

We are training a single weight, theta, to apply to all x features. Why do we need a value for bias when we are essentially using a single node, and the bias value for every x is identical (1)?

The bias is part of the math behind the linear regression step and the activation that follows it, and it appears throughout Deep Learning as well. The bias input feature stays at 1, but the weight applied to it changes as the model trains.

What Logistic Regression does is find a linear decision boundary between the “yes” and “no” answers. If you eliminate the bias, then you can literally only get lines or planes through the origin, so you have severely limited the possible solutions. Not all data happens to match lines through the origin.

The bias feature in the input remains constant: the features of the tweets are always [1, pos, neg].
What do we gain from that constant bias feature, 1, when it’s identical across the entire training set?

That doesn’t make sense to me. Our logistic regression is calculated with a sigmoid function, which will always have the same shape. We assign a positive or negative label based on the output value of the sigmoid - whether it’s higher than 0.5 or lower, since the sigmoid is bounded to output values between 0 and 1.

What I’m confused about is the need for an extra feature, [1], that is identical for every element of the training and testing sets. It adds no useful information whatsoever for gradient descent to learn from. The theta we arrive at after training has a bias weight of just 0.002 - the algorithm is practically ignoring it as an input.

If we discarded the bias and only used two features - positive and negative - then even without retraining, we would arrive at almost the same result.
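A quick numeric sanity check on that claim. The 0.002 bias weight is the one mentioned above; the positive and negative weights here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta = [bias weight, positive weight, negative weight].
# Only the 0.002 bias weight comes from the assignment; the rest are invented.
theta = np.array([0.002, 0.85, -0.90])

x = np.array([1.0, 4.2, 1.1])  # a tweet's features: [1, pos, neg]

with_bias    = sigmoid(theta @ x)
without_bias = sigmoid(theta[1:] @ x[1:])  # drop the constant-1 feature

print(with_bias, without_bias)  # nearly identical when the bias weight is tiny
```

With a bias weight this small, dropping the constant feature barely moves the sigmoid output, which is exactly the observation being made.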

There is a coefficient corresponding to bias, right? So you are learning a value for that as well as for the other “weights”.

To understand the point about the decision boundary, think about what it means that we interpret the output of sigmoid > 0.5 to be “yes” and <= 0.5 to be “no”. Since sigmoid(z) = 0.5 exactly when z = 0, the decision boundary is expressed by the following equation:

\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = 0

If you have taken Linear Algebra, you will recognize that as the equation for a hyperplane in the input space \mathbb{R}^n, where the normal vector to the plane is given by the vector \theta[1,n] and the bias value \theta_0 determines the orthogonal distance from the origin to the plane.

So that is the “decision boundary”, and what we are doing by applying Gradient Descent with the log loss function is learning the \theta values that define the decision boundary with the lowest overall cost given the training dataset we have.

Of course the cost may well not be zero, because there is no guarantee that the data is “linearly separable”, meaning that the actual decision boundary required to get correct answers on all points may be more complex than a plane. Oh, well. Maybe in that case we need a more sophisticated type of algorithm than Logistic Regression, one that can define a non-linear decision boundary. Stay tuned for that later in NLP.

But with the above in mind what I said earlier should make sense now: if you set the bias term to zero, then what you are doing is saying you will only accept decision boundaries that intersect the origin. That is a very significant constraint on possible solutions and will very likely give you bad results in a lot of cases.
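A tiny numeric illustration of that constraint, with arbitrary weights chosen just for the sketch. When the bias weight is forced to zero, the origin always lands exactly on the decision boundary, because sigmoid(0) = 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_no_bias   = np.array([0.0, 1.0, 1.0])   # bias weight forced to zero
theta_with_bias = np.array([-3.0, 1.0, 1.0])  # arbitrary nonzero bias

origin = np.array([1.0, 0.0, 0.0])  # the point (0, 0) with its constant-1 feature

# Without a bias, the origin sits exactly on the boundary, no matter
# what the other weights are:
print(sigmoid(theta_no_bias @ origin))    # 0.5
# A nonzero bias shifts the boundary away from the origin:
print(sigmoid(theta_with_bias @ origin))  # well below 0.5
```

No choice of the remaining weights can move the zero-bias boundary off the origin; only \theta_0 can do that.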

I appreciate the time you’ve taken to put together this answer. Thank you.

I will need to catch up on hyperplanes and normal vectors, as I skipped Linear Algebra and went straight to Calculus in university.

If I’m understanding you correctly though, it seems like our training data happened to be split very nicely along a line through the origin. That meant the weight for bias needed to be close to zero - to keep the line’s intercept there. If the data were translated higher or lower, though, we’d need a different bias value to still predict accurately.

I’m glad your answer touched on what input value returns a sigmoid of 0.5, because originally I was under the impression that if our data was skewed in one direction or another, we would adjust what our threshold sigmoid value was. For example, setting the boundary at 0.7. Instead it should be the weight for bias that adjusts.

Regarding your comment, I’d like to say that the 0.5 threshold, or whatever threshold, is defined by you, the architect of the NN. It doesn’t have to be 0.5. It can be anything you define between 0 and 1, depending on your specific use case. For instance, in a medical application, like diagnosing a disease from X-rays, the architect of the NN may decide to set the threshold at 0.7, so that positive results carry a much higher probability. There may be other use cases in which you may want the threshold set at, say, 0.3.

When training a model, is it possible to make the threshold another variable for the model to learn? Having it train in order to optimize either accuracy or precision, for example, by setting a threshold value that maximizes that output?

Or is it correct to simply view the sigmoid output similarly to a normal curve plotting probability, where one would decide they only want values that have a minimum 80% probability of belonging to a category? The only issue I see with that is that the domain for the sigmoid function continues to both infinities, so you would need to bound the x values if you wanted to calculate an area under the curve.

Are either of those solutions regularly performed?

The threshold will be a parameter that you define and pass to the function.

I would not set the threshold with the objective of maximizing the output. I would set it as a means to reach my model’s objective.

If the resulting number is greater than your threshold, then you can interpret the result as TRUE, which would mean that the sample belongs to the class (cat, no cat).

If the resulting number is less than or equal to your threshold, then you can interpret the result as FALSE, which means that the sample doesn’t belong to the class (cat, no cat).
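Those two rules amount to a one-line predicate. A minimal sketch (the name `classify` is just for illustration):

```python
def classify(prob: float, threshold: float = 0.5) -> bool:
    """Interpret a sigmoid output against a threshold.

    True  -> the sample belongs to the class ("cat")
    False -> the sample does not ("no cat")
    """
    return prob > threshold

print(classify(0.73))       # True with the default 0.5 threshold
print(classify(0.73, 0.8))  # False once the threshold is raised
```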

The sigmoid’s output is considered to go from 0.0 to 1.0, not from -inf to +inf.

The threshold I set should reflect reality though. If a threshold of 0.5 results in an accuracy of 50%, whereas a threshold of 0.8 results in an accuracy of 95%, there is an objectively better choice, isn’t there?

My question is whether or not the threshold can be treated as a tunable hyperparameter - like how you can perform a grid search for optimal hyperparameters for other machine learning algorithms.
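That kind of sweep is straightforward to do yourself. A minimal sketch with made-up validation outputs, treating the threshold like any other hyperparameter in a grid search:

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels from a validation set.
probs  = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   1,    0,   1,    0,   0,   0])

# Sweep candidate thresholds and keep the one with the best accuracy.
best_t, best_acc = 0.5, 0.0
for t in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    acc = np.mean((probs > t).astype(int) == labels)
    if acc > best_acc:
        best_t, best_acc = t, acc

print(best_t, best_acc)  # 0.4 0.875 on this toy data
```

The sweep should be run on held-out validation data, not the test set, or the chosen threshold will just overfit.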

Maybe my area-under-the-curve question was confusing. The output of the sigmoid function is bounded to (0, 1), but its input values run from x = -inf to x = inf. So there isn’t a defined area under the curve when considering all x values.

The threshold is a hyperparameter, yes. You can set it to whatever value you need. In Keras, for example, you can set the threshold explicitly.

I would like to insist that the threshold should be set to reflect the reality of your purpose, and not be set just to accommodate the distribution of your data.

The threshold should be set with the end goal in mind. Coming back to the example of medical diagnostics: let’s say you have 10,000 X-rays, of which 100 show a disease and 9,900 show a healthy pair of lungs. Your goal is to identify when an X-ray shows the disease. Would you set the threshold at 1% or 99% to match that distribution? I would not. I would probably set it at 0.7 (70%), to flag positive cases with 70% certainty.
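To make that concrete, a small sketch with invented scores (the helper `precision_at` is just for illustration), showing how raising the threshold buys more certainty in the cases you flag:

```python
import numpy as np

# Hypothetical sigmoid outputs for an imbalanced diagnostic set:
# only 2 of 10 samples are actually diseased (label 1).
probs  = np.array([0.95, 0.85, 0.65, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05])
labels = np.array([1,    1,    0,    0,   0,    0,   0,   0,   0,   0])

def precision_at(threshold):
    """Fraction of flagged samples that are truly positive."""
    preds = probs > threshold
    if preds.sum() == 0:
        return 0.0
    return (preds & (labels == 1)).sum() / preds.sum()

# A higher threshold trades recall for more certainty in the positives:
print(precision_at(0.5))  # 0.4  (2 true positives out of 5 flagged)
print(precision_at(0.7))  # 1.0  (only the two confident positives remain)
```

The trade-off runs the other way too: a threshold that high will miss any diseased case the model scores below 0.7, so recall matters as well in a medical setting.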