The need for a bias term

Hello everyone,

Again with a probably very basic question :slight_smile:

Each ‘neuron’ is divided into two parts: a 1st part which applies Z = W.X + b (the linear function) and a 2nd part which applies the Sigmoid of Z (the activation function) to deliver an output between 0 and 1. Trying to understand the importance of each part (and please correct me if I’m mistaken), I come to the conclusion that if we did not have the linear function, we would have no way of improving our algorithm, because there would be no parameters to update (and therefore no need for the cost function); and if we had no activation function, we would have no way of passing the output of Layer 1 on as an input to Layer 2, so we would be cutting the communication between Layers, effectively killing the dynamic of Forward and Back Propagation.
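To check whether I am picturing this correctly, here is a tiny numpy sketch of what I understand a single neuron to do (the numbers and variable names are made up, just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.1, 0.4, -0.2])   # weights, one per feature
b = 0.3                          # bias

z = np.dot(w, x) + b             # 1st part: the linear function Z = W.X + b
a = sigmoid(z)                   # 2nd part: the activation, a value between 0 and 1
print(z, a)
```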

Regarding the linear function, what specifically is the purpose of the bias term, besides being a parameter that can also be updated? Searching a little bit about it, I see that its purpose is, among others, to generate an output from the linear function even when the input (X) is zero. But if the input X is zero, why would I want this Z (which would be a constant, since W.X would be 0) to be passed on to the activation function and along to the other Layers? I also read that it is important to have it as an updatable parameter to improve the model’s accuracy; but here I’m tempted to ask why not then add a third and fourth parameter just to have them updated as well (effectively creating a new linear function, Z = W.X + b [- xyz, …]) and improve the general model accuracy?

The role of W is a little clearer to me, as it allows us to increase the strength (weight) of certain X features, but I still can’t clearly see the really important role of b (and what we would miss if the linear function were simply Z = W.X). See the sketch below for what I mean.
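To make that last question concrete, here is a small sketch (again, purely illustrative) of what seems to happen at X = 0 if b is dropped:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.zeros(3)                    # an all-zero input
w = np.array([2.0, -1.0, 0.5])     # any weights at all

z_without_bias = np.dot(w, x)        # always 0, whatever W is
z_with_bias = np.dot(w, x) + 0.7     # b can shift Z away from 0

print(sigmoid(z_without_bias))       # always 0.5
print(sigmoid(z_with_bias))          # depends on b
```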


Hi jpedroanascimento,

regarding the activation function: its purpose is to transform the linear regression output into a form that can be better digested by either a subsequent layer or the final output.
You could also drop it - then the response term Z would simply go into the next layer/output.
For instance, when the target label is not a class but a continuous value, then you would not define an activation function for the final output: the output would be just Z.
In the case of an internal layer node, the ‘ReLU’ activation is quite effective. The only thing it does is transform negative values of Z into zeros, hence max(0, Z).
But what is the effect of this change? Whenever Z is negative, the neuron does “not fire” and the weight associated with this Z in the next layer is likely to be zero. Such signals will be ignored and the weight matrix becomes sparse (which has benefits of its own).
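A tiny numpy sketch of that effect (values chosen only for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)        # ReLU: element-wise max(0, Z)

z = np.array([-2.0, -0.3, 0.0, 1.5, 4.2])
print(relu(z))                     # [0.  0.  0.  1.5 4.2] -> negative Z does "not fire"
```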

Regarding the bias term, it is just an intercept or offset for the linear regression. You can imagine it as a shift of the whole regression formula X.W up or down along the Z axis (the output).
If X = 0, the desired Z could still be non-zero, and then you need the bias.
As an example, imagine a white pixel with value 0, but you want to predict it as class 1 (Z=1): Z = X.W + b → 1 = 0.W + b → b = 1
A model fit would learn that b must be 1 for the least loss.
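Sticking with that white-pixel example, here is a minimal sketch (purely illustrative, assuming a squared loss and plain gradient descent) of how a fit pushes b towards 1:

```python
x, y = 0.0, 1.0          # white pixel (input 0), target output 1
w, b = 0.5, 0.0          # arbitrary starting parameters
lr = 0.1                 # learning rate

for _ in range(200):
    z = w * x + b        # prediction (no activation, as for a continuous output)
    dz = 2 * (z - y)     # gradient of the squared loss (z - y)**2 w.r.t. z
    w -= lr * dz * x     # x = 0, so w never changes here
    b -= lr * dz         # b absorbs the whole correction

print(w, b)              # b converges to 1; without b the loss could never reach 0
```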
