Why aren't we using the difference of ŷ and y to check the effect of w and b?

The reason cross entropy is the appropriate loss function for classification does not really have to do with the fact that sigmoid output values can be very small (or very close to 1, for that matter). Please read the other thread that I linked above.
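For concreteness, here is a minimal NumPy sketch of the binary cross entropy loss used with a sigmoid output in the course. The function names (`sigmoid`, `binary_cross_entropy`) and the `eps` clipping are my own choices for the illustration, not course code:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)].
    eps guards against log(0) when y_hat is numerically 0 or 1."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# A confident correct prediction costs almost nothing,
# a confident wrong one is penalized heavily.
print(binary_cross_entropy(0.99, 1))  # ~0.01
print(binary_cross_entropy(0.01, 1))  # ~4.6
```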

We do need the derivatives of the loss function in order to do back propagation and optimize the solution. There are ways to use the plain difference as the loss function, but in regression problems it is more common to use the square of the difference as the loss metric, because the mathematical behavior of the derivatives of the squared distance is more useful. There are cases in which the L1 (absolute difference) loss is used instead, as sketched below, but none of this is really covered, or that relevant, in DLS Course 1.
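To illustrate why the derivatives matter, here is a small sketch of my own (not course code) comparing the squared difference (L2) loss and the absolute difference (L1) loss and their gradients with respect to the prediction. The L2 gradient is proportional to the error and shrinks smoothly as the prediction improves, while the L1 gradient has constant magnitude and is undefined at zero error:

```python
import numpy as np

def l2_loss_grad(y_hat, y):
    """Squared difference loss (1/2)*(y_hat - y)^2 and its derivative,
    which equals the error itself."""
    return 0.5 * (y_hat - y) ** 2, (y_hat - y)

def l1_loss_grad(y_hat, y):
    """Absolute difference loss |y_hat - y| and its (sub)gradient,
    which is +/-1 regardless of how large the error is."""
    return np.abs(y_hat - y), np.sign(y_hat - y)

y = 1.0
for y_hat in [2.0, 1.1, 1.001]:
    _, g2 = l2_loss_grad(y_hat, y)
    _, g1 = l1_loss_grad(y_hat, y)
    print(f"error {y_hat - y:+.3f}: L2 grad {g2:+.3f}, L1 grad {g1:+.3f}")
```

The printed gradients show the difference in behavior: the L2 gradient tapers off near the target, which gives gradient descent a natural way to slow down, whereas the L1 gradient stays at magnitude 1 until it flips sign.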
