Hi, what I mean is that the regularised decision boundary may “ignore” (as in, not contain) some edge-case data points.
However, what about situations where the edge case would genuinely always fall in that region of the plot (so that, for this particular combination of feature values, the true result would always lie outside the generalised boundary)?
Or do I perhaps misunderstand it? In a classification task, must a data point outside the regularised boundary that has the same label (true/false) as the data inside always be a false negative? That is, would the interpretation of such a point be that even though it is “true” (+), the algorithm predicts it is not?
Then what if the point is, in real life, true and was measured correctly? Then the algorithm is wrong to predict that it is not (because, in such an example, the point lies outside the decision boundary).
OK, too many features may cause overfitting and more data may fix it. But what I am asking here is: what if it is fake overfitting, i.e. it looks like overfitting that needs regularisation, but the model is actually not overfitted and the strange irregular boundary is correct? Then applying regularisation would worsen the predictions.
An example plot from the course as an illustration. Purple is the boundary I’d want.
Actually, whenever you build a model you are implicitly doing regularization with lambda = 0. Regularization lets you control both overfitting and underfitting: if you don’t want any regularization, set lambda = 0; otherwise, adjust lambda until the dev-set (or test-set) error is close to the training-set error and you reach the best trade-off between overfitting and underfitting. Besides that, you can, for example, add polynomial features if your model underfits.
As for what you call fake overfitting: you can detect it by splitting the data into train, dev, and test sets and measuring the error on all three. If the errors are close, the model only looks overfitted (fake overfitting); otherwise, it really is overfitting.
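A minimal sketch of moving lambda and watching the dev-set error, using scikit-learn (my choice for the illustration). Note that scikit-learn’s `LogisticRegression` parameterises regularization as C = 1/lambda, so lambda = 0 is approximated here by a very large C; the dataset, polynomial degree, and lambda grid are made-up values.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 2-feature data, expanded with polynomial terms so the model
# has enough capacity to overfit.
X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X = PolynomialFeatures(degree=6).fit_transform(X)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

best = None
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    C = 1e10 if lam == 0.0 else 1.0 / lam   # lambda = 0 ~ effectively unregularised
    model = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    dev_acc = model.score(X_dev, y_dev)
    # A large train/dev gap suggests overfitting at this lambda.
    print(f"lambda={lam:<5} train={train_acc:.3f} dev={dev_acc:.3f} "
          f"gap={train_acc - dev_acc:+.3f}")
    if best is None or dev_acc > best[1]:
        best = (lam, dev_acc)

print("best lambda by dev accuracy:", best[0])
```

The final model would then be the one whose dev error is both low and close to its training error.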
I hope this answers your questions. Please feel free to ask more.
I’d like to add that it is difficult, if not impossible, to get a perfect model where all predictions are True Positives or True Negatives.
Regularization is ‘one more’ tool in our toolkit to improve the quality of a model. And, as you mentioned, this is a tool that will help us mitigate overfitting.
As the model creator, you will have to define what it is that you want/need. If a model that fits most or all corner cases is what you need, go for it. Maybe your data distribution is such that this is what you need.
You show a purple boundary that is more ‘sophisticated’ than the green one: it reaches into more ‘corners’. Is it overfitting? It really depends on your data distribution across the training, validation, and test sets. Let’s say your model arrives at a boundary like the one you want. If all three sets produce similar accuracy under this model, then you do not have an overfitting model, even though the boundary is that irregular. But if you get this irregular boundary on your training set, and then process your test set and see a drop in accuracy, then you have an overfitting model that needs some attention.
In short, overfitting is not determined by how irregular the shape of your boundary is, but by the fact that your model may be very accurate on the training set (for example, 90%) but very poor on new data (say, 60%).
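A small numeric sketch of this point, under my own assumptions: a dataset with 20% label noise and a decision tree, which draws very jagged boundaries. Both trees produce irregular boundaries; what distinguishes them is the train/test accuracy gap, not the shape.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 2-feature data with 20% of labels flipped, i.e. irreducible noise.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# An unconstrained tree memorises the noise; a depth-limited one cannot.
deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

for name, m in [("deep", deep), ("shallow", shallow)]:
    print(f"{name:8s} train={m.score(X_train, y_train):.3f} "
          f"test={m.score(X_test, y_test):.3f}")
```

The deep tree reaches near-perfect training accuracy but drops noticeably on the test set, while the shallow tree’s two numbers stay much closer: the gap, not the jaggedness, is the diagnostic.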
Are the data points in your plot training data or test data?
An important yardstick for measuring overfitting is the model’s performance on the test set. So, if the model has high accuracy on the training set but a noticeable drop in accuracy on the test set, we consider this a classic case of overfitting: the model is not able to perform as well on unseen/new data.
We understand that this overfitting would generally not be caused by the data lying within the inner regions of the decision boundary, but by the data lying towards the outer periphery, nearer the meandering, twisting boundary.
Now, if this meandering boundary is not an aberration but rather the actual boundary we are after, then the test data should also contain points lying near the periphery which the twisting decision boundary classifies correctly. And if such data points do exist in the test data, then we should not see a noticeable dip in model accuracy on the test data.
If this noticeable dip does not happen, then this meandering, twisting boundary is the actual boundary we are after, and hence we are not seeing overfitting. Consequently, we would not have to resort to regularization.
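A sketch of exactly this case, under my own assumptions: two interleaving half-moons with little noise, so the twisty boundary is the real signal. A flexible model (k-nearest neighbours, my choice for the illustration) learns an irregular boundary, yet train and test accuracy stay close, so there is nothing to regularize away.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Low-noise moons: the true class boundary genuinely meanders.
X, y = make_moons(n_samples=500, noise=0.1, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)

# Irregular boundary, but no noticeable dip on held-out data.
print(f"train={train_acc:.3f} test={test_acc:.3f} "
      f"gap={train_acc - test_acc:+.3f}")
```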
Point to Ponder: We do not want our model to be influenced by noise. But if the noise shows up on both the training and test sets, is it still noise, or is it the actual signal?