When looking for bias or variance, my understanding is that increasing the degree of polynomial corrects for bias while increasing lambda corrects for variance.
With that said, my understanding is that lambda should start small and you work your way up. Do you do the opposite with the degree of polynomial? Meaning, start with a high degree and work your way down?
How do you decide on starting values for lambda and degree of polynomial?
Or am I overthinking this and it’s all done within Tensorflow?
Thanks
Hello Amit,
This is a very good question, and you are not at all overthinking it. You are being curious about how models work, and that is a good learner's skill.
Bias and variance are diagnosed by comparing the training error with the cross-validation error, as you must have seen in the video.
The image below shows how lambda behaves for underfit and overfit models. As for where lambda should start: begin with an intermediate value; if your training dataset is fairly large, you can start with a smaller lambda.
Then, if your trained model shows high bias or high variance, adjust lambda as shown in the image below.
- If you add additional features, it will help you fix high bias.
- If you add polynomial features, it will also help fix high bias, because the model can fit the training set better.
- A smaller (or decreasing) lambda makes the model pay less attention to the regularisation term and more attention to fitting the training set, thereby fixing the high bias problem.
- A larger (or increasing) lambda is used when the model puts too much importance on fitting the training dataset and fails to generalise to new examples, which causes high variance. Increasing lambda forces the function to be smoother and not wiggle too much, thereby fixing high variance.
But the basic understanding when training a model is: if the algorithm has high variance, get more training data or simplify your model. Simplifying means using a smaller set of features or increasing lambda, so the algorithm has less flexibility to fit overly complex functions.
In the same way, if the algorithm has high bias, that means it is not doing well even on the training set; in that case you need to make your model more flexible by adding features or polynomial features and by decreasing the regularisation parameter lambda.
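To make that concrete, here is a rough sketch of how you might pick starting values in practice. This is not from the course code: the toy dataset, the candidate values, and the use of scikit-learn's Ridge as the L2-regularised model are just illustrative assumptions. It sweeps a few polynomial degrees and lambda values and keeps whichever combination gives the lowest cross-validation error:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge          # Ridge adds an L2 penalty; alpha plays the role of lambda
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data: 200 noisy samples of a sine curve, split 60% train / 40% cross-validation
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.4, random_state=0)

best = None
for degree in [1, 2, 3, 4]:                      # try a few polynomial degrees, low to high
    for lam in [0.01, 0.1, 1.0, 10.0]:           # sweep lambda over a few orders of magnitude
        poly = PolynomialFeatures(degree)
        scaler = StandardScaler()
        Xt = scaler.fit_transform(poly.fit_transform(X_train))
        Xc = scaler.transform(poly.transform(X_cv))

        model = Ridge(alpha=lam).fit(Xt, y_train)
        j_train = mean_squared_error(y_train, model.predict(Xt))
        j_cv = mean_squared_error(y_cv, model.predict(Xc))

        if best is None or j_cv < best[0]:
            best = (j_cv, degree, lam, j_train)

print(f"Lowest CV error {best[0]:.4f} at degree={best[1]}, lambda={best[2]} (train error {best[3]:.4f})")
```

Reading the two errors for any single setting matches the rules of thumb above: a low training error with a much higher cross-validation error points to high variance (increase lambda or simplify), while both errors being high points to high bias (add features or decrease lambda).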
This explanation is based on one of the videos from the same week by Prof Ng.
Regards
DP
Thanks Deepti for that explanation.
So practically, when running a NN, is it standard practice to always add the kernel_regularizer=L2(0.01) parameter, and does that only refer to layer 2?
The regularisation parameter usually starts with a small value, on the assumption that the algorithm will reach the accuracy you are looking for, given the features you have added, the right amount of data, and a proper split of the data.
Remember, if your training dataset is good, you won't need a high value of the regularisation parameter lambda, as your algorithm is already training well.
But in case it doesn't, you need to decide, based on your error values, whether your algorithm has high bias or high variance.
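For example (the numbers and the 1.5x thresholds below are purely illustrative, not a rule from the course), that decision could look like this:

```python
# Illustrative diagnosis from error values alone
baseline_error = 0.10   # assumed target / human-level performance
train_error    = 0.11   # error on the training set
cv_error       = 0.30   # error on the cross-validation set

if train_error > 1.5 * baseline_error:
    print("High bias: add features / polynomial terms, or decrease lambda")
elif cv_error > 1.5 * train_error:
    print("High variance: get more data, simplify the model, or increase lambda")
else:
    print("Errors look balanced")
```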
L1 regularization adds the absolute values of the weights to the cost, while L2 regularization adds the squares of the weights. So the L2 here refers to the type of penalty, not to layer 2.
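As a minimal sketch in Keras (the layer sizes and the 0.01 value are placeholders, not a prescription), note that you can attach a regularizer to any layer you want, not only the second one:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# L2(0.01) adds 0.01 * sum(w**2) of that layer's weights to the loss;
# L1(0.01) would add 0.01 * sum(|w|). The "2" names the penalty, not a layer.
model = tf.keras.Sequential([
    layers.Dense(25, activation="relu",
                 kernel_regularizer=regularizers.L2(0.01)),   # regularize this layer's weights
    layers.Dense(15, activation="relu",
                 kernel_regularizer=regularizers.L2(0.01)),   # ...and this layer's as well
    layers.Dense(1, activation="linear"),                     # output layer left unregularized here
])
model.compile(optimizer="adam", loss="mse")
```

It is not mandatory to add it to every layer or to use 0.01; that value plays the role of lambda and is tuned the same way as discussed above.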
Below I have shared some of the advantages and disadvantages of L1 and L2 regularisation, as well as the differences between the two.
Regards
DP