Hello @jiikoo,
If I were you, I would start by asking a question: “What if I regularize the output layer?”
I know, I know: the assignment doesn’t allow us to do it because doing so would fail a test, but if we forget about the test, we actually can, can’t we? So I ran some very quick experiments (and so can you!), and my results are below:
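If you want to run the same kind of experiment yourself, here is a minimal sketch of what “L2 on every layer, output included” could look like in Keras. The layer sizes, the from_logits setup, and the `build_model` helper are my illustrative assumptions, not the assignment’s exact code:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2

def build_model(lambdas):
    """A small classifier; `lambdas` holds one L2 lambda per Dense layer.

    Layer sizes are illustrative, not the assignment's exact ones.
    """
    lam1, lam2, lam3 = lambdas
    model = Sequential([
        Dense(25, activation="relu", kernel_regularizer=L2(lam1)),
        Dense(15, activation="relu", kernel_regularizer=L2(lam2)),
        # the "forbidden" part: an L2 penalty on the output layer too
        Dense(10, activation="linear", kernel_regularizer=L2(lam3)),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(0.001),
    )
    return model

model = build_model((0.01, 0.01, 0.01))  # illustrative lambdas; tune them!
```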
So, by comparing the 1st & the 2nd results, we know that regularizing the output layer didn’t do any good, but if we tune the lambda value (which we always have to), as in the 3rd result, it wasn’t so bad compared with the 1st!
Therefore, the messages I take from these results are: (1) we can do it; we only need to step outside the box (the assignment)! (2) we always need to tune lambda for the best result.
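Point (2) is easy to automate. Reusing the `build_model` sketch above (the candidate lambda values and the synthetic stand-in data are placeholders; swap in your real training set):

```python
import numpy as np

# synthetic stand-in data so the sketch runs; use your real X_train / y_train
X_train = np.random.randn(500, 64).astype("float32")
y_train = np.random.randint(0, 10, size=500)

for lam in [0.0, 0.001, 0.01, 0.1]:
    model = build_model((lam, lam, lam))  # build_model from the sketch above
    hist = model.fit(X_train, y_train, validation_split=0.2,
                     epochs=30, verbose=0)
    print(f"lambda={lam}: val loss = {hist.history['val_loss'][-1]:.4f}")
```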
Further on (1): if you remember when we first learned about regularization in a previous week, we applied it to output-layer-only logistic regression, right? In other words, we did apply it to the output layer (logistic regression has only an output layer), and that didn’t cause any problem, right? So, we can tell ourselves that it is not absolutely wrong to regularize an output layer.
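To make the analogy concrete, regularized logistic regression rewritten in the same Keras style as above would be just this (a sketch only; the course’s own implementation differs):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2

logreg = Sequential([
    # logistic regression *is* just an output layer - and we regularized it
    Dense(1, activation="sigmoid", kernel_regularizer=L2(0.01)),
])
logreg.compile(loss="binary_crossentropy", optimizer="adam")
```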
However, though my 3rd result wasn’t bad, it wasn’t better (than the 1st) either! Of course, I only spent about 2 minutes on these experiments, and I wouldn’t be surprised if you could find a better lambda value.
Still, would there be any argument for me to skip regularizing the output layer? Yes! There are two reasons:
- Previous layers are already properly and sufficiently (with the right lambda) regularized to counter the overfitting effect. (Note: in output-layer-only logistic regression, there is no previous layer, so the output layer is our only choice for regularization.)
- While L2-regularization tends to make the weights in the output layer smaller, our friend softmax may try to make some weights larger! You see - L2-regularization always penalizes (large) weight values, but softmax has to make sure the outputs of the output layer add up to one! In other words, softmax won’t allow regularization to push the weights too small, or even just small enough. In short, softmax can fight L2-regularization, as the numeric sketch after this list shows!
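Here is a tiny numeric illustration of that tension (my own sketch in plain NumPy; the logits and the “true label” are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.5])   # pretend these are the output layer's z values
for scale in [1.0, 0.5, 0.1]:        # L2 shrinking the weights shrinks the logits
    p = softmax(scale * logits)
    # cross-entropy loss, assuming class 0 is the true label
    print(f"scale={scale}: probs={np.round(p, 3)}, loss={-np.log(p[0]):.3f}")
```

As the weights (and hence the logits) shrink, softmax drifts toward a uniform distribution and the cross-entropy loss climbs (from about 0.20 up to about 0.95 here), so the data term of the loss pushes the output weights back up against the L2 term.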
How can I avoid them fighting each other? I relax things by not using L2-regularization on one of the layers. That does not have to be the output layer; in fact, when I chose to regularize only the first and the output layers, the result, again, wasn’t so bad (compare the 2nd & the 4th results, and the 1st & the 4th):
So, in effect, I let L2-regularization do its best on two of the layers, while softmax, even though its pressure reaches all layers (through back-propagation), can act more on the unregularized one to fulfill its ultimate goal of “adding up to 1” and give the other two a chance. Again, I only spent about 20 seconds on my 4th result above, so it may still be suboptimal, and it wouldn’t be surprising to find a better lambda value for regularizing only Layer 1 and Layer 3.
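In terms of my earlier `build_model` sketch, that 4th configuration is simply this (lambda values illustrative):

```python
# L2 on Layer 1 and the output layer (Layer 3); the middle layer is relaxed
model = build_model((0.01, 0.0, 0.01))
```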
Having said that relaxing one layer would be my strategy to avoid the fight, I would still choose to relax the output layer rather than the middle layer. My reason relates to back-propagation, a critical concept in deep learning that is unfortunately not yet fully covered in MLS (it is covered in DLS, the Deep Learning Specialization), so I will skip that reason in this discussion.
I have gone through a lot in this discussion, so it is natural if some readers want to read it multiple times, or even run some experiments like mine, to better grasp the ideas. Here are my key take-aways:
- There is nothing absolutely wrong about regularizing an output layer.
- Regularizing or not, tuning for the best lambda value is standard practice. We don’t say regularizing the output layer is bad; we say regularizing it without being able to find a set of good lambda values is bad.
Cheers!
Raymond