I am reviewing DLS2 these days and trying to fill the gaps in my earlier understanding. I am wondering how the cost function differs between the L2 and dropout methods. In summary, in the L2 case we add a second term to the cost function in forward propagation and the corresponding term to the gradients in backward propagation. This clearly shrinks the weights and helps prevent the model from overfitting. To see the cost decrease monotonically with the number of iterations, we must include that second term.
However, the story is a bit different for dropout, and it is a bit vague to me. We apply dropout during training so that the model learns not to depend on any specific feature, which makes it more robust. To plot the cost against the number of iterations and observe it decreasing, we set the keep probability to 1 (no regularization). I would be thankful if someone could explain how the model can show a decreasing cost when the regularization effect is switched off at that point. We have practically removed the dropout effect, and it is then turned back on, as was said in the lecture.
Thanks in advance.
With any form of regularization, including both L2 and dropout, you only apply the regularization during training. You do not apply it in inference mode (just making predictions with the model). When you are using the cost to assess whether convergence is working, it’s a bit more complicated. In the case of L2, since the L2 term is simply an addition to the real cost, the changes in the value of the regularized cost are a bit clearer. In the case of dropout, Prof Ng’s point in the lectures is that with dropout enabled, it’s literally a different network on each training iteration, so comparing the cost values in that case is not “apples to apples”. That is why you need to compute the cost in “inference mode” with dropout disabled when you are using the cost to assess whether convergence is working or not.

Of course you could view this as a purely mathematical point, and the question is whether it really matters that much from a practical standpoint. I have not tried any experiments to see how much of an effect dropout actually has on the behavior of the cost. Of course the “keep probability” has to be a factor in that: if you’re using a relatively mild value (e.g. 0.9), it will matter less. If you ever run into this in “real life”, you can try some experiments and let us know what results you see comparing the two methods of judging convergence.
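To make the “different network on each iteration” point concrete, here is a minimal NumPy sketch. It is not the course's assignment code: the two-layer network, the helper names (`forward`, `cross_entropy`), the random data and the keep probability of 0.5 are all assumptions chosen for illustration. It evaluates the cost several times with the weights held fixed, once with inverted dropout on and once with it off.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params, keep_prob=1.0, rng=None):
    """Forward pass of a hypothetical 2-layer net.

    keep_prob=1.0 disables dropout (the "inference mode" network).
    keep_prob<1.0 applies inverted dropout to the hidden layer.
    """
    W1, b1, W2, b2 = params
    A1 = np.maximum(0, W1 @ X + b1)              # ReLU hidden layer
    if keep_prob < 1.0:
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(A1.shape) < keep_prob  # fresh random mask on every call
        A1 = A1 * mask / keep_prob               # rescale so the expected activation is unchanged
    return sigmoid(W2 @ A1 + b2)

def cross_entropy(A2, Y):
    eps = 1e-12                                  # guards against log(0)
    return float(-np.mean(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))                    # made-up inputs: 3 features, 200 examples
Y = (X[0:1] + X[1:2] > 0).astype(float)          # made-up binary labels
params = (rng.normal(size=(8, 3)), np.zeros((8, 1)),
          rng.normal(size=(1, 8)), np.zeros((1, 1)))

# Same weights, dropout on: each call samples a different sub-network.
costs_dropout = [cross_entropy(forward(X, params, 0.5, rng), Y) for _ in range(5)]
# Same weights, dropout off: the cost is a deterministic function of the weights.
costs_clean = [cross_entropy(forward(X, params), Y) for _ in range(5)]

print("dropout on :", costs_dropout)
print("dropout off:", costs_clean)
```

With dropout on, the printed cost differs from call to call even though no weight has changed, so part of any movement in a training curve would be mask noise. With dropout off, the value depends only on the weights, which is what you want when comparing one iteration to the next.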
Maybe the simpler way to state the answer is that the actual model that results from training at any point in the training process (any number of iterations) is the model with dropout disabled, right? That is the actual model in question. In other words, it is the model that we will actually use for making predictions.
One other higher level point to make is that looking at the cost is actually a fairly crude metric for the behavior of convergence. Prediction accuracy is actually the gold standard. Note that because accuracy is quantized, a lower cost does not necessarily mean higher accuracy.
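A small numeric illustration of that last point, using made-up predicted probabilities for two hypothetical models on four examples (the helper `cost_and_accuracy` and the 0.5 decision threshold are assumptions for this sketch):

```python
import numpy as np

def cost_and_accuracy(probs, labels):
    """Binary cross-entropy cost and accuracy at a 0.5 threshold."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cost = float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))
    acc = float(np.mean((probs > 0.5) == (labels == 1)))
    return cost, acc

labels = [1, 1, 0, 0]
model_a = [0.55, 0.55, 0.45, 0.45]   # every prediction correct, but barely confident
model_b = [0.99, 0.99, 0.01, 0.55]   # very confident, but the last prediction is wrong

cost_a, acc_a = cost_and_accuracy(model_a, labels)
cost_b, acc_b = cost_and_accuracy(model_b, labels)
print(f"model A: cost={cost_a:.3f}, accuracy={acc_a:.2f}")
print(f"model B: cost={cost_b:.3f}, accuracy={acc_b:.2f}")
```

Model B ends up with the lower cost and also the lower accuracy, because the cost rewards confidence on the examples it gets right while accuracy only counts which side of the threshold each prediction lands on.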
OK then, perfect. Regarding your first point, I agree with you. The L2 case is easier to explain and more obvious than the dropout case, at least when looking at the cost along the iteration axis. Anyway, it is confirmed that the robustness gained through dropout remains in the trained model even when dropout is disabled!