Cost function is not well defined using dropout

In the video “Understanding Dropout”, Prof. Andrew Ng says that with dropout the cost function of the neural network is no longer well defined, which is understandable. But then he says the learning curve (the plot of training loss vs. validation loss) cannot be used. Why? He also says that he often “turns off dropout or sets keep_prob = 1”, but when? During the training phase or the inference phase?

When I use TensorFlow and Keras for training, I know that a dropout layer behaves differently depending on whether the argument training is set to True or False, but I still don’t understand the phrase “turn off dropout or set keep_prob = 1”. When should I do that? And if I have a dropout layer in my network, does that mean I cannot plot a learning curve?
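To make the two behaviors concrete, here is a minimal NumPy sketch of inverted dropout (an illustration, not the course’s or Keras’s actual implementation). The training=False branch is exactly what “turn off dropout / set keep_prob = 1” means: the layer becomes an identity map.

```python
import numpy as np

def dropout_forward(a, keep_prob, training):
    """Inverted dropout on activations a.

    training=True  : randomly zero units, scale survivors by 1/keep_prob
                     so the expected activation is unchanged.
    training=False : identity map -- this is "dropout turned off", i.e.
                     equivalent to keep_prob = 1.
    """
    if not training or keep_prob == 1.0:
        return a  # dropout disabled: deterministic output
    mask = np.random.rand(*a.shape) < keep_prob
    return a * mask / keep_prob

a = np.ones((4, 3))
out_eval = dropout_forward(a, keep_prob=0.5, training=False)   # unchanged
out_train = dropout_forward(a, keep_prob=0.5, training=True)   # units are 0 or 2
```

This mirrors what Keras does with the training argument: dropout is random only while training; at inference the layer passes activations through unchanged.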

Isn’t the training loss in the learning curve computed using the network with dropout applied?

All forms of regularization (L1, L2, dropout …) are only applied during training. So anytime you are in “inference” (prediction) mode, you disable regularization, which includes setting keep_prob = 1 if you are using dropout. I think Prof Ng’s point in the section you are referring to is that when you compute training accuracy or even the cost for the purposes of plotting the learning curve, you treat it as if you are in prediction mode and set keep_prob = 1 to get a well defined cost function.
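To see concretely why keep_prob = 1 gives a well-defined cost, here is a toy NumPy sketch (the layer shape and squared-error cost are made-up illustrations, not course code): with dropout active, evaluating the “same” cost twice on identical weights and data gives two different numbers, because a fresh random mask is drawn each time; with keep_prob = 1 the cost is deterministic.

```python
import numpy as np

np.random.seed(0)

def forward_cost(X, W, y, keep_prob):
    """One ReLU hidden layer with inverted dropout, then a toy squared-error cost."""
    a = np.maximum(0, X @ W)                      # hidden activations
    if keep_prob < 1.0:
        mask = np.random.rand(*a.shape) < keep_prob
        a = a * mask / keep_prob                  # inverted dropout
    pred = a.sum(axis=1)                          # toy readout
    return float(np.mean((pred - y) ** 2))

X = np.random.randn(32, 8)
W = np.random.randn(8, 16)
y = np.random.randn(32)

# With dropout active, two evaluations of the "same" cost differ:
c1 = forward_cost(X, W, y, keep_prob=0.5)
c2 = forward_cost(X, W, y, keep_prob=0.5)

# With keep_prob = 1 the cost is deterministic and well defined:
d1 = forward_cost(X, W, y, keep_prob=1.0)
d2 = forward_cost(X, W, y, keep_prob=1.0)
```

That randomness is the whole point: a quantity that changes on every evaluation of the same parameters is not a well-defined function of those parameters, which is why you set keep_prob = 1 before logging it.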


Thanks for replying, sir. Then what about the part where Prof. Andrew Ng said the learning curve cannot be used? I’m confused: if we use a dropout layer, can we still monitor the training process with a learning curve?

Prof Ng is recommending against doing that. But you could do it if you used the technique I alluded to above: every 100th iteration of training, you could rerun the forward propagation with keep_prob = 1 and then use those results to compute a consistently defined cost and training and validation accuracy values. In other words, you would be using the full trained model (not the randomly subsetted one) every time, so the results will be meaningful. It would be a little more code, but it would allow you to track convergence in a mathematically correct way.

But it’s an interesting question whether the mathematical point Prof Ng is making here would really have that much effect in a “real world” sense. All of this behavior is statistical anyway, so the fact that dropout is perturbing the cost function could be considered as just adding some more statistical noise to the learning curve data. Of course, how much noise will depend on your keep_prob value.

You could try some experiments comparing the “pure” method I suggested above with just taking the “incorrect” perturbed cost every 100 iterations and see whether the two methods actually give results that are much different. But I’ve never tried that comparison, and Prof Ng is the expert here: he probably wouldn’t have brought this up and spent time on it if it didn’t really matter.
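Here is a minimal sketch of that logging scheme, using toy linear-regression data with inverted dropout on the inputs (the data, hyperparameters, and variable names are all illustrative assumptions, not code from the course). The key line is the periodic evaluation that reruns forward prop with dropout turned off before appending to the learning curve.

```python
import numpy as np

np.random.seed(1)

# Toy data (illustrative only): linear regression with a little noise.
X = np.random.randn(200, 5)
w_true = np.random.randn(5)
y = X @ w_true + 0.1 * np.random.randn(200)

w = np.zeros(5)
keep_prob = 0.8
lr = 0.01
curve = []  # (iteration, well-defined cost) pairs for the learning curve

for it in range(1, 501):
    # Training step: inverted dropout applied to the inputs.
    mask = (np.random.rand(*X.shape) < keep_prob) / keep_prob
    Xd = X * mask
    grad = Xd.T @ (Xd @ w - y) / len(y)
    w -= lr * grad

    # Every 100th iteration, rerun forward propagation with dropout
    # turned off (keep_prob = 1), so the logged cost uses the full
    # network and is mathematically well defined.
    if it % 100 == 0:
        full_cost = float(np.mean((X @ w - y) ** 2))
        curve.append((it, full_cost))
```

Plotting curve then gives a learning curve whose points are all evaluations of the same deterministic cost function, which is the property Prof Ng says you lose if you log the dropout-perturbed training cost directly.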