Plotting Cost Function J with Dropout

In Course 2, Week 1, "Regularizing your Neural Network" section, "Understanding Dropout" video, in the last minute of the video Prof. Ng says to plot the cost function J without dropout (turn off dropout by setting keep_prob to 1, run gradient descent, and then plot the cost function J).
I wondered: what is wrong with plotting the cost function while dropout is applied?
I know dropout removes some nodes randomly, but this random removal just adds noise to the cost function. That means we might not see the expected monotonically decreasing plot of J, but overall it should keep its decreasing trend with some noise on top. Am I right?
If the above is true, then we should still be able to double-check how gradient descent is performing by plotting the cost function J with dropout applied.
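To make the question concrete, here is a rough sketch of the experiment I have in mind (my own toy code, not the assignment's; all names, values, and the toy dataset are made up): a tiny 2-layer numpy network with inverted dropout on the hidden layer, trained once with keep_prob = 1.0 and once with keep_prob = 0.8, logging the cost at every iteration so the two curves can be compared.

```python
import numpy as np
import matplotlib.pyplot as plt

def train(keep_prob, num_iterations=2000, lr=0.3, seed=0):
    """Tiny 2-layer net (tanh -> sigmoid) with inverted dropout on the hidden layer.
    Returns one cross-entropy cost per iteration."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((2, 400))                 # 2 features, 400 toy examples
    Y = (X[0:1, :] * X[1:2, :] > 0).astype(float)     # made-up labels
    W1, b1 = rng.standard_normal((16, 2)) * 0.5, np.zeros((16, 1))
    W2, b2 = rng.standard_normal((1, 16)) * 0.5, np.zeros((1, 1))
    m = X.shape[1]
    costs = []
    for _ in range(num_iterations):
        # Forward pass; D1 is the random dropout mask (all ones when keep_prob == 1)
        Z1 = W1 @ X + b1
        A1 = np.tanh(Z1)
        D1 = rng.random(A1.shape) < keep_prob
        A1_drop = A1 * D1 / keep_prob                 # inverted dropout: scale survivors up
        Z2 = W2 @ A1_drop + b2
        A2 = 1.0 / (1.0 + np.exp(-Z2))
        cost = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
        costs.append(cost)
        # Backward pass (same mask and scaling applied to the gradient of A1)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1_drop.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dA1 = (W2.T @ dZ2) * D1 / keep_prob
        dZ1 = dA1 * (1.0 - A1 ** 2)                   # tanh'(Z1), using the pre-dropout A1
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return costs

plt.plot(train(keep_prob=1.0), label="keep_prob = 1.0 (no dropout)")
plt.plot(train(keep_prob=0.8), label="keep_prob = 0.8", alpha=0.6)
plt.xlabel("iteration"); plt.ylabel("cost J"); plt.legend(); plt.show()
```

Nothing hinges on the specific toy dataset or learning rate here; the point is just that the same training loop can log J with and without dropout so the two curves can be put side by side.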

Prof Ng does give a more detailed explanation in the lectures about the effect of dropout on the cost function. It turns out that technically you have a different cost function on each iteration, because dropout causes the network itself to be different each time. Mind you, I have not really run any experiments to see how severe this effect actually is; of course the keep_prob value will matter in that determination. But the other general point is that when you’re tuning hyperparameters, “orthogonality” is really helpful: if you’re turning multiple dials at the same time (number of iterations, learning rate, keep_prob, …), then it becomes a lot more difficult to know which parameter change is having which effect. So you can keep things simpler by first tuning your Gradient Descent without dropout and then applying dropout later to remedy any overfitting. At least that is my interpretation of what Prof Ng is saying here.
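To make the “different cost function on each iteration” point concrete, here is a tiny sketch (my own toy code, not anything from the course): with the parameters held completely fixed, two forward passes that differ only in the random dropout mask already give two different values of J, because each mask defines a different sub-network.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))                        # 3 features, 5 toy examples
Y = np.array([[1, 0, 1, 1, 0]], dtype=float)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

def cost_with_dropout(keep_prob):
    """One forward pass with a fresh random mask; the parameters are never changed."""
    A1 = np.maximum(0, W1 @ X + b1)                    # ReLU hidden layer
    D1 = rng.random(A1.shape) < keep_prob              # random dropout mask
    A1 = A1 * D1 / keep_prob                           # inverted dropout
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))

print(cost_with_dropout(0.8))   # one sub-network
print(cost_with_dropout(0.8))   # a different sub-network -> a different J, same parameters
print(cost_with_dropout(1.0))   # keep_prob = 1 always gives the same, well-defined J
```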

Yeah, this is right, and I totally understand that randomly removing nodes changes the value of the cost function, so the definition of the cost function isn’t stable anymore. But the effect of randomly removing nodes (and scaling the surviving activations up by dividing by keep_prob) is just noise, and the size of that noise depends on the value of keep_prob. This can produce big, noticeable errors if you look at a single iteration, but in deep learning we run forward and backward propagation many times, which means we end up with a whole set of data points (the value of the cost function at each iteration). Individually these data points contain random errors (noise), but based on the central limit theorem the average of these data points will be extremely close to the average we would get without dropout. So orthogonalization won’t be damaged, and we will still get good insight if we look at the whole set of data points and the trend, instead of checking each iteration individually.
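As a rough illustration of what I mean by looking at the trend instead of individual iterations, one could smooth the per-iteration costs with a simple moving average. This sketch just uses a synthetic noisy cost curve as a stand-in for a real dropout training run:

```python
import numpy as np

def moving_average(costs, window=100):
    """Running mean over the last `window` cost values -- averages out the per-iteration noise."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(costs, dtype=float), kernel, mode="valid")

# Synthetic stand-in for a dropout training run: a decreasing trend plus random noise.
rng = np.random.default_rng(0)
iterations = np.arange(1, 3001)
noisy_costs = 0.7 / np.sqrt(iterations) + rng.normal(0.0, 0.05, iterations.size)

smoothed = moving_average(noisy_costs, window=100)
print("raw cost at iteration 3000:     %.4f" % noisy_costs[-1])
print("smoothed cost (last 100 iters): %.4f" % smoothed[-1])   # much closer to the underlying trend
```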

Even without dropout, we don’t look at the cost on every individual iteration, because we only care about the trend: there is no guarantee of monotonicity, or even of convergence, with any particular fixed learning rate and initial starting point. The cost surfaces of neural networks are incredibly complicated.

The Central Limit Theorem is helpful for the case of sampling costs from a given distribution, but the distributions are different with and without dropout, at least in theory and until proven otherwise, right? The point is that this is a different source of variability than simply sampling larger and larger subsets of a given distribution.
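One way to get a feel for how the two distributions relate in a specific case (without proving anything in general) is a quick Monte Carlo over random masks with the parameters held fixed. Again, this is just my own toy sketch, not course code:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((3, 200))                      # 3 features, 200 toy examples
Y = (X[0:1, :] > 0).astype(float)
W1, b1 = rng.standard_normal((8, 3)), np.zeros((8, 1))
W2, b2 = rng.standard_normal((1, 8)), np.zeros((1, 1))

def cost(keep_prob):
    """Cross-entropy cost for one forward pass with a fresh dropout mask."""
    A1 = np.maximum(0, W1 @ X + b1)
    D1 = rng.random(A1.shape) < keep_prob
    A1 = A1 * D1 / keep_prob
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))

j_no_dropout = cost(1.0)                               # deterministic reference value
samples = np.array([cost(0.8) for _ in range(10000)])  # 10k random sub-networks
print("J without dropout:     %.4f" % j_no_dropout)
print("mean J with dropout:   %.4f" % samples.mean())
print("std of J with dropout: %.4f" % samples.std())
```

Because the cross-entropy is a nonlinear function of the activations, the mean of the dropout costs does not have to coincide with the no-dropout cost, even though inverted dropout keeps the expected activations the same. That is one concrete sense in which the two distributions can differ, beyond just added noise.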