In the Gradient Checking Implementation Notes video, Andrew Ng mentioned that “gradient checking doesn’t work with dropout regularization”. Let’s say that in one iteration we knock out some units of our neural network during forward propagation and calculate the cost “J_dropout”. We then knock out the same units during backprop to get the gradients of “J_dropout” with respect to the parameters. How would this be different from the derivative of J_dropout obtained by differentiation, given that we are using the same reduced NN in both cases?

If it is different, are we not violating the fact that backprop gives gradients?

Hi @sgurajap ,

You can in fact do gradient checking for one example and a specific set of disabled neurons with Dropout, but what you will be validating are the gradients for this specific dropout set.
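To make this concrete, here is a minimal NumPy sketch of gradient checking with one fixed dropout set. It is not the course's code: the tiny 2-layer network, the function names `cost_and_grads` and `gradient_check_fixed_mask`, and the sizes are all made up for illustration. The key point is that the mask is sampled once and reused in forward prop, backprop, and every perturbed cost evaluation.

```python
import numpy as np

def cost_and_grads(params, X, Y, mask, keep_prob):
    """Cost and backprop gradients of a tiny 2-layer net for ONE fixed dropout mask."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1) * mask / keep_prob          # inverted dropout with the given mask
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))   # sigmoid output
    cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = (W2.T @ dZ2) * mask / keep_prob        # same mask applied in backprop
    dZ1 = dA1 * (1 - np.tanh(Z1) ** 2)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return cost, [dW1, db1, dW2, db2]

def gradient_check_fixed_mask(keep_prob=0.8, eps=1e-7, seed=0):
    """Relative difference between backprop and numerical gradients."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((3, 5))
    Y = (rng.random((1, 5)) > 0.5).astype(float)
    params = [rng.standard_normal((4, 3)), np.zeros((4, 1)),
              rng.standard_normal((1, 4)), np.zeros((1, 1))]
    mask = (rng.random((4, 5)) < keep_prob).astype(float)  # sampled once, never resampled

    _, grads = cost_and_grads(params, X, Y, mask, keep_prob)
    num_grads = [np.zeros_like(p) for p in params]
    for p, g in zip(params, num_grads):
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            j_plus, _ = cost_and_grads(params, X, Y, mask, keep_prob)
            p[idx] = old - eps
            j_minus, _ = cost_and_grads(params, X, Y, mask, keep_prob)
            p[idx] = old                          # restore the parameter
            g[idx] = (j_plus - j_minus) / (2 * eps)

    a = np.concatenate([g.ravel() for g in grads])
    n = np.concatenate([g.ravel() for g in num_grads])
    return np.linalg.norm(a - n) / (np.linalg.norm(a) + np.linalg.norm(n))

print(gradient_check_fixed_mask())
```

A small relative difference here only tells you that backprop is correct for this particular mask. If the mask were resampled between the `j_plus` and `j_minus` evaluations, the two costs would come from different reduced networks and the numerical gradient would be meaningless.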

To validate the gradients over all possible dropout sets, you would need a cost function J that accounts for every one of those sets, and that function is extremely complicated to calculate.
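One way to write that down, as a sketch rather than anything taken from the video: assume each of the $n$ droppable units is kept independently with probability `keep_prob` (written $p$ below), let $d \in \{0,1\}^n$ be a dropout set, and let $J(\theta; d)$ be the cost of the reduced network for that set. These symbols are my own notation.

$$
J(\theta) \;=\; \mathbb{E}_{d}\big[J(\theta; d)\big] \;=\; \sum_{d \in \{0,1\}^{n}} p^{\lVert d \rVert_1}\,(1-p)^{\,n-\lVert d \rVert_1}\; J(\theta; d)
$$

The sum has $2^n$ terms, so even a layer with 50 units would need about $10^{15}$ cost evaluations for a single value of $J(\theta)$, which is why this J is not something you can compute and check against.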

In the Gradient Checking Implementation Notes video, at time 2:32, Andrew describes the difficulty of defining a cost function J that covers all the dropout sets.

And at time 3:30, he also mentions that gradient checking can be done for a specific dropout set, but in this case, you are not checking the complete NN.

In any case, when debugging a NN, it is really easy to disable Dropout (set keep_prob to 1.0), check the gradients, and then re-enable Dropout for training.
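Here is a small sketch of why setting keep_prob to 1.0 works. It assumes inverted dropout implemented with a random mask, as in the course. The helper `dropout_forward` and the toy activations are hypothetical names for illustration only.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout that samples a fresh mask on every call."""
    mask = rng.random(A.shape) < keep_prob   # rng.random() draws from [0, 1)
    return A * mask / keep_prob

rng = np.random.default_rng(1)
A = np.arange(1.0, 13.0).reshape(3, 4)       # toy activations

# keep_prob < 1: each call draws its own mask, so repeated cost
# evaluations generally see different reduced networks.
noisy_1 = dropout_forward(A, 0.5, rng)
noisy_2 = dropout_forward(A, 0.5, rng)

# keep_prob = 1.0: every draw is < 1.0, so the mask is all True and
# the layer passes A through unchanged on every call.
clean_1 = dropout_forward(A, 1.0, rng)
clean_2 = dropout_forward(A, 1.0, rng)
deterministic = np.array_equal(clean_1, A) and np.array_equal(clean_2, A)
print(deterministic)
```

With keep_prob at 1.0 the cost is a deterministic function of the parameters again, so the usual two-sided difference check applies to the complete network. After the check passes, set keep_prob back to its training value.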