I'm curious how backpropagation is implemented when using dropout and regularization. Do you just hold the "W parameter" updates static for the nodes that were dropped? Or am I thinking about this incorrectly? I'd welcome any links I may have missed when googling this.
Just started going through the regularization notebook and see it's covered there… never mind!
Hey @Rob_Chavez,
Glad you found it in the notebook. I just wanted to point out the Dropout paper http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf which has a specific section on how dropout affects backprop.
Cheers and happy learning!
It sounds like you've already found the answer, but one other point worth making is to generalize this to all forms of regularization: how regularization affects back prop is determined by how it affects forward prop. Forward prop is a composition of functions, and you use the Chain Rule to take the derivatives of all those layers of functions in order to do back prop, right? So what the functions are determines their derivatives, which determines what effect they have during back prop.

In the case of dropout, you are literally zeroing some of the neurons on a per-sample basis, so the derivative is also zero for those particular elements, and it is also affected by the "reverse scaling" by \frac{1}{keepProb}. In the case of L2 regularization, the mechanism is completely different: you just get a number of new terms in the summation of the cost function, and the derivatives of those terms are included in the gradients as well at back prop time.
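To make that concrete, here is a minimal NumPy sketch of a single ReLU layer that combines both mechanisms (the function names, the shapes, and the choice to mix dropout with L2 in one layer are my own for illustration, not the notebook's code): the backward pass reuses the exact mask `D` and the same \frac{1}{keepProb} scaling that the forward pass applied, and L2 just adds a \frac{\lambda}{m} W term to `dW`.

```python
import numpy as np

def forward_with_dropout(A_prev, W, b, keep_prob=0.8):
    """One hidden-layer forward step with inverted dropout (illustrative sketch)."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z)                          # ReLU activation
    D = np.random.rand(*A.shape) < keep_prob      # dropout mask, drawn per sample
    A = A * D / keep_prob                         # zero dropped units, rescale the rest
    return A, (A_prev, W, Z, D)

def backward_with_dropout(dA, cache, keep_prob=0.8, lambd=0.0):
    """Matching backward step: apply the same mask and scaling to dA,
    then add the L2 term (lambd / m) * W to dW."""
    A_prev, W, Z, D = cache
    m = A_prev.shape[1]                           # number of samples in the batch
    dA = dA * D / keep_prob                       # same zeros and 1/keep_prob as forward prop
    dZ = dA * (Z > 0)                             # derivative of ReLU
    dW = (dZ @ A_prev.T) / m + (lambd / m) * W    # L2 regularization shows up here
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                            # gradient passed to the previous layer
    return dA_prev, dW, db

# Tiny usage example with made-up shapes:
A_prev = np.random.randn(3, 5)                    # 3 features, 5 samples
W = np.random.randn(4, 3) * 0.01
b = np.zeros((4, 1))
A, cache = forward_with_dropout(A_prev, W, b, keep_prob=0.8)
dA = np.random.randn(*A.shape)                    # pretend gradient from the next layer
dA_prev, dW, db = backward_with_dropout(dA, cache, keep_prob=0.8, lambd=0.7)
```

So nothing is "held static" for the dropped nodes per se: their contribution to the gradient is simply zero for that sample, because the mask that zeroed them in forward prop is applied again in back prop.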
Thank you for the reply and the paper… and with Hinton as an author, no less. Sweet!
Thank you, @paulinpaloalto, that was a great explanation. I was able to calculate the derivatives for backprop in the first course and would like to give it a go again for this one.