Gradients with dropout

The gradients are just the derivatives of the forward functions, right? And the forward function is scaled up by the factor 1/keep_prob, so the gradients get scaled up by that same factor. That is the so-called "inverted" dropout. Prof Ng explained why this is done in the lectures: you will eventually be running the network in prediction mode with the trained weights and no dropout (all forms of regularization happen only at training time). So if you don't compensate for the dropout, the weights of the subsequent layers will be trained on lower activation outputs than they are really getting when applied in "prediction" mode. Here's a thread which discusses this in more detail. To get the full picture, you'll need to read all the posts on the rest of that thread from that point forward.
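To make the symmetry concrete, here is a minimal NumPy sketch of inverted dropout. The function names and the use of `keep_prob` are my own illustration, not code from the assignment; the point is that the backward pass applies exactly the same mask and the same 1/keep_prob scaling as the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(A, keep_prob):
    """Inverted dropout: zero out units, then scale survivors up by
    1/keep_prob so the expected activation matches prediction mode."""
    D = rng.random(A.shape) < keep_prob   # boolean dropout mask
    A_dropped = (A * D) / keep_prob       # kill some units, rescale the rest
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Gradient of the forward step: zeroed units get zero gradient,
    surviving units are scaled by the same 1/keep_prob factor."""
    return (dA * D) / keep_prob
```

Because the scaling happens at training time, prediction-mode code needs no dropout logic at all: you simply skip both functions and use the activations as-is.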