Week 1, Dropout Regularization

In the lesson, after reducing the activation matrix a3 by 20%, it is divided by the keep_probs value to scale it back up. I don’t understand what’s happening here.

How does a3 /= keep_probs work?

Dropout accepts two inputs: one is, of course, the input tensor, and the other is a "training" flag that tells the layer whether the network is running in training mode or in inference (prediction) mode. The reason there are two modes is that dropout is active during training but is turned off at prediction time. To get outputs on the same scale at prediction time as at training time, the amount of signal flowing through the network should match what it was during training, when dropout was active. If keep_probs = 0.8, then the amount of signal flowing through the network at training time is 0.8 times what it is at prediction time.
So there are two ways to make them match: reduce the flow at prediction time by multiplying by 0.8, or increase the flow at training time by dividing by 0.8. The latter is the "inverted dropout" that Andrew introduced, and it is what you are asking about; see the sketch below.
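
Here is a minimal NumPy sketch of that idea, in the spirit of the course's manual implementation. The variable names a3 and keep_probs come from the question; the shape of a3 and the use of the sum of absolute activations as a rough measure of "flow" are just illustrative assumptions.

```python
import numpy as np

np.random.seed(1)
keep_probs = 0.8
a3 = np.random.randn(5, 4)          # stand-in activations for layer 3 (shape is arbitrary)

# Training time, inverted dropout:
d3 = np.random.rand(*a3.shape) < keep_probs  # keep each unit with probability 0.8
a3_train = (a3 * d3) / keep_probs            # zero out ~20% of units, scale survivors by 1/0.8

# Without the division, the training-time "flow" would be only ~0.8x the
# inference-time flow; dividing by keep_probs restores the match on average:
print(np.abs(a3).sum())        # inference-time flow (no dropout applied)
print(np.abs(a3_train).sum())  # training-time flow, roughly the same in expectation
```

The alternative (plain, non-inverted dropout) would skip the division at training time and instead multiply the activations by keep_probs at prediction time; inverted dropout is preferred because it leaves the prediction-time code untouched.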


Thanks! I will look up graph theory now

Here’s a thread from a while back that discusses this in more detail and shows examples of the effect of the inverted dropout on the L2 Norm of the activation outputs.


Thank you! That’s helpful