Week 1 - Possible Mistake in Lecture Video?

I just watched the lecture video "Understanding Dropout" and I am confused about something. If W^[2] is large and we want to regularize it, shouldn't we apply keep_prob to a^[1] instead of a^[2] (as shown in the diagram)? Professor Ng sets keep_prob for a^[2] to 0.5 in order to regularize W^[2], but based on the following equation and my understanding, it should have been the activation of the previous layer (layer 1).

Z2 = np.dot(W2, (A1 * D1) / keep_prob) + b2

Hi Dorsa,

I think there’s a small confusion here. When we say we are dropping out some nodes from layer [l], we apply dropout in that same layer [l], using a Boolean mask d^[l] of the same shape as the activation a^[l]. This ensures that some node activations in layer [l] become 0, which in turn affects how the Z^[l+1] values are computed, since A^[l] now contains many zeros due to dropout.
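To make that concrete, here is a minimal sketch of the mask mechanics described above. The shapes and seed are illustrative assumptions, not from the lecture:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
A1 = np.random.rand(4, 3)   # hypothetical layer-1 activations (4 units, 3 samples)

# Boolean mask of the same shape as A1: True means "keep this activation"
D1 = np.random.rand(*A1.shape) < keep_prob

# Inverted dropout: zero the dropped units, rescale the survivors
A1_dropped = (A1 * D1) / keep_prob

# The zeros in A1_dropped are what change how Z2 = W2 @ A1_dropped + b2 comes out
```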

Hope it helps.
-Hari


Exactly, for the nodes selected to be “dropped out”, we just pretend that their output A^[l] is 0.

P.S.

The dropout code can be simplified, if so desired, to:

D = np.random.rand(*A1.shape)  # uniform random values in [0, 1)
D = (D > keep_prob)  # a matrix which has "True" in the places to zap
A1[D] = 0.0  # zap the places selected

Sorry, but I still don’t understand. If our concern and goal is to make changes to the weight parameter W^[l], which activation (from which layer) will we be downsizing? And maybe if you can explain mathematically how that W^[l] will be impacted, it will help me conceptualize it.

Maybe this is just a language issue, but I would not describe the way dropout works as “downsizing” anything. What it does is randomly disable the output of some neurons by setting their output values to zero in the layers to which you apply dropout. It is applied at the output of the layer, by multiplying the activation output of the layer, A^{[l]}, by a “mask” matrix whose elements are 1 or 0. The “keep probability” determines the probability that a given element of the mask matrix is 1.

You can apply dropout at multiple layers in the way that I just described.

The exact elements that are zeroed out by the mask in a given layer change randomly with each training iteration. Note that another subtlety is that the matrix A^{[l]} contains multiple columns, each of which is the output for one “sample”. The elements of each column that get “zapped” are also different for each sample in a given iteration.
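The per-sample point is easy to see from the mask's shape: it has one column per sample, so each column is drawn independently. A tiny sketch with made-up dimensions:

```python
import numpy as np

np.random.seed(2)
n_units, n_samples = 5, 4        # hypothetical layer width and batch size
keep_prob = 0.5

# One mask column per sample, so each sample drops its own set of units
D = np.random.rand(n_units, n_samples) < keep_prob
print(D.astype(int))             # columns generally differ from one another
```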

The effect that dropout has is to weaken the dependence of the later layers of the network on specific neurons in the given layer where dropout is applied. Note that it works both during forward and backward propagation.

This was all described in detail in the lectures. If what I said above does not make sense, it might be a good idea to just go back and listen to the dropout lectures again.

There are also a number of existing threads that discuss all this in more detail. Here’s one that delves into the point I mentioned above about the “zapping” being different per sample.

Here’s a thread with a number of posts that discusses the rationale for the factor of 1/keep_prob that you see in the dropout computation.
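As a quick numerical illustration of that rationale (my own sketch, with made-up shapes): dropping units scales the expected activation down by keep_prob, and dividing by keep_prob restores it, so later layers see activations of roughly the same magnitude during training as at test time.

```python
import numpy as np

np.random.seed(3)
keep_prob = 0.8
A = np.random.rand(100, 10000)   # a large hypothetical activation matrix

D = np.random.rand(*A.shape) < keep_prob
A_drop = (A * D) / keep_prob     # inverted dropout

# Averaged over many units and samples, the rescaled activations
# match the originals in expectation
print(A.mean(), A_drop.mean())   # the two means are close
```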
