Week 1 - Possible Mistake in Lecture Video?

I just watched the lecture video "Understanding Dropout" and I am confused about something. If W^[2] is large and we want to regularize it, shouldn't we apply keep_prob to a^[1] instead of a^[2] (as shown in the diagram)? Professor Ng sets keep_prob for a^[2] to 0.5 in order to regularize W^[2], but based on the following equation and my understanding, it should have been the activation of the previous layer (layer 1).

Z2 = np.dot(W2, (A1 * D1) / keep_prob) + b2

Hi Dorsa,

I think there’s a small confusion here. When we say we are dropping out some nodes from layer [l], we apply dropout in that same layer [l], using a Boolean mask d^[l] of the same shape as the activation a^[l]. This ensures that some node activations in layer [l] become 0, which in turn affects how the Z^[l+1] values are computed, since A^[l] now contains many zeros due to dropout.
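To make that concrete, here is a minimal sketch of the mask mechanics described above. The shapes and seed are illustrative assumptions, not from the lecture:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
A1 = np.random.rand(4, 3)   # hypothetical layer-1 activations (4 units, 3 samples)

# Boolean mask of the same shape as A1: True means "keep this activation"
D1 = np.random.rand(*A1.shape) < keep_prob

# Inverted dropout: zero the dropped units, rescale the survivors
A1_dropped = (A1 * D1) / keep_prob

# The zeros in A1_dropped are what change how Z2 = W2 @ A1_dropped + b2 comes out
```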

Hope it helps.
-Hari


Exactly, for the nodes selected to be “dropped out”, we just pretend that their output A^[l] is 0.

P.S.

The dropout code can be simplified, if so desired, to:

D = np.random.rand(*A1.shape)  # uniform random values in [0, 1)
D = (D > keep_prob)  # a matrix which has "True" in the places to zap
A1[D] = 0.0  # zap the places selected

Sorry, but I still don’t understand. If our concern and goal is to make changes to the weight parameter W^[l], which activation (from which layer) will we be downsizing? And maybe if you can explain mathematically how that W^[l] will be impacted, it will help me conceptualize it.

Maybe this is just a language issue, but I would not describe the way dropout works as “downsizing” anything. What it does is randomly disable the output of some neurons by setting their output values to zero in the layers to which you apply dropout. It is applied at the output of the layer, by multiplying the activation output of the layer, A^{[l]}, by a “mask” matrix whose elements are 1 or 0. The “keep probability” determines the probability that a given element of the mask matrix is 1.

You can apply dropout at multiple layers in the way that I just described.

The exact elements that are zeroed out by the mask in a given layer change randomly with each training iteration. Note that another subtlety is that the matrix A^{[l]} contains multiple columns, each of which is the output for one “sample”. The elements of each column that get “zapped” are also different for each sample in a given iteration.
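The per-sample point is easy to see from the mask's shape: it has one column per sample, so each column is drawn independently. A tiny sketch with made-up dimensions:

```python
import numpy as np

np.random.seed(2)
n_units, n_samples = 5, 4        # hypothetical layer width and batch size
keep_prob = 0.5

# One mask column per sample, so each sample drops its own set of units
D = np.random.rand(n_units, n_samples) < keep_prob
print(D.astype(int))             # columns generally differ from one another
```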

The effect that dropout has is to weaken the dependence of the later layers of the network on specific neurons in the given layer where dropout is applied. Note that it works both during forward and backward propagation.

This was all described in detail in the lectures. If what I said above does not make sense, it might be a good idea to just go back and listen to the dropout lectures again.

There are also a number of existing threads that discuss all this in more detail. Here’s one that delves into the point I mentioned above about the “zapping” being different per sample.

Here’s a thread with a number of posts that discusses the rationale for the factor of 1/keep_prob that you see in the dropout computation.
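As a quick numerical illustration of that rationale (my own sketch, with made-up shapes): dropping units scales the expected activation down by keep_prob, and dividing by keep_prob restores it, so later layers see activations of roughly the same magnitude during training as at test time.

```python
import numpy as np

np.random.seed(3)
keep_prob = 0.8
A = np.random.rand(100, 10000)   # a large hypothetical activation matrix

D = np.random.rand(*A.shape) < keep_prob
A_drop = (A * D) / keep_prob     # inverted dropout

# Averaged over many units and samples, the rescaled activations
# match the originals in expectation
print(A.mean(), A_drop.mean())   # the two means are close
```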
