Hi everyone, I understand the idea behind dropout in forward prop, where we mask A1 with D1 and then scale the result by dividing A1 by keep_prob.
A1 = A1*D1
A1 = A1/keep_prob
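
For reference, here's roughly the full forward step as I understand it (just my own sketch with a made-up shape for A1, using numpy):

import numpy as np
keep_prob = 0.8
A1 = np.random.randn(4, 5)                                                # layer 1 activations (made-up shape)
D1 = (np.random.rand(A1.shape[0], A1.shape[1]) < keep_prob).astype(int)   # random 0/1 mask
A1 = A1 * D1                                                              # zero out the dropped units
A1 = A1 / keep_prob                                                       # inverted dropout: scale up the survivors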
However, since we have already applied the dropout during forward prop, the cost function for the first iteration would have taken that into account. So wouldn't dA1 already be based on the scaled and masked A1? Why do we still need to multiply dA1 by D1 and then divide it by keep_prob?
Am I missing something here?
The gradients are just the derivatives of the forward propagation steps. So if the forward function has a factor of 1/keep_prob, then the derivative will have that factor as well, right?
Also remember that dA1 is one output of the back prop calculation at layer 2, so it does not automatically have any entries zeroed by the mask. That happens as we do the back prop calculation for layer 1. dA1 is just an intermediate value that we need to calculate the gradients that we actually apply, which of course are dW1 and db1 for layer 1.
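
To make that concrete, the layer 1 back prop step with dropout looks roughly like this. This is a sketch in the style of the assignment, not the exact notebook code; the shapes and the relu choice for layer 1 are assumptions just for illustration:

import numpy as np
np.random.seed(1)
keep_prob, m = 0.8, 5
X  = np.random.randn(3, m)                   # inputs (made-up shape)
A1 = np.maximum(0, np.random.randn(4, m))    # stand-in for layer 1 activations
D1 = (np.random.rand(4, m) < keep_prob).astype(int)   # the mask saved from forward prop
W2  = np.random.randn(2, 4)
dZ2 = np.random.randn(2, m)                  # stand-in for the layer 2 gradient

dA1 = np.dot(W2.T, dZ2)                      # dA1 comes out of the layer 2 calculation, nothing masked yet
dA1 = dA1 * D1                               # apply the same mask D1 that was used in forward prop
dA1 = dA1 / keep_prob                        # and the same 1/keep_prob scaling
dZ1 = dA1 * np.int64(A1 > 0)                 # assuming layer 1 uses relu
dW1 = 1./m * np.dot(dZ1, X.T)                # these are the gradients we actually apply
db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)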
Thank you for your prompt reply! Perhaps I am still missing something here.
For example, suppose we have a vector where each value is calculated from the equation y = x^2 + 2x.
Let's say it has 5 entries, with x = 1, 2, 3, 4, 5, so v = [[3, 8, 15, 24, 35]]. After multiplying by a D1 mask and dividing by keep_prob (0.8), v(shut) = [[3.75, 0, 18.75, 0, 43.75]].
Thereafter, when we use the values of v(shut) to calculate the derivative values, aren’t they already masked and scaled accordingly?
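
In code, what I mean is something like this (just my own toy example, with a hand-picked mask):

import numpy as np
x = np.array([[1., 2., 3., 4., 5.]])
v = x**2 + 2*x                       # [[ 3.,  8., 15., 24., 35.]]
D1 = np.array([[1, 0, 1, 0, 1]])     # pretend this is the random mask
v_shut = (v * D1) / 0.8              # [[ 3.75, 0., 18.75, 0., 43.75]]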
The point is that the derivative is the derivative of a function, not a substitution of a particular set of values. The “values” are being back propagated from the “downstream” layers. Do you know what the Chain Rule is and how it works? Note that Prof Ng does not really cover the derivation of back prop because these courses are designed not to require a knowledge of calculus. Here’s a thread that covers the basic derivation of back prop for a feed forward net without dropout. Maybe based on that, it will make more sense if you contemplate how to incorporate dropout into that picture.
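
As a quick sanity check of the Chain Rule point, here is a toy numerical comparison (my own sketch, not code from the assignment). If forward prop computes A1_drop = A1 * D1 / keep_prob, then the Chain Rule gives dJ/dA1 = (dJ/dA1_drop) * D1 / keep_prob, which is exactly the mask-and-scale step in back prop:

import numpy as np
np.random.seed(0)
keep_prob = 0.8
A1 = np.random.randn(3, 4)
D1 = (np.random.rand(3, 4) < keep_prob).astype(float)
dA1_drop = np.random.randn(3, 4)         # stand-in for the gradient flowing back into the dropout step

def J(a):                                # toy scalar "cost" built on the dropped activations
    return np.sum(dA1_drop * (a * D1 / keep_prob))

dA1 = dA1_drop * D1 / keep_prob          # chain rule: same mask, same 1/keep_prob factor

eps = 1e-6
A1_plus = A1.copy()
A1_plus[0, 0] += eps
grad_num = (J(A1_plus) - J(A1)) / eps    # numerical derivative w.r.t. A1[0, 0]
print(dA1[0, 0], grad_num)               # these two agree, which is the whole point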