I’d appreciate it if someone could help me understand why the implementation of exponentially weighted averages uses VdW = B * VdW + (1 - B) * dW (as on the right side of this screenshot) and not VdW = B * Vd(W-1) + (1 - B) * dW (as on the left side of this screenshot).
There was a similar question from @hazingo a few months ago (see the Momentum Gradient Descent question), but I was not convinced by (or did not understand) the answer he got.
Thanks,
v_t = \beta v_{t - 1} + (1 - \beta) \theta_t for t > 0, with v_0 = 0.
Here, \theta_t is a quantity computed in the current timestep.
Similarly,
- dW is computed based on the current mini-batch, and
- v_{dW} is the value so far, i.e. excluding the current mini-batch.
The update equation for v_{dW} after processing the current mini-batch becomes:
v_{dW} = \beta v_{dW} + (1 - \beta) dW
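For concreteness, here is a minimal sketch (my own toy example, not the assignment code; the variable names and the gradient values are made up) of why the in-place assignment realizes the v_t = \beta v_{t - 1} + (1 - \beta) \theta_t recurrence: when the right-hand side is evaluated, v still holds the previous timestep’s value, and only then is the result stored back into v.

```python
import numpy as np

beta = 0.9
v = np.zeros(3)  # v_0 = 0: nothing accumulated yet

# Toy stream of gradients, one per mini-batch / timestep (made-up numbers).
gradient_stream = [np.array([1.0, 2.0, 3.0]),
                   np.array([0.5, 0.5, 0.5]),
                   np.array([2.0, 1.0, 0.0])]

for t, dW in enumerate(gradient_stream, start=1):
    # When the right-hand side is evaluated, v still holds v_{t-1};
    # only after the assignment does v hold v_t.
    v = beta * v + (1 - beta) * dW
    print(f"t = {t}, v = {v}")
```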
Hi @balaji.ambresh, thanks for your reply. What you explain here is exactly what I thought initially, but then in the ‘Optimization Methods’ assignment (‘update_parameters_with_momentum’ function) I got all tests to pass using:
for l in range(1, L + 1):
    v["dW" + str(l)] = … * v["dW" + str(l)] + …
(including the minimum possible to adhere to the rules while still being able to explain myself)
In this case I am using vdW1 = … * vdW1 + …
I believe that here the existing vdW1 used as input already includes the current mini-batch; isn’t that the case?
Your understanding of exponential smoothing is correct.
But there’s a difference between the timestep and the layer number. I recommend moving forward with the rest of the exercises, all the way up to and including def model. It’ll help reinforce the following facts:
- Gradients are computed based on the current mini-batch
- How the difference in time comes into play for v: the right side of the equation holds the past value, while the left side holds the current value.
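To make the timestep vs. layer-number distinction concrete, here is a rough sketch (again my own illustration, not the assignment’s code; the layer count, shapes, and random gradients are made up): the outer loop over mini-batches is what advances time, while the inner loop over layers only selects which velocity gets updated in place.

```python
import numpy as np

np.random.seed(0)
beta = 0.9
num_layers = 2          # layer count (illustrative)
num_mini_batches = 3    # each mini-batch is one "timestep"

# One velocity per layer, all starting at zero (v_0 = 0).
v = [np.zeros((2, 2)) for _ in range(num_layers)]

for t in range(1, num_mini_batches + 1):   # the mini-batch loop is what advances time...
    # Stand-in for the gradients backprop would produce on this mini-batch.
    grads = [np.random.randn(2, 2) for _ in range(num_layers)]
    for l in range(num_layers):            # ...the layer loop does not advance time
        # Right-hand side v[l]: the velocity from the previous mini-batch (the past).
        # Left-hand side v[l]: the velocity after this mini-batch (the current value).
        v[l] = beta * v[l] + (1 - beta) * grads[l]
```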