Implementing exponentially weighted averages

I’d appreciate it if someone could help me understand why the implementation of exponentially weighted averages uses v_{dW} = \beta v_{dW} + (1 - \beta) dW (as in the right side of this screenshot) and not v_{dW} = \beta v_{d(W-1)} + (1 - \beta) dW (as in the left side of this screenshot).

There was a similar question from @hazingo a few months ago (see the Momentum Gradient Descent question), but I was not convinced by (or did not understand) the answer he got.

Thanks,

v_t = \beta v_{t - 1} + (1 - \beta) \theta_t for t > 0, with v_0 = 0.
Here, \theta_t is a quantity computed in the current timestep.
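
In code, the t - 1 never needs to appear explicitly, because the running average is simply overwritten in place. Here is a minimal sketch (plain Python, made-up numbers) of the recursion above: when the right-hand side is evaluated, v still holds v_{t-1}; the assignment then turns it into v_t.

    beta = 0.9
    thetas = [10.0, 12.0, 11.0, 13.0]  # theta_1, theta_2, ... (hypothetical values)

    v = 0.0  # v_0 = 0
    for t, theta in enumerate(thetas, start=1):
        # The right-hand side reads v_{t-1}; after the assignment, v holds v_t.
        v = beta * v + (1 - beta) * theta
        print(f"t={t}: v = {v:.4f}")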

Similarly,

  1. dW is computed based on the current mini-batch and
  2. v_{dW} is the value so far i.e. excluding the current mini-batch.

The update equation for v_{dW} after processing the current mini-batch becomes:
v_{dW} = \beta v_{dW} + (1 - \beta) dW
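
In code this is an in-place assignment, so no explicit t - 1 index is needed. A small NumPy sketch (hypothetical gradient values, not the assignment code):

    import numpy as np

    beta = 0.9
    dW = np.array([[0.1, -0.2], [0.3, 0.05]])  # hypothetical gradient from the current mini-batch
    v_dW = np.zeros_like(dW)                   # average so far, i.e. excluding the current mini-batch

    # The right-hand side reads the old v_dW (the past); the assignment
    # stores the new v_dW, which now includes the current mini-batch.
    v_dW = beta * v_dW + (1 - beta) * dW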

Hi @balaji.ambresh, thanks for your reply. What you explain here is exactly what I thought initially, but then in the update_parameters_with_momentum function of the ‘Optimization Methods’ assignment, all tests passed for me using:

    for l in range(1, L + 1):
        v["dW" + str(l)] = … * v["dW" + str(l)] + …

(I’m showing the minimum possible to adhere to the honor-code rules while still being able to explain myself.)

In this case I am using vdW1 = … * vdW1 + …

I believe that the existing vdW1 used as input here includes the current mini-batch; isn’t that the case?

Your understanding of exponential smoothing is correct.

But there’s a difference between the timestep and the layer number. I recommend moving forward with the rest of the exercises, all the way up to and including def model. It’ll help reinforce the following facts:

  1. Gradients are computed based on the current mini-batch.
  2. How the difference in time comes into play for v: the right-hand side of the equation holds the past value, while the left-hand side holds the current value (see the sketch below).
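
To make the distinction concrete, here is a generic sketch (hypothetical shapes and random stand-in gradients, not the assignment solution). The outer loop over mini-batches is the timestep t; the l in v["dW" + str(l)] only selects which layer’s average is being updated. Whatever v["dW" + str(l)] holds when the right-hand side is read is its value from timestep t - 1.

    import numpy as np

    np.random.seed(0)
    beta = 0.9
    L = 2                                # number of layers (hypothetical)
    shapes = {1: (3, 2), 2: (1, 3)}      # hypothetical weight shapes per layer

    # v starts at zero for every layer; this is the state at t = 0.
    v = {"dW" + str(l): np.zeros(shapes[l]) for l in range(1, L + 1)}

    for t in range(1, 4):                # outer loop = timestep (one mini-batch per pass)
        # Stand-in for the gradients computed on the current mini-batch.
        grads = {"dW" + str(l): np.random.randn(*shapes[l]) for l in range(1, L + 1)}
        for l in range(1, L + 1):        # inner loop = layer index, not time
            # Right-hand side: v["dW" + str(l)] from timestep t - 1 (the past).
            # Left-hand side:  v["dW" + str(l)] for timestep t (the current value).
            v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]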