I got my code to pass the checker, but I don't entirely understand the math behind it.
v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]    (1)
which is supposedly the correct implementation, is the same as:
v["dW" + str(l)] = (1 - beta) * grads["dW" + str(l)]    (2)
since v["dW" + str(l)] is initialized to 0.
I tried (2) and it passed all the tests.
Should it be v["dW" + str(l-1)] for l > 1 and just 0 for l = 1, since we take the 'beta' part of the LAST momentum and give it a bit more acceleration?
Am I understanding this correctly?
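
One way to check whether (1) and (2) really coincide is to run both for more than one update. The sketch below is a minimal NumPy example, not the course's code: the helper names (`momentum_step`, `no_history_step`) and the constant gradients are made up for illustration, and it tracks the velocity of a single layer only.

```python
import numpy as np

def momentum_step(v, grad, beta):
    """Update (1): keeps beta of the previous velocity, adds (1 - beta) of the new gradient."""
    return beta * v + (1 - beta) * grad

def no_history_step(v, grad, beta):
    """Update (2): drops the beta * v term, so the previous velocity is ignored."""
    return (1 - beta) * grad

beta = 0.9
# Hypothetical gradients for one layer's dW over three successive iterations.
grads_over_time = [np.array([1.0, 2.0]) for _ in range(3)]

v1 = np.zeros(2)  # velocity under update (1), initialized to 0
v2 = np.zeros(2)  # velocity under update (2), initialized to 0
history = []
for t, g in enumerate(grads_over_time, start=1):
    v1 = momentum_step(v1, g, beta)
    v2 = no_history_step(v2, g, beta)
    history.append((v1.copy(), v2.copy()))
    print(t, v1, v2)
```

If this sketch is right, the two updates agree only on the first iteration, while v is still 0. From the second iteration on, (1) carries over beta times the same layer's previous v, which would suggest that the "last" momentum refers to the previous iteration of layer l rather than to layer l-1.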