I was just wondering whether this formula is correct, because in the weighted moving average we take the previous layer's velocity and multiply it by beta, instead of using the current layer's velocity.
Hi @Syed_Taha ,
Where do you see the previous layer's velocity in the formula?
Hi @Syed_Taha ,
The formula from your first post is the formula for gradient descent with momentum, which is one of the optimization algorithms used in machine learning. If you refer back to the video lecture, you will hear Prof Ng talk about the advantage of averaging the gradients: it helps reach the minimum faster and with less oscillation. This averaging technique is called "exponentially weighted averages". You will also hear Prof Ng explain that V_t is taken at iteration t when running a mini-batch, so the V multiplied by beta is the same layer's value from the previous iteration (t-1), not the velocity of the previous layer. Attached are a couple of screenshots for your reference.
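In case it helps, here is a minimal sketch of the update, assuming the usual form v = beta * v + (1 - beta) * dW followed by W = W - alpha * v. The names `momentum_step` and `toy_grads`, the layer keys `W1` and `W2`, and the toy loss are made up for illustration and are not from the course code. The thing to notice is that each layer keeps its own velocity, and the value on the right-hand side comes from that same layer at the previous iteration.

```python
def momentum_step(params, grads, velocity, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum update.

    params, grads and velocity are dicts keyed by layer name
    (hypothetical layout, chosen only for this illustration).
    """
    for layer in params:
        # velocity[layer] on the right-hand side is THIS layer's value
        # from the previous iteration (t-1), not another layer's velocity.
        velocity[layer] = beta * velocity[layer] + (1 - beta) * grads[layer]
        params[layer] = params[layer] - alpha * velocity[layer]
    return params, velocity


def toy_grads(params):
    # Toy loss 0.5 * w**2 per parameter, so the gradient is just w.
    return {name: w for name, w in params.items()}


params = {"W1": 1.0, "W2": -2.0}
velocity = {name: 0.0 for name in params}  # velocities start at zero
history = []

for t in range(1, 4):  # t plays the role of the mini-batch iteration
    grads = toy_grads(params)
    params, velocity = momentum_step(params, grads, velocity)
    history.append(dict(velocity))  # snapshot of V_t for every layer
```

Each entry of `history` holds V_t for both layers at one iteration. The velocity for `W1` never enters the update for `W2`, which is why the formula does not mix layers.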