I have a small doubt… Suppose I have 5 layers in my NN and I am using Gradient Descent with Momentum and mini-batches…
My understanding is that for each mini-batch that's passed, we update the weights of each layer individually using that layer's own weighted average. The part I'm unsure about is this: we compute an exponentially weighted average of roughly the last 10 gradients (since beta = 0.9) for each layer and use it to update that layer's weights. So when I am at layer 3, I take the weighted average of layer 3's last ~10 gradients and subtract it (scaled by the learning rate). Is that right, or are we mixing the gradients between the layers? A small sketch of what I mean is below.
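Here is a minimal sketch of my understanding (the shapes, learning rate, and the `fake_gradients` helper are made up just for illustration, not from any real framework): each layer keeps its own velocity, and that velocity is updated only from that layer's gradients.

```python
import numpy as np

beta = 0.9   # momentum coefficient (so roughly the last ~10 gradients matter)
lr = 0.01    # learning rate

# 5 layers, each with its own weight matrix (random shapes, just for the example)
weights = [np.random.randn(4, 4) for _ in range(5)]
velocities = [np.zeros_like(W) for W in weights]   # one separate velocity per layer

def fake_gradients(weights):
    # stand-in for backprop on one mini-batch
    return [np.random.randn(*W.shape) for W in weights]

for step in range(100):                 # one update per mini-batch
    grads = fake_gradients(weights)     # dW for every layer, this batch only
    for l in range(len(weights)):
        # layer l's velocity uses ONLY layer l's gradient -- no mixing across layers
        velocities[l] = beta * velocities[l] + (1 - beta) * grads[l]
        weights[l] -= lr * velocities[l]
```

Is this per-layer bookkeeping the correct picture of momentum, or does the averaging somehow combine gradients from different layers?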