Do we use the value of W from the previous batch for the next batch in mini-batch gradient descent?

In mini-batch gradient descent, do we use the updated W from the previous batch as the input to the next batch, or do we calculate W for every batch and then take the average?

Yes, and that is why mini-batch GD is useful in most cases: we apply the updates to the parameters (all the W and b values) after every mini-batch. So the updates happen more quickly and, even though they may have a bit more statistical “noise” in them, the overall convergence happens after fewer total “epochs” or passes through the entire training dataset.
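
To make the sequencing concrete, here is a minimal Python sketch of one epoch of mini-batch GD (`batches` and `compute_gradients` are placeholder names, not from the course code). The point is that W and b carry straight over from one update to the next; there is no averaging of per-batch W values:

```python
def minibatch_gd_epoch(W, b, batches, alpha, compute_gradients):
    """One epoch of mini-batch gradient descent.

    W and b carry over from batch to batch; each update starts from
    the parameters left by the previous mini-batch.
    """
    for X_batch, Y_batch in batches:
        # Gradients are computed on the current mini-batch only.
        dW, db = compute_gradients(W, b, X_batch, Y_batch)
        # Apply the update immediately, before seeing the next batch.
        W = W - alpha * dW
        b = b - alpha * db
    return W, b
```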

With these exponentially weighted averages, does it mean that

W[next batch] = beta * W[previous batch] + (1 - beta) * W[current batch]

and that beta adjusts the number of previous batches to consider?
Also, why are we using dW instead of W? We want the best value of W after every mini-batch, and dW only tells us the change.

The slide you are showing is the generic implementation of exponentially weighted averages. The theta values there are temperatures sampled at some time interval.
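
For reference, the recurrence on that slide (with $\theta_t$ the sampled temperature at time $t$ and $v_t$ the running average) is:

$$v_t = \beta \, v_{t-1} + (1 - \beta)\,\theta_t, \qquad v_0 = 0$$

where $\beta$ controls the effective window: roughly the last $1/(1-\beta)$ values contribute meaningfully to the average (e.g. $\beta = 0.9$ averages over about the last 10 values).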

If the question is about how we use EWAs to implement minibatch gradient descent, here’s a more directly relevant slide from the lecture on Momentum:

There the inputs are dW, computed from the current minibatch, and W, which is the updated W from the last minibatch. We then compute a “smoothed” version of dW by taking the EWA of the last few values of dW: vdW holds that running average, and we extend it with the new dW from this minibatch. Finally, we apply the smoothed gradient in the update formula to get the new value of W after the current minibatch, and similarly for b and db.
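
As a minimal sketch of that update (a standard gradient-descent-with-momentum step consistent with the description above; `alpha` stands for the learning rate):

```python
def momentum_update(W, b, dW, db, vdW, vdb, beta, alpha):
    """One parameter update with momentum.

    dW, db: gradients from the current mini-batch.
    vdW, vdb: EWAs of past gradients, carried across mini-batches.
    """
    # Smooth the current gradient with the EWA of previous gradients.
    vdW = beta * vdW + (1 - beta) * dW
    vdb = beta * vdb + (1 - beta) * db
    # Apply the smoothed gradients in the usual update rule.
    W = W - alpha * vdW
    b = b - alpha * vdb
    return W, b, vdW, vdb
```

Note that vdW and vdb are returned along with W and b: like the parameters themselves, the running averages persist from one mini-batch to the next.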