I understand that dw1, dw2, etc. are only final after looping over all m examples, and then w1 = w1 - a*dw1 is applied. But that is just one step, so how do w1 etc. get fully optimised? Are different sets of m training examples taken?

Hi,

Yes, you are correct: the weight of each neuron is only updated once per iteration. That's why optimizing J usually takes multiple passes (epochs) over the same m training examples.
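To make the "one step per pass" idea concrete, here is a minimal sketch in plain Python. It assumes logistic regression with two features (so the weights are w1 and w2, as in the question) and uses a tiny made-up dataset; the function name `batch_gradient_descent` and the default values for `alpha` and `num_iterations` are just illustrative choices, not anything from the course code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_gradient_descent(X, y, alpha=0.1, num_iterations=200):
    """X: list of (x1, x2) pairs, y: list of 0/1 labels (assumed toy setup)."""
    m = len(X)
    w1, w2, b = 0.0, 0.0, 0.0
    costs = []
    for _ in range(num_iterations):
        # One iteration = one full pass over the SAME m examples.
        J = dw1 = dw2 = db = 0.0
        for (x1, x2), label in zip(X, y):
            a = sigmoid(w1 * x1 + w2 * x2 + b)
            J += -(label * math.log(a) + (1 - label) * math.log(1 - a))
            dz = a - label
            dw1 += x1 * dz
            dw2 += x2 * dz
            db += dz
        # Average the accumulated gradients, then take a single step.
        w1 -= alpha * dw1 / m
        w2 -= alpha * dw2 / m
        b -= alpha * db / m
        costs.append(J / m)
    return w1, w2, b, costs


# Hypothetical toy data: the logical AND of two inputs.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 0, 1]
w1, w2, b, costs = batch_gradient_descent(X, y)
print(costs[0], costs[-1])  # the cost should go down over the iterations
```

Each pass through the outer loop reuses the same examples; the weights improve because the gradients are recomputed at the new values of w1, w2 and b.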

When m is large, this is not very efficient, because you have to wait a long time (a full pass over all examples) to take only one step. In that case we usually prefer mini-batch gradient descent, which splits the m examples into multiple smaller mini-batches and updates the weights after each one. Under suitable conditions (for example, a reasonable learning rate) it still converges, and you get initial results much faster. (This topic is covered later in the course.)
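As a rough sketch of the mini-batch idea, again assuming two-feature logistic regression: the function name `minibatch_gradient_descent`, the `batch_size` and `seed` parameters, and the toy data are all illustrative assumptions. The only real change from the full-batch version is that the weights are updated after every mini-batch instead of after every full pass.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=2,
                               num_epochs=100, seed=0):
    """Returns the weights and the total number of update steps taken."""
    rng = random.Random(seed)
    w1, w2, b = 0.0, 0.0, 0.0
    num_updates = 0
    indices = list(range(len(X)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # new random split into mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            dw1 = dw2 = db = 0.0
            for i in batch:
                x1, x2 = X[i]
                dz = sigmoid(w1 * x1 + w2 * x2 + b) - y[i]
                dw1 += x1 * dz
                dw2 += x2 * dz
                db += dz
            # Step right away, using only this mini-batch's average gradient.
            n = len(batch)
            w1 -= alpha * dw1 / n
            w2 -= alpha * dw2 / n
            b -= alpha * db / n
            num_updates += 1
    return w1, w2, b, num_updates


# Hypothetical toy data: the logical AND of two inputs.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 0, 1]
w1, w2, b, num_updates = minibatch_gradient_descent(X, y)
print(num_updates)  # 4 examples / batch_size 2 = 2 steps per epoch
```

With m examples and a given batch size, one epoch now gives ceil(m / batch_size) steps instead of one, which is where the faster initial progress comes from.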
