I understand how gradient descent works with a single training example, but I’m struggling to grasp how it operates across an entire training set. I’m also confused about some terms like “iteration” and “epoch”. Can someone confirm if my understanding below is correct?
For batch gradient descent:
1. initialize all weights and biases in all layers of the neural network
2. for each training example in the data, calculate the loss function, which is a single number
3. take the average of all loss functions, which is called the cost function. It is also a single number
4. calculate gradients of the cost function with respect to each parameter (each weight and bias), then update parameters accordingly. This is the end of an “iteration”, and also an “epoch” (for batch gradient descent)
5. repeat steps 2 through 4 until convergence (or a stopping criterion is met)
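For anyone reading along, here is a minimal NumPy sketch of the five steps above. It assumes a single linear layer with a squared-error loss instead of a full neural network, and the function name, learning rate, and toy data are all made up for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, num_epochs=200):
    """Hypothetical helper. X: (m, n) features, y: (m,) targets."""
    m, n = X.shape
    w = np.zeros(n)                       # step 1: initialize parameters
    b = 0.0
    costs = []
    for epoch in range(num_epochs):       # step 5: repeat steps 2 through 4
        errors = X @ w + b - y
        losses = 0.5 * errors ** 2        # step 2: one loss value per example
        cost = losses.mean()              # step 3: cost = average of the losses
        grad_w = X.T @ errors / m         # step 4: gradients of the cost,
        grad_b = errors.mean()            #         averaged over all examples
        w -= learning_rate * grad_w       # step 4: update the parameters
        b -= learning_rate * grad_b
        costs.append(cost)
    return w, b, costs

# Toy data (an assumption for this sketch): y = 2*x1 - 1*x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
w, b, costs = batch_gradient_descent(X, y)
print(w, b, costs[0], costs[-1])
```

Each pass through the loop uses all 100 examples once, so in this setting one iteration and one epoch are the same thing.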
Could you elaborate on “compute the gradients for each example, and sum them all together”? Do you mean to sum the gradients of each parameter for each example and what do you do after getting the sum?
The gradients are used by an algorithm like gradient descent to modify the weights and biases so that the next iteration will be closer to the minimum cost.
Those terms are relevant once we start doing minibatch gradient descent, which has not been introduced yet here in DLS C1. We’ll learn about that in DLS C2. But here’s a quick explanation:
We take the entire training set and divide it randomly into smaller subsets called “minibatches”. Typically we choose a minibatch size that is relatively small like 16 or 32 samples per minibatch. But we can go all the way to minibatch size of 1, which is called “stochastic” gradient descent.
Then one “iteration” consists of running forward propagation with one minibatch and then computing the gradients of the cost w.r.t. all the parameters. The cost is the average of the loss values over all the samples in the minibatch. So that means the gradients of the cost are the average of the per-sample gradients of the loss. The derivative of the average is the average of the derivatives, right? Think about that for a minute and it should make sense. Taking an average is a “linear” operation and taking derivatives is also linear. Then we apply the gradients to update the parameters. And move on to the next iteration …
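If you want to convince yourself numerically that the derivative of the average is the average of the derivatives, here is a small check. It is only a sketch: the squared-error loss, the random minibatch of 8 samples, and the helper names are assumptions chosen to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))   # one minibatch of 8 samples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

def per_sample_gradient(x_i, y_i, w):
    # Gradient of the single-sample loss 0.5 * (x_i . w - y_i)^2 w.r.t. w
    return (x_i @ w - y_i) * x_i

# Average of the per-sample gradients
avg_of_grads = np.mean(
    [per_sample_gradient(X[i], y[i], w) for i in range(len(y))], axis=0
)

def cost(w):
    # Cost = average of the per-sample losses
    return np.mean(0.5 * (X @ w - y) ** 2)

# Gradient of the averaged cost, estimated by central differences
eps = 1e-6
grad_of_avg = np.array([
    (cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(avg_of_grads, grad_of_avg, atol=1e-6))
```

The first quantity averages gradients that were computed one sample at a time; the second differentiates the already-averaged cost. They agree up to rounding error.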
Then one “epoch” of training is a complete pass through all the minibatches in the entire training set.
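To make the iteration vs. epoch distinction concrete, here is a rough sketch of a minibatch training loop that counts iterations. It again assumes a simple linear model with squared-error loss, and `minibatch_indices` and `train` are hypothetical names, not anything from the course notebooks.

```python
import numpy as np

def minibatch_indices(num_samples, batch_size, rng):
    # Shuffle the sample indices, then cut them into consecutive minibatches.
    permutation = rng.permutation(num_samples)
    return [permutation[start:start + batch_size]
            for start in range(0, num_samples, batch_size)]

def train(X, y, batch_size=32, num_epochs=5, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    iterations = 0
    for epoch in range(num_epochs):                       # one epoch = one full pass
        for idx in minibatch_indices(m, batch_size, rng):
            errors = X[idx] @ w + b - y[idx]              # forward pass on one minibatch
            w -= learning_rate * (X[idx].T @ errors / len(idx))
            b -= learning_rate * errors.mean()
            iterations += 1                               # one iteration = one update
    return w, b, iterations

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

for batch_size in (100, 32, 1):
    _, _, iterations = train(X, y, batch_size=batch_size, num_epochs=5)
    print(batch_size, iterations)
```

With 100 samples and 5 epochs, a batch size of 100 (batch gradient descent) gives 5 iterations, a batch size of 32 gives 4 minibatches per epoch and so 20 iterations, and a batch size of 1 (stochastic gradient descent) gives 500 iterations.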
Well, I have to be a little careful here: what I’m explaining is Prof Ng’s terminology in DLS C2. I have not studied the various TF optimization methods in any detail. They have one called SGD, but I don’t think that is the same thing that Prof Ng means by Stochastic Gradient Descent. From a cursory reading of the docpage, I think the TF SGD optimizer is plain gradient descent with an optional momentum setting (off by default), and it does not by itself fix the minibatch size at 1.
What Prof Ng describes in DLS C2 is that you have two independent choices:
1. The size of your minibatches.
2. The optimization method you choose to smooth the gradient updates.
Of course on item 2), one choice is just to apply the gradients as directly computed from the minibatches without any smoothing.
At least that is my interpretation of what he said.
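Here is a small sketch of how those two choices can be kept separate in code. The batch size is an argument of the training loop, while the update rule is a function you pass in. The momentum form used here (an exponentially weighted average of the gradients with `beta`) is my reading of the course version, and all the names and hyperparameters are assumptions for illustration.

```python
import numpy as np

def plain_update(param, grad, state, learning_rate=0.1):
    # No smoothing: apply the minibatch gradient directly.
    return param - learning_rate * grad, state

def momentum_update(param, grad, state, learning_rate=0.1, beta=0.9):
    # Smooth the gradients with an exponentially weighted average.
    velocity = beta * state + (1.0 - beta) * grad
    return param - learning_rate * velocity, velocity

def run(update_fn, X, y, batch_size, num_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    state = np.zeros(n)                    # only used by momentum_update
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the average squared-error loss over this minibatch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w, state = update_fn(w, grad, state)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = X @ np.array([2.0, -1.0])

# Choice 1 (batch size) and choice 2 (update rule) vary independently.
for batch_size in (1, 16, 64):
    for update_fn in (plain_update, momentum_update):
        print(batch_size, update_fn.__name__, run(update_fn, X, y, batch_size))
```

Any batch size can be paired with either update rule, which is the point: the batch size only changes which samples feed each gradient, and the update rule only changes what is done with that gradient.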
I think your steps are fine. I want to highlight that computing the cost value and computing the gradients are two different paths using two different equations (though the gradient equation is derived from the cost equation). Consequently, your step 3 and step 4 are independent of each other: step 4 does not use the number produced in step 3. Just as you average over all samples for the cost, as Tom said, we also need to average over all samples for the gradient.