In the neural network video titled “Training Details”, Andrew says:

Given this definition of a loss function, we then define the cost function. The cost function was a function of the parameters W and b, and that was just the average, that is, taking an average over all m training examples of the loss function computed on the m training examples (x^{(1)}, y^{(1)}) through (x^{(m)}, y^{(m)}). And remember that, in the convention we’re using, the loss function is a function of the output of the learning algorithm and the ground-truth label, as computed over a single training example, whereas the cost function J is an average of the loss function computed over your entire training set.

What is he trying to say?

I think Andrew is trying to show this flow:

(0) given a dataset (\vec{x}^{(1)}, y^{(1)}), (\vec{x}^{(2)}, y^{(2)}), \dots, (\vec{x}^{(m)}, y^{(m)}), and

(1) we have a model f_{\vec{w}, b}(\vec{x}), then

(2a) we can calculate the loss of each data sample, L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)}), then

(2b) we can average them to get the cost over all the samples, J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}{L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)})}, then

(3) with the cost specified, we can train the model by finding the parameters \vec{w} and b that minimize the cost.

So we can see how each next step uses the result from the previous step. The transcript you quoted focuses on (2a) and (2b). I think he is giving us an overview of the training steps he is going into next, as the training details involve a lot of code, but we want to remember the whole idea so we don’t get lost in the code.
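To make (2a) and (2b) concrete, here is a minimal Python sketch of the flow above. It assumes a linear model and squared-error loss (the course also uses logistic loss for classification); the function names and toy numbers are made up for illustration:

```python
import numpy as np

def f(w, b, x):
    # Step (1): the model's prediction for one example, f_{w,b}(x) = w . x + b
    return np.dot(w, x) + b

def loss(y_hat, y):
    # Step (2a): the loss for a SINGLE example; squared error is assumed here
    return (y_hat - y) ** 2

def cost(w, b, X, Y):
    # Step (2b): the cost J(w, b) averages the per-example losses over all m
    m = X.shape[0]
    return sum(loss(f(w, b, X[i]), Y[i]) for i in range(m)) / m

# Step (0): a made-up dataset of m = 3 samples with 2 features each
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([5.0, 11.0, 17.0])

w, b = np.array([1.0, 2.0]), 0.0
print(cost(w, b, X, Y))  # 0.0 -- these w, b fit the toy data exactly
```

The key point is that `loss` looks at one example at a time, while `cost` averages `loss` over all m examples, which is exactly the distinction Andrew draws in the quote.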

Let us know if you have more questions

Cheers!

Hey @roystonlmq,

Just to add to Raymond’s explanation: in this part of the lecture video, Prof Andrew wants to emphasize the **difference between the cost function and the loss function**. The terminology is slightly controversial in the research community, but Prof Andrew has presented the convention he will use throughout the specialization, and it is as follows:

**Loss function** represents the error for a single example, i.e., the difference between the prediction and the target for that example, denoted L, whereas the **cost function** represents the average error over the entire training set, i.e., \frac{1}{m} \sum_{i=1}^{m} L^{(i)}.
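As a quick numeric illustration (with made-up losses): if m = 3 and the per-example losses are L^{(1)} = 0.2, L^{(2)} = 0.5, L^{(3)} = 0.2, then the loss of the second example alone is 0.5, while the cost is J = (0.2 + 0.5 + 0.2) / 3 = 0.3.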

Now, just to cover all the bases: this definition of the cost function remains valid as long as we are using **batch gradient descent** (*which is what the course uses*), i.e., using the entire training set for a single iteration of gradient descent. If we are using something known as **mini-batch gradient descent**, i.e., using only a subset of the dataset for a single iteration of gradient descent (*something which hasn’t been discussed in the course*), then the average is taken over the errors of the considered subset of the training set, instead of the entire training set.
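To illustrate what the average is taken over in each case, here is a minimal sketch, assuming the same linear-model, squared-error setup as in Raymond’s example above; the helper names and `batch_size` are made up for illustration:

```python
import numpy as np

def full_batch_cost(w, b, X, Y):
    # Batch gradient descent: the cost averages the loss over ALL m examples.
    return np.mean((X @ w + b - Y) ** 2)

def minibatch_cost(w, b, X, Y, batch_size, rng):
    # Mini-batch gradient descent: each iteration averages the loss over a
    # randomly chosen SUBSET of the training set instead.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return np.mean((X[idx] @ w + b - Y[idx]) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # 100 examples, 2 features
Y = X @ np.array([1.0, 2.0]) + 0.5      # targets from a known linear rule
w, b = np.array([0.9, 1.8]), 0.4        # slightly-off parameters

print(full_batch_cost(w, b, X, Y))           # average over all 100 examples
print(minibatch_cost(w, b, X, Y, 16, rng))   # average over a random 16
```

The only thing that changes is the set of indices being averaged over: all of them for batch gradient descent, a random subset for mini-batch.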

**P.S. - You will find a brief mention of the concept of batches in one of the MLS C2 Assignments.**

Feel free to ignore the last part if it doesn’t make much sense to you yet. I hope this helps.

Regards,

Elemento


Thank you all. I appreciate the effort put into both posts!
