In the neural network video titled “Training Details”, Andrew says:

Given this definition of a loss function, we then define the cost function. The cost function was a function of the parameters W and b, and that was just the average, that is, taking an average over all m training examples of the loss function computed on the m training examples (x^{(1)}, y^{(1)}) through (x^{(m)}, y^{(m)}). And remember that, in the convention we’re using, the loss function is a function of the output of the learning algorithm and the ground-truth label, as computed over a single training example, whereas the cost function J is an average of the loss function computed over your entire training set.

What is he trying to say?

I think Andrew is trying to show this flow:

(0) given a dataset (\vec{x}^{(1)}, y^{(1)}), (\vec{x}^{(2)}, y^{(2)}), \dots, (\vec{x}^{(m)}, y^{(m)}), and

(1) we have a model f_{\vec{w}, b}(\vec{x}), then

(2a) we can calculate the loss of each data sample, L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)}), then

(2b) we can average them to get the cost over all the samples, J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}{L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)})}, then

(3) with the cost specified, we can train the model by finding the parameters \vec{w} and b that minimize the cost.

So we can see how each next step uses the result from the previous step. The transcript you quoted focuses on (2a) and (2b). I think he is giving us an overview of the training steps he is going into next, as the training details involve a lot of code, but we want to remember the whole idea so we don’t get lost in the code.
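To make (2a) and (2b) concrete, here is a minimal Python sketch of the flow above. It assumes a linear model and squared-error loss (the course also uses logistic loss for classification); the function names and toy numbers are made up for illustration:

```python
import numpy as np

def f(w, b, x):
    # Step (1): the model's prediction for one example, f_{w,b}(x) = w . x + b
    return np.dot(w, x) + b

def loss(y_hat, y):
    # Step (2a): the loss for a SINGLE example; squared error is assumed here
    return (y_hat - y) ** 2

def cost(w, b, X, Y):
    # Step (2b): the cost J(w, b) averages the per-example losses over all m
    m = X.shape[0]
    return sum(loss(f(w, b, X[i]), Y[i]) for i in range(m)) / m

# Step (0): a made-up dataset of m = 3 samples with 2 features each
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([5.0, 11.0, 17.0])

w, b = np.array([1.0, 2.0]), 0.0
print(cost(w, b, X, Y))  # 0.0 -- these w, b fit the toy data exactly
```

The key point is that `loss` looks at one example at a time, while `cost` averages `loss` over all m examples, which is exactly the distinction Andrew draws in the quote.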

Let us know if you have more questions

Cheers!

Hey @roystonlmq,

Just to add to Raymond’s explanation: in this part of the lecture video, Prof Andrew wants to emphasize the **difference between the cost function and the loss function**. The terminology is slightly controversial in the research community, but Prof Andrew has presented the convention he will use throughout the specialization, and it is as follows:

**Loss function** represents the error for a single example, i.e., the difference between the prediction and the target for that example, denoted L, whereas the **cost function** represents the average error over the entire training set, i.e., \frac{1}{m} \sum_{i=1}^{m} L^{(i)}.
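As a quick numeric illustration (with made-up losses): if m = 3 and the per-example losses are L^{(1)} = 0.2, L^{(2)} = 0.5, L^{(3)} = 0.2, then the loss of the second example alone is 0.5, while the cost is J = (0.2 + 0.5 + 0.2) / 3 = 0.3.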

Now, just to cover all the bases: this definition of the cost function remains valid as long as we are using **batch gradient descent** (*which is what the course uses*), i.e., using the entire training set for a single iteration of gradient descent. If we are using something known as **mini-batch gradient descent**, i.e., using only a subset of the dataset for a single iteration of gradient descent (*something which hasn’t been discussed in the course*), then the average is taken over the errors of the considered subset of the training set, instead of the entire training set.
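To illustrate what the average is taken over in each case, here is a minimal sketch, assuming the same linear-model, squared-error setup as in Raymond’s example above; the helper names and `batch_size` are made up for illustration:

```python
import numpy as np

def full_batch_cost(w, b, X, Y):
    # Batch gradient descent: the cost averages the loss over ALL m examples.
    return np.mean((X @ w + b - Y) ** 2)

def minibatch_cost(w, b, X, Y, batch_size, rng):
    # Mini-batch gradient descent: each iteration averages the loss over a
    # randomly chosen SUBSET of the training set instead.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return np.mean((X[idx] @ w + b - Y[idx]) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # 100 examples, 2 features
Y = X @ np.array([1.0, 2.0]) + 0.5      # targets from a known linear rule
w, b = np.array([0.9, 1.8]), 0.4        # slightly-off parameters

print(full_batch_cost(w, b, X, Y))           # average over all 100 examples
print(minibatch_cost(w, b, X, Y, 16, rng))   # average over a random 16
```

The only thing that changes is the set of indices being averaged over: all of them for batch gradient descent, a random subset for mini-batch.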

**P.S. - You will find a brief mention of the concept of batches in one of the MLS C2 Assignments.**

Feel free to ignore the last part if it doesn’t make much sense to you yet. I hope this helps.

Regards,

Elemento


Thank you all. I appreciate the effort put into both posts!
