Clarity on Train/Test procedure

Looking at the first equation, which we should use to minimize the cost function, I noticed we are summing from i = 1 to i = m_train. This implies that only a particular percentage of the examples will be used to fit the model.

But the confusing part is that this summation starts from (X^[i=1], y^[i=1]) and steps through consecutive values of i up to (X^[i=m_train], y^[i=m_train]). These inputs might differ from X_train and y_train, since the latter are selected randomly. Is that exactly what the instructor intended? What would make the features and targets used for fitting different from those used to compute the training error?


Hello @Basit_Kareem!

You are right that we can draw training data randomly from a larger data pool. Looking at just the slide, I think it assumes the data has already been separated into a training set and a test set, and that the samples in each set are labelled starting from 1. So sample 1 in the training set is different from sample 1 in the test set.

Fitting the model and computing the training error both use the training set. Computing the test error uses the test set.


@Basit_Kareem, let me just go through the procedure to make sure we are on the same page.

We start with the full set of data. Each sample’s X and y are paired up, and they remain paired up all the time, so we won’t find a wrong y for any X.

Then we randomly pick a portion of the full set (say 80%) as the training data (both the X values and their respective y values), and the remainder is left as the test set.

The test set is then held out from the whole model-building process, including hyperparameter tuning (deciding which polynomial features to add, for example, counts as tuning).

In hyperparameter tuning we further split the training data into a smaller training set and a cv set. The sampling process is again random. Then, let’s say we have 10 different configurations of hyperparameters: we train each configuration on the smaller training set, and then evaluate each of them on the cv set.

We pick the one configuration with the best cv performance, and lastly evaluate it on the test set for the final report of your model performance.

The slide isn’t about hyperparameter tuning, so it didn’t split the whole dataset into train/cv/test sets.
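The splitting procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `split`, the toy data, the seeds, and the 80%/75% fractions are all hypothetical choices, not something taken from the slide.

```python
import random

def split(samples, fraction, seed):
    """Shuffle paired (x, y) samples and split them into two disjoint lists."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: x and y are zipped so each pair stays together.
x = [1, 6, 12, 20, 9, 3, 0, 15, 7, 4]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
full = list(zip(x, y))

# 80% for model building, 20% held out as the test set.
train_full, test = split(full, 0.8, seed=0)

# For hyperparameter tuning only: split the training data again.
train, cv = split(train_full, 0.75, seed=1)

print(len(train), len(cv), len(test))
```

Each candidate configuration would be fitted on `train` and compared on `cv`; `test` is touched once, at the very end.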


1 Like

But the first equation doesn’t indicate that it’s x_train and y_train that will be used. Since it just writes y, one could assume that the entire dataset is intended to be used for training.

Ah, I see. So the thing is there is NO “train” in the subscripts of X and y in the first equation, right? I believe it is implied, because the summation in the first equation goes from 1 to m_{train}, so the samples have got to come from the training data. I agree it would be better if the slide had put “train” in the subscripts, but otherwise my reading is the only possibility. We can never mix any test data into the training process; otherwise, it is no longer test data.



Also, if you go one slide back, you see Andrew explicitly said we need to split the dataset into 2. And since we have split it, why would we mix it afterwards? Right?
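For reference, this is how the first equation might read with the “train” subscript written out. It is a sketch that assumes the slide uses the regularized squared-error cost for a model f with parameters w and b and a regularization strength λ; the slide itself is not reproduced in this thread, so the exact notation is an assumption.

```latex
J(\vec{w}, b) = \frac{1}{2 m_{train}} \sum_{i=1}^{m_{train}}
  \left( f_{\vec{w},b}\!\left(\vec{x}_{train}^{(i)}\right) - y_{train}^{(i)} \right)^2
  + \frac{\lambda}{2 m_{train}} \sum_{j=1}^{n} w_j^2
```

Written this way, the index i runs over positions within the training set, not over positions in the full dataset.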

Well, imagine I have my dataset stored in an m X n+1 matrix (the (n+1)th column holds the targets) called mput. I then split the dataset into 2 sets named mput_train and mput_test.

mput_train contains m_train randomly chosen rows (an m_train X n+1 matrix) from the larger mput dataset, while mput_test = mput(~ismember(mput, mput_train, 'rows'), :).

Then you go on to define a function which computes the cost function for a given dataset. Where you are supposed to pass mput_train, you pass mput.

The code would run successfully should you be using a for loop for the summation but the answer won’t be right.

So the point is to pass mput_train. Nobody will pass mput. Passing mput is wrong.


If the slide had put “train” in the subscript of X and y, are all your problems gone?

To illustrate,

x = [1, 6, 12, 20, 9, 3, 0]
y = [1, 2, 3, 4, 5, 6, 7]
m = len(y)  # 7

x_train = [1, 12, 9, 3]
y_train = [1, 3, 5, 6]
m_train = len(y_train)  # 4

sum_ytrain = 0
for j in range(m_train):
    sum_ytrain += y[j]  # indexes y, not y_train

Do you notice I passed y, which has 7 elements, into the for loop instead of y_train, which has 4 elements?

Now, the result I will get for sum_ytrain is 1+2+3+4 = 10, whereas what I needed was 1+3+5+6 = 15.

So this leads to my question, was it a mistake or what the instructor actually intended?
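To make the difference concrete, here is the same loop written both ways, reusing the y and y_train values from the example above. The name `sum_wrong` is introduced only for this illustration.

```python
y = [1, 2, 3, 4, 5, 6, 7]
y_train = [1, 3, 5, 6]
m_train = len(y_train)

# Indexing the full list: adds the first m_train entries of y.
sum_wrong = sum(y[j] for j in range(m_train))

# Indexing the training list: adds exactly the training targets.
sum_ytrain = sum(y_train[j] for j in range(m_train))

print(sum_wrong, sum_ytrain)  # 10 15
```

Both loops run without error, which is why the mistake is easy to miss; only the second one sums over the training set.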

Then, I will be sure that’s what the professor meant. But if the statement is right as he wrote it, then I will know that’s how I should implement it.

OK. I think this is a clearer representation:

Parameter fitting is done by minimizing the cost function on the training set. Only training data is used there. Test data has no place there.

The main confusion relates to a question I once asked here regarding evaluating the performance of a training dataset.

It doesn’t make sense to me: I minimise the error on a set of numbers to get the parameters, then use those parameters to compute y_hat, and now I need to compute the cost function again? Won’t that just give me the cost from the last iteration, the one that produced my optimized parameters? That would simply be the minimum of the cost function.

In short, by this reasoning, the training dataset will always perform well.

Okay. Now, this equation and the third equation are the same except for the regularization component, which isn’t compulsory?

So why use a dataset to train a model and then find its error again?

Regularization isn’t compulsory. It is a tool that is up to you to use it or not. We use regularization if we observe an overfitting problem.

There are 2 ways of using the training data and testing data. Here comes the first one:

  1. Split a full dataset into TRAINING data and TESTING data.

  2. Put away the TESTING data for the time being. Nobody should touch it.

  3. Fit parameters with TRAINING data. Compute the cost value on the TRAINING data.

  4. After we finish training, take the TESTING data back, but put away the TRAINING data.

  5. We compute the cost function on the TESTING data.

  6. Why do we compute SEPARATELY the cost on the TRAINING data and the cost on the TESTING data? Because we want to make a plot like the following so that we can inspect if we are having a high bias and/or a high variance problem. (Refer to videos in C2 W3 for details)
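The first procedure can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes a one-feature linear model fitted by plain batch gradient descent without regularization, and the names `cost` and `fit`, the toy data, the learning rate, and the step count are all hypothetical.

```python
def cost(w, b, data):
    """Squared-error cost 1/(2m) * sum (w*x + b - y)^2 over the given data only."""
    m = len(data)
    return sum((w * x + b - y) ** 2 for x, y in data) / (2 * m)

def fit(data, lr=0.1, steps=2000):
    """Batch gradient descent on whatever data is passed in (the training set)."""
    w, b = 0.0, 0.0
    m = len(data)
    for _ in range(steps):
        dw = sum((w * x + b - y) * x for x, y in data) / m
        db = sum((w * x + b - y) for x, y in data) / m
        w -= lr * dw
        b -= lr * db
    return w, b

# Hypothetical toy data, already split; y is roughly 2*x + 1.
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
test = [(1.5, 4.1), (3.5, 7.9)]

w, b = fit(train)               # step 3: fit on TRAINING data only
j_train = cost(w, b, train)     # step 3: cost on TRAINING data
j_test = cost(w, b, test)       # step 5: cost on TESTING data

print(w, b, j_train, j_test)
```

The two costs are computed by the same function on two disjoint sets, and `fit` never sees `test`.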

Here comes the second one:

  1. Split a full dataset into TRAINING data and TESTING data.

  2. Let’s say you want to run N iterations in training your model.

  3. This marks the beginning of the 0-th iteration.

  4. Take the TRAINING data in, but put away the TESTING data.

  5. Fit parameters with the TRAINING data. This time, compute the cost on TRAINING data after the fitting.

  6. Put away TRAINING data, and take the TESTING data back.

  7. Compute the cost on TESTING data.

  8. Why do we compute SEPARATELY the cost on the TRAINING data and the cost on the TESTING data? Because we want to keep track of the cost trends in both TRAINING and TEST data along the iterations so that we can discover EARLIER if there could be any bias or variance problem. In simple words, we don’t want to wait until the end of the training process. Also, we will know at which iteration we start to have a problem and then consider whether we should stop earlier. This is called “early stopping”.

  9. Go back to step 3 to start the next iteration.
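The second procedure can be sketched as a loop that records both costs at every iteration. Again this is only an illustration under assumptions: a one-feature linear model, one gradient-descent step per iteration, and a simple stopping rule controlled by a hypothetical `patience` value. None of these names or numbers come from the course.

```python
def cost(w, b, data):
    """Squared-error cost 1/(2m) * sum (w*x + b - y)^2 over the given data only."""
    m = len(data)
    return sum((w * x + b - y) ** 2 for x, y in data) / (2 * m)

def gradient_step(w, b, data, lr):
    """One batch gradient-descent update computed from the given data only."""
    m = len(data)
    dw = sum((w * x + b - y) * x for x, y in data) / m
    db = sum((w * x + b - y) for x, y in data) / m
    return w - lr * dw, b - lr * db

# Hypothetical toy data, already split; y is roughly 2*x + 1.
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
test = [(1.5, 4.1), (3.5, 7.9)]

w, b = 0.0, 0.0
train_costs, test_costs = [], []
patience = 5                      # assumed: stop after 5 iterations without improvement
best_test, since_best = float("inf"), 0

for iteration in range(500):
    w, b = gradient_step(w, b, train, lr=0.1)   # steps 4-5: TRAINING data only
    train_costs.append(cost(w, b, train))       # step 5: cost on TRAINING data
    test_costs.append(cost(w, b, test))         # steps 6-7: cost on TESTING data

    # Step 8: watch the trend and stop early if the test cost stops improving.
    if test_costs[-1] < best_test:
        best_test, since_best = test_costs[-1], 0
    else:
        since_best += 1
        if since_best >= patience:
            break

print(len(train_costs), train_costs[-1], best_test)
```

Plotting `train_costs` and `test_costs` against the iteration number gives the two curves to compare.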

Is this clear? Note that I have NEVER mixed any training and testing data up. Nobody should.

Also, you will not find anywhere that we estimate the cost on the whole dataset (instead of separately on the training set and testing set). Don’t look for it in my reply. If you see someone saying it, show me.


Hey @Basit_Kareem,

I have to go away for 2-3 hours, but when I come back, I will check out your reply in this thread.



Wow. Okay. I will try this out to see the difference.