Why do we need to have a validation set for training?

I was reading a blog after watching this week's course videos and found that the cross-validation set is passed to model.fit, and that information from it is often incorporated into the model during the training process.

Aren't we supposed to use only the training data X_train to optimise the weights? Also, if the validation set is going to be used anyway, couldn't we just concatenate it with X_train, something like model.fit(X_train.concat(X_val), y_train.concat(y_val))?

Basically, this line from the above-mentioned article is not clear to me:

the model occasionally sees this data, but never does it “Learn” from this. We use the validation set results, and update higher level hyperparameters. So the validation set affects a model, but only indirectly.

In the 2nd video I learnt that the CV dataset is supposed to be used to compare different models, and then the test dataset is used to check whether the selected model (the one with the least error on the CV data) generalises or not.

Yes, the model does not train on the CV data; it only gets evaluated on it during the training phase. Once you see the performance on the CV data, you go back to the model, change hyperparameters such as learning rate, model architecture, batch size, etc., and then train the model again with the new hyperparameters (this is considered a new model here) on the training set. Then you test again on the CV data. If at some point you are satisfied with the CV performance, you don't go back to change hyperparameters, but go ahead and evaluate on a third set, the test set.
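To make that loop concrete, here is a minimal sketch in Keras. The data arrays (x_train/y_train, x_cv/y_cv, x_test/y_test), the tiny architecture, and the hyperparameter values are just placeholders for illustration, not anything from the course:

import tensorflow as tf

def build_model(units, learning_rate):
    # Small placeholder architecture for a binary classification task.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    return model

best_model, best_cv_loss = None, float('inf')

# Each hyperparameter setting is a "new model" in the sense used above.
for units, lr in [(16, 1e-3), (64, 1e-3), (64, 1e-4)]:
    model = build_model(units, lr)
    model.fit(x_train, y_train, epochs=20, verbose=0)        # weights learned from the training set only
    cv_loss, cv_acc = model.evaluate(x_cv, y_cv, verbose=0)   # CV set only produces a score
    if cv_loss < best_cv_loss:
        best_model, best_cv_loss = model, cv_loss

# The test set is touched exactly once, at the very end.
test_loss, test_acc = best_model.evaluate(x_test, y_test, verbose=0)

The point is that model.fit only ever sees the training split, while the CV and test splits are only scored with evaluate.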


During training, the model “Learns” - the outcome of the Learning process is the final set of values for the w's and b's. A dataset is a “Training” dataset if the values of w and b are updated/modified directly based on how the model performs on that dataset.

The performance of the trained model is checked against the CV set. The CV set performance is used to select the best model, with respect to the hyperparameter values…but the “Learning” is still done purely on the training set.


Note that changing the architecture or hyperparameter(s) as a result of information derived from the validation set will result in new values of w and b when a new model is trained. I suggest both the new weights and the improved model architecture reflect learning, and what you probably want is a model that performs well against the validation set (and ultimately against the test set) even if that means it performs worse against the training set. If learning only occurred against the training set, then everyone would completely overfit every time, no?

So it is like learning from examples and then testing your learning on practice problems. If there is an error or mismatch, you learn again from the training examples and test your learning on the practice problems until you are confident enough to actually take the exam (test data). If you pass, it is like you have graduated, and now you need to apply your learning to real-world data (of course it can be wrong, not 100% accurate). Am I thinking right, @gent.spah?

Aren't hyperparameters usually set before training, at model.compile and keras.Sequential? Then how come they are changed while learning? I know the LR is changed by Adam, but let's not discuss that.

@ai_curious

Granted, changing the architecture will definitely impact the values of w and b. However, the values of w and b are updated purely based on the derivative of the cost that was calculated on the training set. The cost on the CV set does not play any direct part in the update equations for w and b - hence the mention of the CV set playing an indirect part.
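In symbols, taking gradient descent as the example update rule with \alpha as the learning rate:

\displaystyle w := w - \alpha \frac{\partial J_{train}(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J_{train}(w,b)}{\partial b}

J_{cv} never appears in these updates; it only influences which configuration you decide to train next.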

By “Learning” I meant, for example, the gradient descent algorithm, which, as mentioned above, looks only at the training set. But I see your point about the term “Learning”, in the purer sense, being applied in a wider, all-encompassing context.

So you mean when we do model.fit(… validation_split=0.3) or model.fit(… validation_data=(x_cv, y_cv)) we are selecting multiple models at each epoch? I would argue no, that is not the case, because a different model means a different config: # layers, # units, activation, etc.

Idk, I might be wrong; please correct me if so.

No, we complete the training and then check on the CV set.


During network training with model.fit, within each epoch I would say the val_loss changes at each step. Is that because of the change in the X_train batch, or because of the updates to the params after each step?

Agree completely. One might argue that since current common practice is for humans to learn from the validation set and manually implement changes, it isn't strictly machine learning. But clearly some types of learning algorithms have done this type of automation for a long time, and there is no reason that validation set accuracy couldn't be incorporated into a second level of loss and optimization. This could either be random mutation or itself be trained/learned.
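For example, a very crude version of that "second level" could be random mutation of hyperparameters scored by validation accuracy. This is only a sketch; the arrays x_train/y_train and x_cv/y_cv and the tiny model are placeholders:

import random
import tensorflow as tf

def cv_accuracy(units, lr):
    # Train a small placeholder model on the training set, score it on the CV set.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=10, verbose=0)   # gradients use the training set only
    _, acc = model.evaluate(x_cv, y_cv, verbose=0)      # validation set only yields a score
    return acc

best = {'units': 32, 'lr': 1e-3}
best_acc = cv_accuracy(**best)

for _ in range(5):
    # "Outer" optimization loop: mutate the current best hyperparameters at random.
    candidate = {'units': random.choice([16, 32, 64, 128]),
                 'lr': best['lr'] * random.choice([0.5, 1.0, 2.0])}
    acc = cv_accuracy(**candidate)
    if acc > best_acc:   # validation accuracy, not training loss, drives this loop
        best, best_acc = candidate, acc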

Would that be grid search, or “Bayesian optimization” (which assumes the cost surface to be, e.g., a Gaussian process)?


Yeah, your thinking is right.

You either change them manually, or you may change some of them using grid search or other techniques, but for our purposes here let's say manually.

You are correct that some optimizers ^{*} dynamically change some parameter values during training of a single model, but I think for the most part that is not what anyone on the thread is discussing. We’re talking about feedback incorporated into training of a completely separate model.

^{*} You can also achieve this using the Keras Learning Rate Scheduler mechanism https://keras.io/api/callbacks/learning_rate_scheduler/
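For reference, a minimal sketch of how that scheduler is used, following the pattern in the Keras docs; the compiled model and the training arrays are assumed to already exist:

import tensorflow as tf

def schedule(epoch, lr):
    # Halve the learning rate every 10 epochs (epoch is 0-based).
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)

# `model`, x_train and y_train are placeholders assumed to exist already.
model.fit(x_train, y_train, epochs=30, callbacks=[lr_callback])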

Areas for ongoing research. See for example
https://dl.acm.org/doi/pdf/10.1145/3292500.3330648

Auto-Keras: An Efficient Neural Architecture Search System

which does propose Bayesian optimization. One overview and discussion of Neural Architecture Search (NAS) strategies is here https://arxiv.org/pdf/2006.02903.pdf
A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions

Interesting. I spent some time on the first paper you suggested; it actually comes with a released package called Auto-Keras, and there is a tutorial for it. Seems pretty easy to use. I should give it a try some time.
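From skimming the tutorial, my impression is that usage boils down to roughly the following (untested sketch with placeholder image arrays; check the Auto-Keras docs for the exact API):

import autokeras as ak

# Auto-Keras searches over candidate architectures; the search is guided by
# performance on a held-out validation split, not by the training loss itself.
clf = ak.ImageClassifier(max_trials=3)                 # try up to 3 candidate models
clf.fit(x_train, y_train, validation_split=0.2, epochs=10)
print(clf.evaluate(x_test, y_test))

best_model = clf.export_model()                        # the winner, as a plain Keras model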

Thank you, @ai_curious!

Raymond


@tbhaxor et al

I think folks on this thread are in violent agreement about this topic, but I saw a related question on StackExchange today that prompted me to go read a pretty famous textbook I have a copy of (see the reference at the bottom).

Here is some great stuff from chapter 7. Model Assessment and Selection…

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

Training error is the average loss over the training sample

\displaystyle err = \frac{1}{N}\sum_{i=1}^{N}L(y_i,\hat{f}(x_i))

We would like to know the expected test error of our estimated model \hat{f}. [NOTE: these classes typically use \hat{y} instead of \hat{f}.] As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. There is some intermediate model complexity that gives minimum expected test error.

Typically our model will have a tuning parameter or parameters \alpha and so we can write our predictions as \hat{f}_\alpha (x). The tuning parameter varies the complexity of our model, and we wish to find the value of \alpha that minimizes error, that is, produces the minimum of the average test error.

It is important to note that there are in fact two separate goals that we might have in mind:

  • Model selection: estimating the performance of different models in order to choose the best one.

  • Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing.

Yeah, that.

Models are trained on the training set. Then the validation set is used to perform model selection by estimating the prediction error of models trained with differing \alpha. Without a validation set, one might select a model based only on minimizing training error, which leads to overfitting, i.e. high variance. Finally, the test set is used to perform model assessment, estimating the generalization error of the selected model.
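A rough sketch of that three-way workflow, with polynomial degree playing the role of the tuning parameter \alpha and the 50/25/25 split from the quote (X and y are placeholder arrays, and scikit-learn is just a convenient way to illustrate it):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 50% train, 25% validation, 25% test, as in the quote.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model selection: polynomial degree plays the role of the tuning parameter alpha.
best_degree, best_val_mse = None, np.inf
for degree in [1, 2, 3, 5, 8]:
    candidate = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    candidate.fit(X_train, y_train)                              # fit on the training set only
    val_mse = mean_squared_error(y_val, candidate.predict(X_val))
    if val_mse < best_val_mse:
        best_degree, best_val_mse = degree, val_mse

# Model assessment: estimate the generalization error of the chosen model, once.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))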

Hastie, Tibshirani & Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.


So, as I understand it, it is like optimising the parameters based on the training data (J_{train}) and then doing a "small" test to check whether the model will perform well on the test data or not. This is just to save the time of fully training the model and only then evaluating it.

Because while learning about TensorFlow callbacks, I saw it is advised to monitor val_loss; in fact, it is the default in the official documentation:

tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0,
    patience=0,
    verbose=0,
    mode='auto',
    baseline=None,
    restore_best_weights=False,
    start_from_epoch=0
)

The training data directly affect all the parameters, but since training can be stopped by the above callback based on val_loss, the validation data affect them indirectly (there is no gradient optimisation on the validation loss, unlike J_{train}).
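For completeness, a minimal sketch of how that callback is typically wired up (assuming a compiled model and the same placeholder splits as earlier in the thread):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',            # the monitored metric comes from the validation data
    patience=5,                    # stop after 5 epochs without improvement
    restore_best_weights=True,     # roll back to the weights with the best val_loss
)

# `model` is assumed to be already compiled; x_train/y_train and x_cv/y_cv are placeholders.
history = model.fit(
    x_train, y_train,              # gradients are computed on the training data only
    validation_data=(x_cv, y_cv),  # val_loss is computed on this set at the end of each epoch
    epochs=100,
    callbacks=[early_stop],
)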

It is not really about doing a small test to check whether the model will perform well on the test set. It is more about selecting the best model, as has been mentioned by others.

In the case of deep learning, you may have a set of losses you want to try, you may wonder about the best backbone, neck, or head, or whether you need to add some type of regularisation to the mix, among other things. Another choice for any model trained with an iterative approach is when to stop (early stopping is one way, as you mentioned).

Many of these choices can make the model overfit, so the training dataset cannot be used to compare them; you need another dataset for this purpose. Now, you may be tempted to use the test dataset, but the problem is that you are selecting the best model among several options, so your estimates on those data become biased: you are actually training, just within a different framework. So you need a separate dataset to perform this selection, and this is the validation dataset.

By the way, a model is composed of a set of parameters plus the logic/architecture that defines how to use them. So at each iteration during the training process we have a different model, and we also get different ones by changing any other hyperparameter. What I want to say is that model selection and hyperparameter selection are actually the same thing, just in case that was not clear.

Hope it helps