I still have some questions about the training set, validation set, and test set after the DLS course and some hands-on experience. Here's my understanding of them, and I want to make sure I'm not misunderstanding anything! The following are my takeaways from the course and my experience; please tell me if I got anything wrong, and help me understand this topic a bit better!
Training set would change the parameters of my network when it fits the dataset, and validation set represents an unbiased evaluation of the network because it cannot change the parameters of the network. (I don’t know what cross-validation between models means though)
Test set is similar to the validation set but the NN doesn’t use it in training process. It is used for final evaluation of the performance of the network.
Sometimes not having a test set is okay because the validation set kind of plays the part of the test set, as they come from the same distribution and are only used for model evaluation.
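To make my understanding concrete, here is a small sketch (all names and fractions are just made up by me) of how I imagine the three splits being carved out of one dataset:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out validation and test slices.

    All three splits come from the same distribution because they
    are drawn from a single shuffled pool.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]                # held out until the final evaluation
    val = shuffled[n_test:n_test + n_val]   # used for model selection / tuning
    train = shuffled[n_test + n_val:]       # used to fit the weights
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # → 800 100 100
```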
“Training set would change the parameters of my network when it fits the dataset”
Partly: the training set itself doesn't change anything; it is simply the data the model is fit to. What changes during training are the model weights (coefficients), and the weights are the network's learnable parameters — the optimizer updates them. The knobs that training does not touch are the hyperparameters, which you set yourself.
Notes: Hyperparameter tuning is an iterative engineering exercise. You would typically start with a benchmark model, evaluate its accuracy, and then tune different hyperparameters to reduce overfitting.
For example: setting the number of units in a dense layer, choosing a loss function, and adjusting the learning rate are all examples of hyperparameter tuning.
There are a number of strategies and software tools designed specifically to reduce overfitting. As you adjust these hyperparameters and re-evaluate, you will notice changes and can iterate toward improvements.
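For instance, a simple grid search is just a loop that keeps whatever scores best on the validation set. A toy sketch — the `val_accuracy` function here is a stand-in for actually training a model and scoring it on the validation set, and its formula is invented for illustration:

```python
def val_accuracy(learning_rate, hidden_units):
    # Toy stand-in for "train the model, score it on the validation set".
    # Pretend accuracy peaks at lr=0.01 with 64 hidden units.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_units - 64) / 1000

best = None
for lr in [0.1, 0.01, 0.001]:      # hyperparameter: learning rate
    for units in [32, 64, 128]:    # hyperparameter: layer width
        acc = val_accuracy(lr, units)
        if best is None or acc > best[0]:
            best = (acc, lr, units)

print("best (accuracy, lr, units):", best)  # → (1.0, 0.01, 64)
```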
“validation set represents an unbiased evaluation of the network because it cannot change the parameters of the network”
Incorrect: the validation set is not fully unbiased, because it is used during training — not to update the weights directly, but to guide decisions such as hyperparameter choices and early stopping. Information from it therefore leaks into the final model, which is why a separate test set gives the cleaner estimate.
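To illustrate the difference: early stopping is one common way the validation set influences training without ever updating the weights itself — it only decides when to stop and which checkpoint to keep. A toy sketch with made-up loss numbers (not a real training run):

```python
# Made-up loss curves: validation loss starts rising after epoch 3,
# a classic sign of overfitting.
train_loss = [0.9, 0.6, 0.4, 0.30, 0.25, 0.22, 0.20]
val_loss   = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.50]

best_epoch, best_val = 0, float("inf")
patience, bad_epochs = 2, 0
for epoch, v in enumerate(val_loss):
    if v < best_val:
        best_val, best_epoch, bad_epochs = v, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping triggered
            break

print("stop after epoch", epoch, "- keep weights from epoch", best_epoch)
# → stop after epoch 5 - keep weights from epoch 3
```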
“I don’t know what cross-validation between models means though”
Cross-validation is a resampling procedure for estimating how a model will generalize to an independent data set: the data is repeatedly re-split into training and validation folds, the model is trained on one part and evaluated on the other, and the scores are averaged. "Between models" just means you can compare candidate models by their cross-validated scores.
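A minimal sketch of k-fold cross-validation in plain Python (no ML library assumed) — each fold serves once as the held-out validation set:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; yield
    (train_indices, val_indices) with each fold held out once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx

# Each of the 5 splits uses a different 20% slice for validation;
# averaging the 5 validation scores estimates generalization.
for train_idx, val_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(val_idx))  # → 8 2, five times
```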
“Test set is similar to the validation set but the NN doesn’t use it in training process. It is used for final evaluation of the performance of the network.”
Correct
“Sometimes not having a test set is okay”
It depends: the test set is data the model has not seen during either training or tuning, and it is what gives you an unbiased final estimate of accuracy. You can sometimes get by with only a validation set, but because you used that set to make modeling decisions, its accuracy estimate will be optimistic; keep a separate test set whenever you need a trustworthy final number.
Thank you both for your replies! They clarify a lot of things in my head. But I still have some questions about this topic.
Like the problem in the DLS course video, I am tackling a task that requires me to perform an analysis on different subsets (or sources) of the same kind of data (kind of like the example of web-page cats (large set) and consumer-camera cats (small set)). Is it sensible for me to do the following?
Step 1. Inject some consumer-camera cat examples into my training set, which is primarily web-page cats
– so that the NN can learn from a bigger dataset to achieve higher accuracy
Step 2. Use a higher percentage of consumer-camera cats in my validation and test sets
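Something like this is what I have in mind — the counts and the 50/50 hold-out fraction below are just illustrative, not from the course:

```python
import random

def build_splits(web_cats, consumer_cats, seed=0):
    """Step 1: inject part of the small consumer set into training.
    Step 2: build val/test entirely from held-out consumer images,
    so evaluation reflects the distribution we actually care about."""
    rng = random.Random(seed)
    web, consumer = web_cats[:], consumer_cats[:]
    rng.shuffle(web)
    rng.shuffle(consumer)
    n_hold = len(consumer) // 2          # hold out half the consumer set
    held = consumer[:n_hold]
    val = held[:n_hold // 2]
    test = held[n_hold // 2:]
    train = web + consumer[n_hold:]      # mostly web cats + injected consumer cats
    rng.shuffle(train)
    return train, val, test

web = [("web", i) for i in range(10000)]
cons = [("consumer", i) for i in range(1000)]
train, val, test = build_splits(web, cons)
print(len(train), len(val), len(test))  # → 10500 250 250
```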