In the notebook mentioned in the title, in the Classification section (the last part), the best model was the second neural network. I don’t understand why the second neural network performed better than the third one. I thought a larger neural network would in most cases give better performance. Please advise.
A larger model can overfit, performing well on the training set but poorly on the test set. Large models can also suffer from vanishing and exploding gradient problems. So it cannot be guaranteed that a larger model is better. Check the accuracy or loss (on the test/CV data) of the 2nd and 3rd models. Which one is better?
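For example, here is a minimal sketch of how you could compare the two models on the cross-validation set. I'm assuming model_2 and model_3 are already trained, that the CV arrays are called X_cv and y_cv (names may differ in your notebook), and that the linear output layer means the predictions are logits that get passed through a sigmoid before thresholding:

    import numpy as np
    import tensorflow as tf

    def cv_classification_error(model, X_cv, y_cv, threshold=0.5):
        """Fraction of CV examples the model misclassifies."""
        logits = model.predict(X_cv)            # output layer is linear, so these are logits
        probs = tf.nn.sigmoid(logits).numpy()   # convert logits to probabilities
        yhat = (probs >= threshold).astype(int).reshape(-1)
        return np.mean(yhat != y_cv.reshape(-1))

    # Compare the two candidates on the same CV split
    for m in (model_2, model_3):
        print(m.name, cv_classification_error(m, X_cv, y_cv))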
Here are the evaluation metrics from the notebook:
Model 1: Training Set Classification Error: 0.44167, CV Set Classification Error: 0.47500
Model 2: Training Set Classification Error: 0.11667, CV Set Classification Error: 0.07500
Model 3: Training Set Classification Error: 0.41667, CV Set Classification Error: 0.47500
As you can see, each model's training error is close to its cross-validation error, so it may not be an overfitting problem.
Regarding the statement ‘Large models can have vanishing and exploding gradient problems too’: can you help me understand this in more detail, or point me to a link that goes over it?
Below is the function from the notebook that defines the three models:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def build_models():
    tf.random.set_seed(20)  # fix the seed so weight initialization is reproducible

    # Model 1: 2 hidden layers
    model_1 = Sequential(
        [
            Dense(25, activation='relu'),
            Dense(15, activation='relu'),
            Dense(1, activation='linear')
        ],
        name='model_1'
    )

    # Model 2: 4 hidden layers
    model_2 = Sequential(
        [
            Dense(20, activation='relu'),
            Dense(12, activation='relu'),
            Dense(12, activation='relu'),
            Dense(20, activation='relu'),
            Dense(1, activation='linear')
        ],
        name='model_2'
    )

    # Model 3: 5 hidden layers (the deepest of the three)
    model_3 = Sequential(
        [
            Dense(32, activation='relu'),
            Dense(16, activation='relu'),
            Dense(8, activation='relu'),
            Dense(4, activation='relu'),
            Dense(12, activation='relu'),
            Dense(1, activation='linear')
        ],
        name='model_3'
    )

    model_list = [model_1, model_2, model_3]
    return model_list
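For completeness, the notebook trains these models before computing the errors above, roughly along these lines. This is only a sketch; the loss, optimizer settings, epoch count, and the data names X_train / y_train are assumptions on my part, not copied from the notebook. The from_logits=True loss matches the linear output layer:

    models = build_models()
    for model in models:
        model.compile(
            loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        )
        model.fit(X_train, y_train, epochs=200, verbose=0)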
Which one is better? Hint: a good model has a small error.
You will learn more about this in DLS Course 2. I don’t remember whether MLS covers it; maybe it does.
My question was: why did Model 3 perform poorly? It sounds like that cannot be answered just by looking at the layers in model_3.
As the number of layers increases, there is a chance that performance will decrease. Try increasing the number of layers of the third model; you will (most likely) see that its error becomes even larger than it is now.
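For example, a deeper variant you could try (a hypothetical sketch for the experiment, not code from the notebook):

    # Hypothetical deeper variant of model_3 to experiment with
    model_3_deeper = Sequential(
        [
            Dense(32, activation='relu'),
            Dense(16, activation='relu'),
            Dense(8, activation='relu'),
            Dense(8, activation='relu'),
            Dense(4, activation='relu'),
            Dense(4, activation='relu'),
            Dense(12, activation='relu'),
            Dense(1, activation='linear')
        ],
        name='model_3_deeper'
    )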
Now the question is, why does this happen? There may be multiple answers, but the one I mentioned earlier is vanishing or exploding gradients.
Vanishing means the slope (gradient) becomes too small to update the parameters, so training gets stuck. Exploding means the slope becomes too large, so the parameters jump back and forth and never converge.
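A toy illustration of why depth makes this worse: during backpropagation the gradient reaching the early layers is roughly a product of one factor per layer, so if each factor is a bit below 1 the gradient shrinks exponentially, and if a bit above 1 it blows up. This is a standalone sketch, not code from the notebook:

    layers = 30
    small_factor, large_factor = 0.5, 1.5

    # Backprop multiplies one factor per layer, so the gradient scale at the
    # first layer behaves roughly like factor ** depth.
    print("vanishing:", small_factor ** layers)  # ~9.3e-10, almost no update signal
    print("exploding:", large_factor ** layers)  # ~1.9e+05, huge, unstable updates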
Also, increasing the number of layers may require tweaking other hyperparameters, such as the learning rate or the number of iterations. I am not sure about that.