I have prepared a solution for the Week 1 Cats vs. Dogs assignment, but somehow training accuracy is not improving. Validation accuracy is above 80%, but training accuracy is stuck at around 86-87%. I have tried various structures, such as adding more layers as shown below, but nothing is working. Can you please help me figure out what is missing?
[Removed code]
Below is my output:
Epoch 5/15
2250/2250 [==============================] - 94s 42ms/step - loss: 0.3982 - accuracy: 0.8366 - val_loss: 0.3962 - val_accuracy: 0.8308
Epoch 6/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3952 - accuracy: 0.8452 - val_loss: 0.4186 - val_accuracy: 0.8604
Epoch 7/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3907 - accuracy: 0.8476 - val_loss: 0.6279 - val_accuracy: 0.8260
Epoch 8/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3732 - accuracy: 0.8547 - val_loss: 0.3105 - val_accuracy: 0.8876
Epoch 9/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3777 - accuracy: 0.8552 - val_loss: 0.9783 - val_accuracy: 0.8384
Epoch 10/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3729 - accuracy: 0.8577 - val_loss: 0.3625 - val_accuracy: 0.8584
Epoch 11/15
2250/2250 [==============================] - 94s 42ms/step - loss: 0.3697 - accuracy: 0.8663 - val_loss: 0.2649 - val_accuracy: 0.8944
Epoch 12/15
2250/2250 [==============================] - 94s 42ms/step - loss: 0.3889 - accuracy: 0.8679 - val_loss: 0.3938 - val_accuracy: 0.8472
Epoch 13/15
2250/2250 [==============================] - 96s 43ms/step - loss: 0.3647 - accuracy: 0.8676 - val_loss: 0.3323 - val_accuracy: 0.8852
Epoch 14/15
2250/2250 [==============================] - 93s 41ms/step - loss: 0.3646 - accuracy: 0.8650 - val_loss: 0.4246 - val_accuracy: 0.8712
Epoch 15/15
2250/2250 [==============================] - 94s 42ms/step - loss: 0.3719 - accuracy: 0.8655 - val_loss: 0.2876 - val_accuracy: 0.8828
Those layers with softmax are the likely culprit. That activation is not generally used for hidden layers in a binary problem. Try changing just that and let us know the results?
PS: to the best of my knowledge the reason is related to vanishing gradients and the difference between the output magnitudes of relu vs softmax (the latter is constrained to sum to 1.0). Maybe one of the math wonks can weigh in here?
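For a bit of intuition on that magnitude point, here is a tiny sketch (nothing to do with the assignment code, just an illustration with made-up pre-activations): softmax squeezes all 512 hidden units into values that must sum to 1.0, so each individual value, and the gradient flowing through it, is tiny compared to relu.

import tensorflow as tf

# Illustration only: 512 made-up pre-activations for one hidden layer
z = tf.random.normal([512], seed=0)

relu_out = tf.nn.relu(z)          # unbounded; individual values stay roughly O(1)
softmax_out = tf.nn.softmax(z)    # all 512 values forced to sum to 1.0

print(float(tf.reduce_max(relu_out)))      # typically a few units in size
print(float(tf.reduce_sum(softmax_out)))   # exactly 1.0
print(float(tf.reduce_max(softmax_out)))   # tiny, so downstream signals shrink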
No, the softmax part is commented out. I had tried that but it didn't work. Effectively it's this (same network as shown in the videos):
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
I think I solved it: in place of RMSprop I used Adam and it's working. Grateful if someone could clarify what the reason is behind the difference in performance of these two optimizers?
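For reference, the change is just the optimizer passed to model.compile on the network above. This is only a sketch of what I mean: the loss and metrics are the standard binary-crossentropy setup from the course, and the learning rate here is a placeholder, since my actual compile call was in the code I removed from the post.

# Before: RMSprop, which plateaued around 86-87% training accuracy
# model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
#               loss='binary_crossentropy',
#               metrics=['accuracy'])

# After: Adam, keeping everything else the same
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy',
              metrics=['accuracy'])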
There are lots of discussions comparing and contrasting the two if you search the interweb for “rmsprop vs adam”. This one covers a lot of ground: An overview of gradient descent optimization algorithms, and it includes this paragraph:
In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [14:1] show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.
NOTE: my emphasis added
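To make that paragraph concrete, here is a rough sketch of the two update rules for a single parameter. It is simplified (per-parameter scalars, Keras-default-style hyperparameter values as assumptions), but it shows exactly what Adam adds on top of RMSprop: a momentum term plus bias correction.

import numpy as np

def rmsprop_step(theta, g, v, lr=0.001, rho=0.9, eps=1e-7):
    # Running average of squared gradients; the step is scaled by its square root
    v = rho * v + (1 - rho) * g**2
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # Same squared-gradient average as RMSprop...
    v = beta2 * v + (1 - beta2) * g**2
    # ...plus a momentum-style running average of the gradient itself...
    m = beta1 * m + (1 - beta1) * g
    # ...plus bias correction (t is the 1-based step count) so the averages
    # aren't underestimated in the first few steps
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v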
Thanks a lot! With RMSprop, would it help if I reduce the learning rate further? It feels like with RMSprop the model is stuck at around 87% accuracy.
One of the true math wonks might have a better answer for you, but my own response is somewhere between ‘no’ and ‘it depends’. ‘No’ because RMSprop doesn’t use a constant learning rate anyway; its value at any given iteration is driven by circumstances at that moment, and the learning rate you provide is just that, a starting point. ‘It depends’ because learning rate isn’t the only factor at play: the data itself, the number of epochs, and the batch size all make a difference. You could experiment yourself and plot curves, say by iteratively changing the starting learning rate while holding the others constant, then changing the number of epochs, and so on (a rough sketch of such a sweep is below). Note that 15 is a rather small number of epochs, and you may not know yet whether you have truly reached an optimum or are just at a plateau. Let us know what you find?
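If you want to run that experiment, something along these lines would do it. This is only a sketch under assumptions: build_model(), train_generator, and validation_generator stand in for whatever is in your removed code, and the learning-rate values and epoch count are arbitrary.

import tensorflow as tf

results = {}
for lr in [1e-3, 1e-4, 1e-5]:                      # arbitrary starting rates to sweep
    model = build_model()                          # hypothetical: rebuilds the same architecture each time
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(train_generator,           # hypothetical generators from your removed code
                        validation_data=validation_generator,
                        epochs=30, verbose=0)      # more than 15, to see past any plateau
    results[lr] = history.history['val_accuracy'][-1]

for lr, acc in results.items():
    print(f'starting lr={lr}: final val_accuracy={acc:.4f}')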