Optimizer and different performances

For the training step of the final C2W1 assignment (dogs vs. cats classification), if I use the RMSprop optimizer with learning rate = 0.001, I get the following plot for accuracy per epoch (training in red and validation in blue):


And the loss per epoch :


With this type of optimizer (please, correct me if I am wrong):

  • I do not reach the required training and validation accuracy
  • It is fast (training completes in about 20 minutes)
  • After 8 epochs I notice a kind of plateau in the training accuracy. Is it a sign that my convnet is no longer learning?
  • After 12 epochs, the training loss seems to increase slightly. That’s a bad sign, isn’t it?
  • Validation accuracy and loss do not seem stable yet. Should I increase the number of epochs (but my net is not learning anymore, right)?

On the other hand, if I use the ADAM optimizer, I observe:

  • I reach the required training and validation accuracy at epoch 8
  • The training process is incredibly slow: after almost 2 hours it is still running!

On the C2_W1_Lab_1_cats_vs_dogs notebook, I read:

"[…] using the RMSprop optimization algorithm is preferable to stochastic gradient descent (SGD), because RMSprop automates learning-rate tuning for us. (Other optimizers, such as Adam and Adagrad, also automatically adapt the learning rate during training, and would work equally well here.)"

Are ADAM and RMSprop equivalent in performance?
Why does ADAM seem to work better than RMSprop?
Are there criteria I should know for choosing one optimizer over another?

Hi Gloria,

Regarding why RMSProp is faster than Adam, I don’t have a good answer to that. The Adam optimizer came later and tries to bring together the best of momentum optimization and RMSProp, but that shouldn’t lead to the time difference you are seeing. Maybe someone else can shed some light on it.
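
To make the "momentum plus RMSProp" point concrete, here is a minimal sketch of the two update rules for a single scalar weight, written in plain Python rather than Keras. The function names, the toy loss f(w) = w**2, and the hyperparameter values are my own assumptions for illustration, not taken from the course notebook. Adam keeps one extra running average (the momentum term `m`) and adds bias correction, which is a small amount of extra arithmetic per step and would not by itself explain a slowdown of several times.

```python
import math

def rmsprop_step(w, grad, state, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update: scale the gradient by a running RMS of past gradients."""
    state["v"] = rho * state["v"] + (1 - rho) * grad ** 2
    return w - lr * grad / (math.sqrt(state["v"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update: RMSprop-style scaling plus a momentum term and bias correction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # momentum (first moment)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # squared-gradient average
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias-corrected moments
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(step_fn, state, w0=5.0, steps=3000, lr=0.01):
    """Minimize the toy loss f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = step_fn(w, 2 * w, state, lr=lr)
    return w

w_rmsprop = minimize(rmsprop_step, {"v": 0.0})
w_adam = minimize(adam_step, {"m": 0.0, "v": 0.0, "t": 0})
print(f"RMSprop: w = {w_rmsprop:.4f}, Adam: w = {w_adam:.4f}")
```

Both runs should end close to the minimum at w = 0. On a toy problem like this, the two optimizers behave similarly, so it says nothing about which one is better on the cats vs. dogs data.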

With only a few exceptions, ADAM should give better results than RMSProp. In the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, the author provides a good set of defaults for training a neural network, which usually (but not always) lead to a good result:

  • Initialization – He Initialization
  • Activation function – ELU
  • Normalization – Batch Normalization
  • Regularization – Dropout
  • Optimizer – ADAM
  • Learning Rate Schedule – None

To answer some of your specific questions:

  1. After 8 epochs, the plateau in training accuracy does seem to indicate that the model is no longer learning anything useful, but the plot covers only 6 epochs after that point, so I would wait a little longer before concluding that. If the training time is high and we aren’t sure how many epochs we will need, early stopping is a good technique. I often just specify a very high number of epochs, pass an early-stopping callback, and ask it to stop if there’s no improvement in the validation accuracy for 15-20 epochs.
  2. An increase in training loss is a bad sign, but again I wouldn’t conclude anything on the basis of 2 epochs; I would wait a bit more.
  3. Since validation accuracy and loss aren’t stable yet, I would recommend waiting a bit more.
  4. ADAM will usually (but not always) outperform RMSProp, as it came after RMSProp and enhances it.
  5. Criteria to choose: you could start with the list I mentioned earlier from the book Hands-On Machine Learning. Those are good defaults to begin with; if they don’t provide the required level of accuracy, then it’s worth experimenting.
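
The early-stopping rule from point 1 can be sketched in plain Python as below. This is only an illustration of the patience logic: the function name `early_stopping_epoch` and the accuracy values are made up, and in Keras you would normally not write this yourself but pass `tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=15, restore_best_weights=True)` in the `callbacks` list of `model.fit`.

```python
def early_stopping_epoch(val_accuracies, patience=15):
    """Return (epoch where training stops, best epoch), both 1-based.

    Training stops once `patience` epochs in a row have passed
    without the validation accuracy beating its best value so far.
    """
    best_acc = float("-inf")
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never ran out: training ran for every epoch supplied.
    return len(val_accuracies), best_epoch

# Hypothetical validation accuracies, one value per epoch.
history = [0.60, 0.68, 0.72, 0.75, 0.74, 0.75, 0.73, 0.74]
stopped_at, best_epoch = early_stopping_epoch(history, patience=3)
print(f"stopped at epoch {stopped_at}, best epoch was {best_epoch}")
```

I used a small patience here only so the example triggers on a short list. Setting a large epoch count together with a patience like this is what lets you "wait a bit more" without having to guess the right number of epochs in advance.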

Hope this helps.

Hi SomeshChatterjee,

Many thanks for your reply and explanation.
I have the book you suggested, and I am going to study from it (Chapter 11 in particular, I guess).

I hadn’t thought of increasing the number of epochs because the accuracy looked stable to me from one epoch to the next, and I thought more epochs might lead me to overfit my model (is this correct?)…

It seems that, when in doubt, increasing the number of epochs is always a good solution (together with implementing early stopping).