Nesterov momentum acceleration

In the programming assignment for week 2 of course 2, we saw that Adam optimization reached a minimum much more quickly than either momentum or RMSprop. I’m curious whether anyone has experience with Nesterov momentum acceleration, and how it compares to Adam.

Nesterov is a small adjustment to regular momentum optimization. The only difference is that when updating the momentum term, you use the gradient computed at the location you would reach if you stepped in the direction of the current momentum, rather than at the current location. You then modify the momentum using this look-ahead gradient, and step in the direction of the modified momentum. In other words, you steer the momentum towards the gradient computed ahead of you, in the direction of your current momentum, instead of towards the gradient computed where you are.
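To make the description above concrete, here is a minimal NumPy sketch of the two update rules side by side. The toy quadratic cost, the learning rate, the momentum coefficient, and all function names are assumptions chosen for illustration; none of them come from the course notebook. Library implementations (for example Keras with nesterov=True) may use an algebraically rearranged form, so step-by-step values need not match this sketch.

```python
import numpy as np

def grad(theta):
    # Gradient of an assumed toy cost f(theta) = 0.5 * theta^T A theta.
    A = np.array([[10.0, 0.0], [0.0, 1.0]])
    return A @ theta

def momentum_step(theta, m, lr=0.02, beta=0.9):
    # Regular momentum: the gradient is evaluated at the current location.
    m = beta * m - lr * grad(theta)
    return theta + m, m

def nesterov_step(theta, m, lr=0.02, beta=0.9):
    # Nesterov: the gradient is evaluated at the look-ahead point theta + beta * m,
    # i.e. where the current momentum alone would carry us.
    m = beta * m - lr * grad(theta + beta * m)
    return theta + m, m

def run(step, n_steps=100):
    # Apply the given update rule repeatedly from a fixed starting point.
    theta = np.array([1.0, 1.0])
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        theta, m = step(theta, m)
    return theta

if __name__ == "__main__":
    print("momentum:", run(momentum_step))
    print("nesterov:", run(nesterov_step))
```

The only line that differs between the two functions is the point at which grad is evaluated, which is the whole of the adjustment described above.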

I read about this in Hands On Machine Learning (3rd ed), which I’m finding to be a helpful adjunct to this course. I have no connection to the book and have no interest in promoting it, except to help fellow classmates here who might also benefit from it as an advanced resource.

Thanks for mentioning this topic. I’ve never heard of Nesterov Momentum before, so I don’t have any experience with it. Google turns up lots of hits, as one would expect. The top one is to Jason Brownlee’s website, which is always a great place to go for ML topics.

But you’ve got a notebook that shows regular momentum and Adam optimization and compares them. Why not add a few cells and implement Nesterov Momentum and use the same comparison logic? If you’re concerned about the added code breaking the grader, you could create an experimental notebook under a different name …

I didn’t modify that notebook, but I did test this using TensorFlow in the notebook for course 2, week 3. I found that Nesterov was a little bit faster to compute than Adam, but didn’t give quite as good a result on the test set. The graphs were interesting, showing some oscillations. I used process_time for timing, and
optimizer = tf.keras.optimizers.SGD(learning_rate, momentum=0.9, nesterov=True)
was my optimizer. I tried varying the learning_rate and the momentum up and down, but these hyperparameters worked best.
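For anyone who wants to repeat the timing comparison, a small helper along these lines may be handy. This is only a sketch: the helper name timed is hypothetical, and the call to sum stands in for whatever training function you actually run. Note that time.process_time counts CPU time used by the current process and excludes time spent sleeping, so if training runs on a GPU, the wall-clock timer time.perf_counter may give a more representative number.

```python
import time

def timed(fn, *args, **kwargs):
    # Run fn and return (result, CPU seconds used by this process).
    start = time.process_time()
    result = fn(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, elapsed

if __name__ == "__main__":
    # Placeholder workload; replace with the model's training call.
    result, secs = timed(sum, range(1_000_000))
    print(f"{secs:.6f} secs")
```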

I’d like to paste in the graphs, but I don’t know how to do that. I tried “copy cell attachment”, but pasting that just gives me a giant block of text.

Here is the text of the epoch output, anyway.
Cost after epoch 0: 1.836860
Train accuracy: tf.Tensor(0.18055555, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.28333333, shape=(), dtype=float32)
Cost after epoch 10: 1.282068
Train accuracy: tf.Tensor(0.5009259, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.49166667, shape=(), dtype=float32)
Cost after epoch 20: 1.090269
Train accuracy: tf.Tensor(0.57685184, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.31666666, shape=(), dtype=float32)
Cost after epoch 30: 0.998442
Train accuracy: tf.Tensor(0.60925925, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.53333336, shape=(), dtype=float32)
Cost after epoch 40: 0.908589
Train accuracy: tf.Tensor(0.662963, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.55833334, shape=(), dtype=float32)
Cost after epoch 50: 0.857781
Train accuracy: tf.Tensor(0.69074076, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.60833335, shape=(), dtype=float32)
Cost after epoch 60: 0.752651
Train accuracy: tf.Tensor(0.73425925, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.60833335, shape=(), dtype=float32)
Cost after epoch 70: 0.897831
Train accuracy: tf.Tensor(0.675, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.59166664, shape=(), dtype=float32)
Cost after epoch 80: 0.684091
Train accuracy: tf.Tensor(0.7712963, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.65, shape=(), dtype=float32)
Cost after epoch 90: 0.785952
Train accuracy: tf.Tensor(0.7212963, shape=(), dtype=float32)
Test_accuracy: tf.Tensor(0.675, shape=(), dtype=float32)
67.626487752 secs

Very cool! Thanks for doing the experiments and sharing your results. It’s a good point that we don’t really need to build all this stuff for ourselves once we have TF available.

I’ve seen people include graphs natively, but haven’t tried that myself. One “cheesy” way would be to take a screenshot of the graph and then use the little “Up Arrow” tool to include it.

I downloaded the image as a .png and then uploaded.


Interesting! For comparison, what do the graphs for Adam look like? Is this the same dataset and network architecture used in the Optimization assignment in C2 W2? Of course you’re doing SGD, but they used minibatch with a batch size of 64 there, so the results would not be comparable in any case.

I don’t think the dataset is the same. C2W3 uses sign language images and softmax. What I did was simply to change the optimizer for the final model run in C2W3.
Here are the Adam graphs.


Really interesting. As you say, the behavior is much smoother with Adam, although it looks like the end results in terms of accuracy were pretty close in the two cases. In the Adam case, the accuracy curves are smoothly trending in a good direction. It would be worth trying a few more iterations to see whether the results improve further. Then there’s still the “meta” question of how you trade off the lower compute cost of Nesterov Momentum against Adam. If Adam gives genuinely higher accuracy numbers in a more predictable way, then my reaction would be that this trumps the compute cost, unless the difference is pretty extreme.

Thanks very much for sharing your results!

I played around a little with this. The Nesterov run improves a little more up until 120 epochs, then the performance declines. I speculate that learning rate decay, or momentum decay, might help, since it seems to be having problems once it gets near the minimum. Playing around with even this small model gives me an appreciation for the complexity of the landscape. Hyperparameters are one thing, but then there is the possibility of dynamics on the hyperparameters, possibly different dynamics for different ones, and possibly something more complex than simple decay. Perhaps the dynamics depend on other observed variables like cost function velocity. And everything depends on the exact choice of architecture, training set, and on and on! We are certainly in a dark art era of machine learning.
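As a rough illustration of the learning rate decay idea, here is the inverse-time schedule in plain Python: the rate at a given epoch is lr0 / (1 + decay_rate * epoch). The starting rate of 0.1 and decay rate of 0.05 are made-up values for the sketch, not the ones used in the experiment above, and whether this actually helps the Nesterov run past 120 epochs is untested.

```python
def inverse_time_decay(lr0, decay_rate, epoch):
    # Learning rate shrinks as 1 / (1 + decay_rate * epoch).
    return lr0 / (1.0 + decay_rate * epoch)

# Assumed values: lr0 = 0.1, decay_rate = 0.05, sampled every 40 epochs.
schedule = [inverse_time_decay(0.1, 0.05, e) for e in range(0, 121, 40)]

if __name__ == "__main__":
    for epoch, lr in zip(range(0, 121, 40), schedule):
        print(f"epoch {epoch:3d}: learning rate {lr:.4f}")
```

In Keras, the same shape of schedule should be available as tf.keras.optimizers.schedules.InverseTimeDecay, which can be passed as the learning_rate argument of the optimizer instead of a fixed number.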

Yes, the solution spaces here are incredibly complex with lots of non-linear coupling all over the place. Even before you start fiddling with hyperparameters :scream_cat:

Prof Ng does spend quite a bit of time here in Course 2 talking about how to approach tuning hyperparameters in a systematic way. But, as you say, there is still some “art” to it.

The phrase “solution space” always triggers memories of this thread, which is worth a look for the reference to the LeCun paper mentioned there.

Yes, he talks a little about searching for hyperparameters, but doesn’t mention even the idea of dynamics of hyperparameters, except maybe a mention of learning rate decay in passing at one point (not sure, seems like I remember him saying something about it).
I may sound negative here, but I don’t mean to be… I think this is all very exciting, and a “dark art” field is one which is fertile and ripe for opportunity for new discoveries.

I remember at least one section where he talks about interactions between hyperparameters and the advantages of orthogonality. He also covers how to organize the selection of multiple hyperparameters: not by doing a “grid search”, but by using random selection of points in the grid. In my notes, this comes pretty soon after the learning rate decay lecture, so maybe you haven’t gotten to it yet.
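A quick sketch of what random selection can look like in practice, using only the standard library. The function name, the two hyperparameters chosen, and the ranges are all assumptions for illustration. The learning rate is sampled on a log scale so that each decade is equally likely, and momentum is sampled so that 1 - momentum is log-uniform, which spreads the samples sensibly between 0.9 and 0.999.

```python
import random

def sample_hyperparameters(rng):
    # Learning rate: log-uniform over the assumed range [1e-4, 1e-1].
    learning_rate = 10 ** rng.uniform(-4, -1)
    # Momentum: 1 - momentum is log-uniform over [1e-3, 1e-1],
    # so momentum falls between 0.9 and 0.999.
    momentum = 1 - 10 ** rng.uniform(-3, -1)
    return {"learning_rate": learning_rate, "momentum": momentum}

# Seeded generator so the set of trials is reproducible.
rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(5)]

if __name__ == "__main__":
    for t in trials:
        print(t)
```

Each dictionary in trials would then be used for one training run, and the runs compared on the dev set.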

There’s also a lot more discussion of what to do when your model doesn’t perform as well as you need it to in Course 3, although there managing the datasets is also a major focus as opposed to simply adjusting hyperparameters.

But your point about this being a fertile field for new discoveries is also a great way to look at it. Definitely worth some attention once you’ve heard all that Prof Ng says on the various aspects of this.

I want to share a good resource for learning about Nesterov optimization (and many other things as well): the slides by Shubhendu Trivedi for lecture 6 of the 2017 University of Chicago Deep Learning course. (I can’t seem to edit the original thread starter, so hopefully anyone interested will find this comment here.)
Many other sets of slides for the course are also available, and I found them to be very cogent. They are obviously five years old, but for most of the topics discussed this doesn’t matter.
CMSC 35246 Deep Learning - University of Chicago