Dear colleagues,
Greetings!
I am unable to understand what roles the different optimization algorithms play and what their advantages and disadvantages are. I have several questions about these:

Why is gradient descent represented using ellipses? Is the major axis for ‘W’ and the minor axis for ‘b’? There can be tens, hundreds, or thousands of features, so the weights and biases may not form a simple ellipse. What is the significance of the ellipse?

As I understand from the lectures, these algorithms reduce oscillations along certain axes and accelerate progress towards the globally minimal cost. How is that achieved in RMSprop by dividing by sqrt(Sdw) + epsilon?

Similarly, how does gradient descent with momentum achieve the same result?

Why does the Adam algorithm work better than both of the above? How does it combine the benefits of those two algorithms?

We implement optimization either to reduce the loss (with respect to the cost function) or to increase the accuracy.

Here’s a good read that can help you understand the logic behind its use.

GD works by iteratively moving the parameters in the direction of the negative gradient of the cost, so that the error shrinks at each step. The ellipses you see are contour lines of the cost function: every point on a given ellipse has the same cost. The parameters W and b you are asking about are plotted on the two axes of that picture; b is the bias (intercept) and W is the weight (slope), and their values change as training proceeds over the features x and labels y. A related link follows on this topic that will clarify your understanding more.
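To make the update concrete, here is a minimal sketch of plain gradient descent fitting a one-feature linear model y = W·x + b with mean squared error. The names (`lr`, `steps`) and the toy data are my own illustration, not from the course code:

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, steps=200):
    """Fit y ~ w*x + b by plain gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(steps):
        y_hat = w * x + b
        dw = (2.0 / m) * np.sum((y_hat - y) * x)  # dJ/dW
        db = (2.0 / m) * np.sum(y_hat - y)        # dJ/db
        w -= lr * dw                              # step against the gradient
        b -= lr * db
    return w, b

# Toy data generated from y = 1 + 2x, so we expect w -> 2, b -> 1.
w, b = gradient_descent(np.array([0.0, 1.0, 2.0]), np.array([1.0, 3.0, 5.0]))
```

At each step both W and b move a little downhill on the cost surface; the contour-ellipse picture is just this trajectory viewed from above.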

To your second query about oscillations along the axes: these methods update the parameters so that movement along the direction making consistent progress (in the lecture picture, W) is preserved or accelerated, while the back-and-forth movement along the oscillating direction (b) is damped.
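As a small sketch of how RMSprop does this damping (variable names are illustrative): S accumulates an exponentially weighted average of the squared gradient, so a direction whose gradients have been large is divided by a large sqrt(S) and takes smaller steps, while a direction with small gradients keeps relatively larger steps. Epsilon only guards against division by zero.

```python
import numpy as np

def rmsprop_step(w, dw, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a single parameter w with gradient dw."""
    s = beta * s + (1 - beta) * dw ** 2       # running average of dW^2
    w = w - lr * dw / (np.sqrt(s) + eps)      # big past gradients => small step
    return w, s

# One step from s = 0 with a large gradient: the effective step is scaled down.
w, s = rmsprop_step(1.0, 2.0, 0.0)
```

Because the division happens per parameter, the oscillating ‘b’ direction (large squared gradients) is automatically slowed, without slowing the steady ‘W’ direction.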

Tuning the hyperparameters of GD optimization can help it converge toward a global minimum. In this link, you can get an idea of how that is achieved! There’s also this topic that covers a similar idea.
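For momentum, a sketch of the same idea (names are illustrative): v is an exponentially weighted average of past gradients, so components that keep the same sign accumulate into a larger step, while components that flip sign every iteration (the oscillation) average toward zero.

```python
def momentum_step(w, dw, v, lr=0.1, beta=0.9):
    """One gradient-descent-with-momentum update for a single parameter."""
    v = beta * v + (1 - beta) * dw   # running average of past gradients
    w = w - lr * v
    return w, v

# Oscillating gradients (+1, -1, +1, ...) average toward zero,
# while a steady gradient of +1 accumulates toward a full-sized step.
v_osc, v_steady, w_osc, w_steady = 0.0, 0.0, 0.0, 0.0
for t in range(20):
    w_osc, v_osc = momentum_step(w_osc, (-1.0) ** t, v_osc)
    w_steady, v_steady = momentum_step(w_steady, 1.0, v_steady)
```

This is exactly the "damp the b oscillation, keep the W progress" behavior from the lectures, achieved by averaging rather than by dividing.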

Yes, Adam is definitely one such technique, and it has proven effective across a wide spectrum of problems. Here’s a read on why.
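Adam literally combines the two updates above: it keeps momentum’s first-moment average (v) and RMSprop’s second-moment average (s), plus a bias correction for the early iterations when the averages start from zero. A sketch (names are illustrative):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter w at iteration t (t >= 1)."""
    v = beta1 * v + (1 - beta1) * dw        # momentum-style average of dW
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSprop-style average of dW^2
    v_hat = v / (1 - beta1 ** t)            # bias correction for early steps
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = adam_step(0.0, 1.0, 0.0, 0.0, t=1)
```

So it gets momentum’s smoothing of the gradient direction and RMSprop’s per-parameter scaling of the step size at the same time.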

I have tried to cover all your points. If I have left anything untouched or unclear, please let me know and we can discuss it here!

In addition to Rashmi’s excellent and detailed response, the point about the pictures is that (exactly as you say) everything here really involves hundreds, thousands or even millions or billions of dimensions. Unfortunately, you can’t draw pictures in more than 3 dimensions and our poor human brains (or at least my poor human brain, anyway) are only capable of visualizing things in 3 dimensions. So the pictures are a very, very limited attempt to give some intuition within the extreme limitations of visualizing things in 3D. In addition to all the great links that Rashmi included, here’s a thread that has some more discussion about local minima and includes a link to a paper from Yann LeCun’s group about the complexity of loss surfaces.

Thank you, @Rashmi and @paulinpaloalto, for your detailed responses! Much appreciated.
I am not getting the reasoning/intuition behind the approaches. For instance, in RMSprop we multiply Sdw by beta and dW^2 by (1 - beta). This reduces the emphasis on the current iteration, since 1 - beta is smaller than beta for all values of beta greater than 0.5. I am unable to understand the following:

Why does emphasizing the past values expedite gradient descent?

Why is the squared dW term helpful? Why is it not dW^e or even dW^2.5?
I have similar questions about Adam optimization as well.

I don’t know whether I am being dense, but if you can explain, I will be grateful. If you think these questions are not relevant, I will close the thread here. Thank you!

I was only addressing the question of the ellipses in my previous response. It was Rashmi who was doing the “heavy lifting”. But I’ll try to take my turn:

It’s been a while since I watched the relevant lectures here, but I remember Prof Ng spending quite a bit of time on this point. As I recall, the general idea is that you are dealing with gradients that have fairly erratic behavior either because the surfaces are really bizarre or because you get additional stochastic behavior by using small minibatches or both. If you have trouble with convergence because the gradients are erratic, then he describes various “smoothing” techniques all derived from the general idea of exponentially weighted averages. EWA gives you a tunable hyperparameter to control how many past iterations you allow to influence the current iteration before fading out. There are a number of techniques that he presents that are different variations on that idea: getting better convergence by using some form of averaging over recent values or the like to damp down the variability.
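To illustrate the exponentially weighted average idea with a small sketch of my own (not course code): beta is the tunable hyperparameter, and larger beta means roughly the last 1/(1 - beta) values influence the current smoothed estimate before fading out. Smoothing a noisy series this way damps its step-to-step variability, which is exactly the property these optimizers exploit.

```python
import numpy as np

def ewa(values, beta=0.9):
    """Exponentially weighted average with bias correction for early steps."""
    smoothed, s = [], 0.0
    for t, x in enumerate(values, start=1):
        s = beta * s + (1 - beta) * x
        smoothed.append(s / (1 - beta ** t))  # bias-corrected estimate
    return smoothed

# A noisy series around 1.0: the EWA tracks the level but with far
# less step-to-step variation than the raw values.
noisy = 1.0 + 0.5 * np.sin(np.arange(50.0))
out = ewa(noisy, beta=0.9)
```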

These are techniques that various people have come up with and then shown that they work well in some cases. I have not looked at any of the math behind this, but you could certainly try experiments using different exponents for the dW term and see how that affects the results. You might discover the next generation algorithm and have a paper to publish! The other high level point worth noticing here is that everything we are learning feels like practical advice as opposed to deep theoretical results. In other words, we’re doing experimental physics here, not theoretical physics. There is no one “silver bullet” solution that is guaranteed to give the best result in all cases.
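For what it’s worth, such an experiment could be sketched as a hypothetical "generalized RMSprop" with a tunable exponent p. This generalization is purely illustrative (the name and parameter are my own, not a published algorithm); p = 2 recovers the standard update, and you could compare other values empirically:

```python
import numpy as np

def general_rmsprop_step(w, dw, s, p=2.0, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop-like update tracking |dW|^p and dividing by s^(1/p).

    p=2 is standard RMSprop; other exponents are an experiment.
    """
    s = beta * s + (1 - beta) * np.abs(dw) ** p
    w = w - lr * dw / (s ** (1.0 / p) + eps)
    return w, s

w2, s2 = general_rmsprop_step(1.0, 2.0, 0.0, p=2.0)  # standard RMSprop
w3, s3 = general_rmsprop_step(1.0, 2.0, 0.0, p=3.0)  # experimental exponent
```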

Which is all a longwinded way of saying that I don’t really know the answers here. My suggestion would be to watch the relevant lectures again with your questions in mind. If you don’t find what you are looking for there, try reading some of the original papers that describe the techniques.