Learning rate - course notes

Gradient Descent | Coursera

The learning rate is presented as an arbitrary value (how far we move at each step). It seems to me that this value should be as big as possible (take a big step), as long as we don't overshoot the lowest point. In that case, there could (should?) be an easy way to shrink it so that we don't pass the minimum. Said otherwise, there should be an uncomplicated way to compute this value automatically, without an extra arbitrary human choice that may cause bad or inefficient training.

1 Like

It’s a little more complicated than that. The problem is that the pictures people draw are too simple relative to what cost surfaces really look like. Taking big steps is dangerous because you are very likely to get divergence and not convergence. Here’s a paper about visualizing cost surfaces from Yann LeCun’s group.

You can try some experiments using large learning rates and compare the behavior across a range of learning rate values. Of course what works on a given problem may not work on a different one. You’ll get a couple of examples of solution networks in Week 4 of Course 1. Once you finish the programming assignments, you can then use them as a framework for trying this kind of experimentation.

There are also more sophisticated techniques than using a fixed learning rate, which Prof Ng will show us in Course 2 of this series.

2 Likes

Thanks!
However, my point was precisely that, to avoid picking the value ourselves and risking divergence, the system could first take a big step, and then, if that step turns out to be too big, adapt it (make the learning rate smaller) just enough. I will check the solutions you mentioned later in the course and come back here if needed.

Meanwhile, the paper you mentioned doesn't seem to relate to the classes I am talking about. In the class, we are clearly talking about a convex log-loss function (i.e. one where we can find the global minimum without running into any local minima).
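For reference, the cost function from that lecture is the cross-entropy cost, which is convex for logistic regression (written here in the course's notation):

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right], \qquad \hat{y}^{(i)} = \sigma\left(w^\top x^{(i)} + b\right)
$$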

1 Like

Hello @vmandrilly,

Think about this: how do we know a step is not too big? We only have the following information:

  1. The cost value at our current position.

  2. The gradients at our current position. Theoretically, gradient values only describe an infinitesimally small neighbourhood around that position.
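A toy illustration of point 2 (a sketch, not from the lecture):

```python
# Even on a perfectly convex bowl J(w) = w^2, the gradient only tells
# us the downhill direction locally: a step that is too big increases
# the cost instead of decreasing it.
def J(w):  return w ** 2
def dJ(w): return 2 * w

w = 1.0
for lr in (0.1, 0.5, 1.5):
    w_new = w - lr * dJ(w)
    print(f"lr={lr}: J goes from {J(w):.2f} to {J(w_new):.2f}")
# lr=0.1 -> 1.00 to 0.64; lr=0.5 -> 1.00 to 0.00; lr=1.5 -> 1.00 to 4.00
```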

Cheers,
Raymond

1 Like

Yes, there are adaptive techniques, although Prof Ng does not really discuss those. He does cover “learning rate decay” techniques in Course 2 Week 2.
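As a quick preview of the decay idea: one common schedule simply shrinks the rate as the epochs go by. This sketch is only illustrative, and the hyperparameter names here are mine:

```python
alpha0 = 0.1      # initial learning rate
decay_rate = 1.0  # decay hyperparameter

for epoch in range(5):
    alpha = alpha0 / (1 + decay_rate * epoch)
    print(f"epoch {epoch}: alpha = {alpha:.4f}")
# alpha shrinks: 0.1000, 0.0500, 0.0333, 0.0250, 0.0200
```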

We can take Raymond’s points and try to think through your “take a big step and then calculate” approach. Suppose you pick some “big” starting value and take one step in the direction of the negative gradient. Then you would calculate the cost J at your new location. How do you decide what to do next? If the cost went up, then what do you do? Start over or continue from the new worse location? Also note that if you take a big step, the cost might go down, but that could be because you jumped off the surface and onto a different “fold” in the surface. Did you take a look at that paper I linked above about visualizing cost surfaces?

1 Like

Hi! A step is too big when the cost increases instead of decreasing, which seems easy to detect. In that case, we calculate how much the learning rate needs to shrink so that the cost no longer increases (it looks like a simple multiplication or division) and continue with the new value.
You mentioned adaptive solutions, so it seems some people have already come up with an optimal one.
I did look at the paper (I edited my first post, so you might have missed it), and it is not related to the problem of the current lesson, which focuses on logistic regression with one single 'neuron'. The lecture states that the cost function of this sigmoid unit is convex, with no local minima (only one global minimum); see Logistic Regression Cost Function | Coursera @ min 3.18.

For cases with multiple minima (which, again, is not the case in our lesson), I have no clue, but if we use gradient descent we surely can't rely on a single starting point, or we are likely to get stuck in a local minimum. We could first constrain all parameter values to lie between, say, 0 and 1 with a small transformation (if possible?), then start a gradient descent at regular intervals (e.g. at (0.25, 0.25), (0.25, 0.5), (0.25, 0.75), (0.5, 0.25), (0.5, 0.5), (0.5, 0.75), (0.75, 0.25), (0.75, 0.5), (0.75, 0.75) for a low scanning resolution with 2 parameters), then increase or decrease the resolution according to the complexity of the landscape: if the function looks simple after testing (few local minima), we can save computation by starting gradient descent from fewer places (decreasing the scanning resolution); if there are more local minima, we can increase the scanning resolution for better accuracy.
Those are just a few thoughts upon seeing those graphs, with no idea about how things really work. I also guess that overall we don't want to try too hard to find the exact global minimum (the data is likely to be noisy anyway) and that we want to be fast (we would rather go through a lot of data than squeeze everything out of less data...)
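A rough sketch of that multi-start idea, assuming a 2-parameter cost function (the toy function and all names here are made up for illustration; in practice random restarts are more common than a regular grid, and in high dimensions a grid becomes infeasible):

```python
import itertools
import numpy as np

def multi_start_descent(cost, grad, resolution=3, lr=0.05, steps=300):
    # Run plain gradient descent from a grid of starting points in
    # (0, 1)^2 and keep the best end point found.
    ticks = np.linspace(0.0, 1.0, resolution + 2)[1:-1]  # 0.25, 0.5, 0.75
    best_w, best_cost = None, np.inf
    for start in itertools.product(ticks, ticks):
        w = np.array(start)
        for _ in range(steps):
            w = w - lr * grad(w)
        if cost(w) < best_cost:
            best_w, best_cost = w, cost(w)
    return best_w, best_cost

# Toy non-convex surface with several local minima:
f = lambda w: np.sin(5 * w[0]) + np.sin(5 * w[1]) + w[0]**2 + w[1]**2
df = lambda w: np.array([5 * np.cos(5 * w[0]) + 2 * w[0],
                         5 * np.cos(5 * w[1]) + 2 * w[1]])
print(multi_start_descent(f, df))
```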

Let me think about this. We are trying to predict something. We want to tweak some parameters to make the error smaller, and we know that if we change a parameter just a little, the error will increase or decrease. Say the error is 10 and you see that increasing the parameter by 0.1 reduces the error by 1. Instead of moving by 0.1 and getting a little closer, you may want to move by 1, hoping the relationship is roughly linear so the error drops by 10 (a perfect match). Of course it is not going to be that simple: after moving by 1 you might end up with a (maybe much) larger cost instead. In that case you go back to the best value so far (the previous one) and try again with a smaller step. Every time the cost goes up, it means the step was too big, so you reduce it, until the cost barely moves.
And how much do you reduce the step when the error (= cost) increases? Roughly in proportion to how badly the cost exploded instead of decreasing.
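Something like this sketch (the names and the toy function are mine, just to make the idea concrete; for simplicity it uses a fixed shrink factor rather than one proportional to the cost increase, and this is essentially what is called a backtracking line search):

```python
def descend_with_backtracking(cost, grad, w, lr=1.5, shrink=0.5, steps=60):
    # Take a step; if the cost went up, go back to the previous point
    # and retry with a smaller learning rate. Note this simple version
    # only ever shrinks the rate; it never grows it back.
    for _ in range(steps):
        g = grad(w)
        w_new = w - lr * g
        while cost(w_new) > cost(w):  # step was too big
            lr *= shrink              # shrink and retry from the old w
            w_new = w - lr * g
        w = w_new
    return w

# Toy convex example: J(w) = w^2, dJ/dw = 2w
print(descend_with_backtracking(lambda w: w**2, lambda w: 2*w, w=5.0))
# ends up very close to the minimum at w = 0
```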

Please remember that even in the case of Logistic Regression the cost function is non-linear, so this may not be as simple as you say.

The Logistic Regression case is a special case and not really that interesting. Prof Ng introduces it as a “trivial” neural network with only one layer. All the rest of these courses are dealing with multi-layer neural networks which do not have convex cost functions and the paper I linked is exactly what we are dealing with, so we need to consider general solutions that work in those cases. Solutions that only work for the special case of Logistic Regression are not interesting.

Okay! This starts to get interesting. Agreed that it is not going to be that simple, but the idea is to try N step sizes before deciding which one to take. This is reasonable, but then we need to consider other factors, such as:

Computational time

With the constant learning rate approach, it takes a forward pass and a backward pass of a batch of samples.

With the new approach being considered, it takes a forward pass, a backward pass, plus N additional forward passes of a batch of samples. For example, with N = 4 trial step sizes, that is five forward passes plus one backward pass per update, instead of one of each.

Stochastic nature of mini-batch gradient descent

If we examine the cost function carefully, we know that the cost is a function of the trainable parameters and is also parametrized by the training dataset. In other words, if we supply a different dataset, the cost surface changes. In fact, we do supply a different dataset at each mini-batch gradient descent step, because every mini-batch is different, with, generally speaking, more variation for smaller mini-batch sizes.
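A small numerical illustration of this, with synthetic data (everything here is made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)  # synthetic labels

def logistic_cost(w, Xb, yb):
    # Cross-entropy cost of a logistic unit on one batch
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return -np.mean(yb * np.log(p) + (1 - yb) * np.log(1 - p))

w = np.array([0.5, -0.5])  # the same parameters throughout
for i in range(4):         # four mini-batches of 64 samples
    batch = slice(64 * i, 64 * (i + 1))
    print(f"mini-batch {i}: cost = {logistic_cost(w, X[batch], y[batch]):.4f}")
# The cost differs from batch to batch: each mini-batch defines a
# (slightly) different cost surface for the very same parameters.
```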

The question is then: if we set N large, we spend additional computational time finding a good step for the current cost surface (constructed from the current mini-batch), but that surface changes at the next mini-batch, and there is no way to tell whether the N extra passes are worth it for the subsequent mini-batches or for the problem as a whole.

Cheers,
Raymond

Up to this point, there was no question of "mini-batches". We are working on a sigmoid unit with a convex cost function to minimize over all examples. I think you may be referring to a way of computing faster by not going through the whole dataset to get a first estimate; I guess this will come later in the lessons...
What we are trying to do, whether you start with a 'random', 'by feel', or 'gradually decreasing' learning rate (and, I guess, whether you work on 1 example at a time, on the full dataset, or on a batch of examples at a time), is to minimize the number of steps (= calculations) (I am not sure if that is what you call N). If the steps are too small, we are almost guaranteed to make progress, but we need many iterations to get close to the minimal cost value. If the steps are too big, the cost increases, and we have to restart from the previous value without having made progress. In both cases, it seems to me that tracking how much improvement (or degradation) the current step produced (and maybe previous steps too) should influence, i.e. help us better guess, how big a step to try (or retry) next time.
Since it has been mentioned that there are automatic ways to dynamically calculate the best learning rate, I guess this is a solved problem and that we will learn about it later in the course.

My N was the result of a misunderstanding of your idea; with the extra detail in your last reply, that N is irrelevant. Please ignore my last reply.

Course 2 introduces some adaptive algorithms, and you will see how they are based on somewhat different rationales.

Cheers,
Raymond