Choosing appropriate learning rate for gradient descent

During the course we are presented with a way of choosing an appropriate learning rate alpha for gradient descent. It's how I have been doing it in my personal exercises: I set a reference value, say 0.1, run for N iterations, and after all iterations are done I visualize the cost-vs-iteration graph and decide whether I should increase or decrease the value.

However, this can be time-consuming before you reach a sensible value. For instance, I always compare the slope and intercept determined by N attempts of gradient descent to those of scikit-learn's LinearRegression/LogisticRegression models, as well as comparing predictions from my own method against scikit-learn's predict method.

Somehow scikit-learn always gets better values for the slope and intercept (not by a large margin). Looking at its code, it uses a separate optimization library.

Looking at StatQuest we can find values for slope and intercept using the following formula:

        x.mean * y.mean - (x * y).mean
slope = ------------------------------
          (x.mean)^2 - (x^2).mean

y_intercept = y.mean - slope * x.mean

This gives nearly identical slope and intercept values to what scikit-learn returns, and the predictions match almost exactly.
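To see the comparison concretely, here is a minimal sketch of the closed-form formula quoted above, checked against scikit-learn's LinearRegression. The data points are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares fit (the formula quoted above)
slope = (x.mean() * y.mean() - (x * y).mean()) / (x.mean() ** 2 - (x ** 2).mean())
intercept = y.mean() - slope * x.mean()

# scikit-learn's solution for comparison
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, model.coef_[0])        # nearly identical
print(intercept, model.intercept_)  # nearly identical
```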

What I wanted to know is how well this method applies in different scenarios compared to running gradient descent. Or is there a better way to determine the slope and intercept values than manually re-running gradient descent?

One thing I have been doing so that I don't have to sit at the PC is to run gradient descent and calculate R^2 from the result. Based on R^2, I do the next runs with an increased or decreased learning rate and number of iterations. I repeat this N times, then come back to see where I got, and it's usually somewhat close to scikit-learn.
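That loop can be automated as a rough sketch: try a small grid of learning rates, run gradient descent for each, and keep the fit with the best R^2. The helper names, learning rates, and data here are illustrative, not from the course material.

```python
import numpy as np

def gradient_descent(x, y, alpha, iters):
    """Plain batch gradient descent for y = m*x + b, minimizing MSE."""
    m = b = 0.0
    n = len(x)
    for _ in range(iters):
        pred = m * x + b
        # Gradients of mean squared error w.r.t. slope and intercept
        dm = (2 / n) * np.dot(pred - y, x)
        db = (2 / n) * np.sum(pred - y)
        m -= alpha * dm
        b -= alpha * db
    return m, b

def r_squared(y, pred):
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Run each candidate learning rate, keep the (alpha, slope, intercept)
# triple whose fit scores the highest R^2
best = max(
    ((alpha, *gradient_descent(x, y, alpha, 1000)) for alpha in [0.001, 0.005, 0.01, 0.05]),
    key=lambda t: r_squared(y, t[1] * x + t[2]),
)
print(best)  # (alpha, slope, intercept) of the best run
```

Note that a learning rate that is too large for the data's scale will diverge, so the candidate grid still needs a sensible upper bound.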

Hi @Kristjan_Antunovic,

Linear regression can be solved analytically, meaning that you can literally solve it with some simple algebra, which is basically the formula you have quoted. You may google the "Normal Equation" for a more general form of the formula. Some packages use SVD-based methods to solve the linear regression problem instead of the Normal Equation, and this is why you sometimes find different packages giving very similar but not identical solutions, even though none of them use gradient descent.
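For reference, the general form mentioned here is theta = (X^T X)^(-1) X^T y. A quick NumPy sketch of both the Normal Equation and the SVD-based route (which is what `np.linalg.lstsq` uses internally), on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: solve (X^T X) theta = X^T y
# (np.linalg.solve is preferred over forming an explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, the route many packages take internally
theta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)      # [intercept, slope]
print(theta_svd)  # nearly identical
```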

Because you can solve it with the Normal Equation, gradient descent is not always necessary here, unless your dataset is just too big to fit into your computer's memory. In that case you will need gradient descent, which can be configured to handle only a small subset of your data at a time.
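The "small subset at a time" idea is mini-batch gradient descent. A minimal sketch (batch size, learning rate, and epoch count here are illustrative choices, not prescriptions):

```python
import numpy as np

def minibatch_gd(x, y, alpha=0.01, epochs=1000, batch_size=2, seed=0):
    """Mini-batch gradient descent for y = m*x + b: each update only
    touches batch_size rows, so the full dataset never needs to be
    processed at once."""
    rng = np.random.default_rng(seed)
    m = b = 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            pred = m * xb + b
            m -= alpha * (2 / len(xb)) * np.dot(pred - yb, xb)
            b -= alpha * (2 / len(xb)) * np.sum(pred - yb)
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(minibatch_gd(x, y))  # hovers near the analytical slope/intercept
```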

However, if it is not linear regression, there is no guarantee that an analytical solution exists; in fact, in most cases one does not. There, gradient descent is always a valid option.



Thank you so much. @rmwkwok you are awesome.

You are welcome Kristjan. Thanks for sharing your findings with us too 🙂