Choosing an appropriate learning rate for gradient descent

During the course we are presented with a way of choosing an appropriate learning rate alpha for gradient descent. That is how I have been doing it in my personal exercises: I set a fairly high reference value, say 0.1, run for N iterations, and once they are done I visualize the cost-per-iteration graph and decide whether I should increase or decrease the value.
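For reference, here is a minimal sketch of that workflow, assuming a simple one-feature model and made-up toy data (not from the course):

```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Fit y ≈ w*x + b with batch gradient descent, recording the cost per iteration."""
    w, b = 0.0, 0.0
    costs = []
    for _ in range(n_iters):
        error = w * x + b - y
        costs.append((error ** 2).mean() / 2)   # mean squared error / 2
        w -= alpha * (error * x).mean()         # dJ/dw
        b -= alpha * error.mean()               # dJ/db
    return w, b, costs

# Hypothetical toy data: y = 3x + 2 plus noise
x = np.linspace(0, 1, 50)
y = 3 * x + 2 + np.random.randn(50) * 0.1

w, b, costs = gradient_descent(x, y, alpha=0.1, n_iters=1000)
plt.plot(costs)                                 # inspect this curve, then adjust alpha
plt.xlabel("iteration"); plt.ylabel("cost")
plt.show()
```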

However, this can be time-consuming before you arrive at a sensible value. For instance, I always compare the slope and intercept determined by my N attempts of gradient descent to those of scikit-learn's LinearRegression/LogisticRegression models, as well as predictions made with my own method versus the sklearn `.predict` method.

Somehow scikit-learn always gets slightly better values for the slope and intercept (not by a large margin). Looking at the code, it uses a separate optimization library.

Looking at StatQuest, we can find values for the slope and intercept using the following formula:

x.mean * y.mean - (x * y).mean
----------------------------------------------  = slope
(x.mean)^2 - (x^2).mean

and

y.mean - slope * x.mean = y_intercept

This gives values nearly identical to what scikit-learn returns, and the predictions are almost exactly the same.
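As a rough sketch, the formula above can be checked against scikit-learn on the same kind of made-up toy data (the names and values below are just for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 1, 50)
y = 3 * x + 2 + np.random.randn(50) * 0.1

# Closed-form slope and intercept from the means, as in the formula above
slope = (x.mean() * y.mean() - (x * y).mean()) / (x.mean() ** 2 - (x ** 2).mean())
intercept = y.mean() - slope * x.mean()

# Compare with scikit-learn's LinearRegression
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(slope, intercept)
print(model.coef_[0], model.intercept_)
```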

What I wanted to know is how well this method applies in different scenarios compared to running gradient descent. Or is there a better way to determine the slope and intercept values other than manually re-running gradient descent?

One thing I have been doing, so that I don't have to sit at the PC, is to run gradient descent and calculate R^2 from the result. Based on that R^2 I do the next runs, increasing or decreasing the learning rate and the number of iterations. I repeat this N times, then come back and see where I ended up, and it is usually somewhat close to scikit-learn.
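A minimal sketch of that loop, reusing the hypothetical `gradient_descent` helper and the toy `x`, `y` data from the first sketch:

```python
def r_squared(y, y_hat):
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Score a small grid of learning rates and iteration counts with R^2
best = None
for alpha in [0.001, 0.01, 0.1, 0.3]:
    for n_iters in [100, 1000, 5000]:
        w, b, _ = gradient_descent(x, y, alpha=alpha, n_iters=n_iters)
        score = r_squared(y, w * x + b)
        if best is None or score > best[0]:
            best = (score, alpha, n_iters)
print("best R^2 %.4f with alpha=%s, iters=%s" % best)
```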

Hi @Kristjan_Antunovic,

Linear regression can be solved analytically, meaning that you can literally solve it with some simple algebra, which is essentially the formula you quoted. You may google the "Normal Equation" for a more general form of the formula. Some packages use SVD-based methods to solve the linear regression problem instead of the Normal Equation, and this is why you sometimes find that different packages give very similar but not identical solutions, even when none of them is using gradient descent.
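For example, a minimal NumPy sketch on made-up data, showing both the Normal Equation and an SVD-based least-squares solve:

```python
import numpy as np

x = np.linspace(0, 1, 50)
y = 3 * x + 2 + np.random.randn(50) * 0.1

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^(-1) X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares (the kind of solver many packages rely on)
theta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)   # [intercept, slope]
print(theta_svd)
```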

Because you can solve it with the Normal Equation, gradient descent is not strictly necessary here, unless your dataset is simply too big to fit into your computer's memory. In that case you will need gradient descent, which can be configured to handle only a small subset of your data at a time.
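As a sketch of that idea, scikit-learn's `SGDRegressor` can consume the data chunk by chunk through `partial_fit` (the data below is made up just to illustrate):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
x_all = rng.random((10_000, 1))
y_all = 3 * x_all.ravel() + 2 + rng.normal(0, 0.1, 10_000)

# Feed the data in chunks, as if it did not all fit into memory
sgd = SGDRegressor(learning_rate="constant", eta0=0.01)
for _ in range(20):                                   # a few passes over the "stream"
    for start in range(0, len(x_all), 500):
        sgd.partial_fit(x_all[start:start + 500], y_all[start:start + 500])

print(sgd.coef_, sgd.intercept_)                      # should end up near slope 3, intercept 2
```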

However, if it is not linear regression, there is no guarantee that an analytical solution exists. In fact, in most cases there isn't one, and then gradient descent is always a valid option to go for.
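For instance, logistic regression has no closed-form solution, so the weights have to be found iteratively; a minimal sketch with synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(0, 0.5, 200) > 0).astype(float)   # synthetic binary labels

w, b, alpha = 0.0, 0.0, 0.1
for _ in range(5000):
    p = 1 / (1 + np.exp(-(w * x + b)))   # sigmoid prediction
    w -= alpha * ((p - y) * x).mean()    # gradient of the mean log loss w.r.t. w
    b -= alpha * (p - y).mean()          # gradient of the mean log loss w.r.t. b

print(w, b)
```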

Cheers,
Raymond


Thank you so much. @rmwkwok you are awesome.

You are welcome Kristjan. Thanks for sharing your findings with us too 🙂