This is my first notebook, and I have started coding up logistic regression from scratch.
How do I start engineering the features? How can I make gradient descent run so that the loss keeps decreasing and converges? And how can I further improve the training data to help my model work well?
Hi @Shashank_Garg
You did pretty good work here. To answer your question, you should refer to machine learning diagnostics, where you assess the bias and variance of your model; there are 6 ways you can improve your result (3 for bias and 3 for variance). All of these are explained very well in Course 2.
The only thing I would add to the other replies is the following:
You asked how you can guarantee that gradient descent converges. Actually for logistic regression there is a theorem that guarantees this as long as the step size is small enough (technically this is because the cost function is convex and has Lipschitz continuous gradient… but you don’t need to worry about the details on that). Just know that if you pick a small enough step size, you will get convergence (this is true for both linear and logistic regression). In your code you have alpha = 0.00001 and this is apparently too large.
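To make that concrete, here is a minimal sketch (synthetic data and made-up variable names, not code from the notebook) showing that with a small enough step size the cost of logistic regression keeps decreasing:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # synthetic features, just for illustration
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # synthetic labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b = np.zeros(X.shape[1]), 0.0
alpha = 0.01                                 # small enough step size for this data
m = X.shape[0]
for i in range(1000):
    p = sigmoid(X @ w + b)
    w -= alpha * (X.T @ (p - y)) / m         # gradient step for the weights
    b -= alpha * np.mean(p - y)              # gradient step for the bias
    if i % 200 == 0:
        print(i, cost(w, b))                 # the cost keeps decreasing, never bounces back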
Thank you so much, Sir. For the dataset I chose, a learning rate of 1e-5 was initially still too high, but after I did feature scaling as suggested by Raymond Sir, the model performed well even with a learning rate of 1e-5.
Now the only part that remains: is the choice of learning rate dependent on the data, or is something in the range of 1e-3 generally preferred once some transformations are applied to the data?
Thank you so much, Sir, for this insight. I performed scaling, but gradient descent turns out to run extremely slowly. Any suggestions?
I have edited the link posted in the thread.
I see you defined a scale_features function here, but I do not see where you use it. Where did you use it? However, I do see that the cost now keeps decreasing instead of bouncing back as it did previously. I guess you used it but it is somehow not shown in the notebook? Maybe you want to check the notebook too?
A steadily decreasing cost is a good sign and an improvement over bouncing back, and I suppose you have noticed that too. Noticing differences is very important. After you apply feature scaling, you may try increasing your learning rate if training is too slow. There are 3 points I want to make about scaling:
1. When you scale features, you have some scaling constants, right? For example, min_vec. Those constants come from your training set only.
2. We need to scale the data for testing or prediction in the same way we scaled the training data, so we keep the constants derived from the training set and then use them to scale the test set.
3. It's very good that you wrote your own scaling function, but it is also a good exercise to try out the ones in sklearn: StandardScaler and MinMaxScaler. These are basic tools, and it's good to know them too; see the short sketch below.
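For example, a minimal sketch of points 1 and 2 with sklearn (the tiny arrays below are placeholders for your own train/test split):
import numpy as np
from sklearn.preprocessing import MinMaxScaler   # or StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])   # placeholder training features
X_test = np.array([[1.5, 300.0]])                                # placeholder test features

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # constants (per-feature min/max) come from the training set only
X_test_scaled = scaler.transform(X_test)         # the same constants are reused on the test set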
Dear Sir, once again thank you so much for taking the time to go through the notebook.
Sir, actually during debugging I somehow didn't call the scale_features function, and I'm sorry for that. After calling it, I noticed that alpha could really be increased by over 50,000 times, and therefore gradient descent works faster!
I’ll surely get my hands dirty with scikit methods too!
Thank you so much Sir!!
Glad to help! To answer your question, the learning rate does depend on the dataset. There is a formula in the case of logistic regression (not discussed in this course, but true nonetheless) that says that convergence is guaranteed if the learning rate \alpha is (less than or) equal to \frac{1}{L}, where
L = \frac{1}{\sqrt{m}}||\hat{X}||_2
where m is the number of rows in X, the matrix \hat{X} is just the matrix X with a column of all 1’s added as a new leftmost column, and ||\hat{X}||_2 is the 2-norm of \hat{X}. You can compute this as follows:
import numpy as np
from numpy import linalg as LA

m = X.shape[0]                             # number of training examples (rows of X)
ones = np.ones((m, 1))
Xhat = np.concatenate((ones, X), axis=1)   # X with a new leftmost column of 1's
L = LA.norm(Xhat, 2) / np.sqrt(m)          # L = ||Xhat||_2 / sqrt(m), matching the corrected formula
alpha = 1 / L                              # learning rate that guarantees convergence
Best,
Alex
edit: I should also say that this gives the smallest value you would ever need to set your learning rate to. You can of course experiment with larger values, but this is a good starting point. You could increase your learning rate from here if convergence at this rate is too slow.
edit 2: corrected the formula (I wrote \frac{1}{m} originally, but it should be \frac{1}{\sqrt{m}}).
I have read your comment about feature scaling. It is very true that we have one learning rate for all features. However, if some features have much larger ranges than others, ideally we would want a different learning rate for each feature, which, as we all know, isn't feasible with the gradient descent algorithm we are using.
The fatal problem with the learning rate is that when it is too large, the cost diverges. Since the features have different ranges, a learning rate that is just right for one feature may be too large for others, and that is exactly what we have to avoid so we don't run into that fatal problem.
That's why we are left with one option: decrease the learning rate until it is not too large for any feature. The cost then won't diverge, but at the price of slow learning. This is why we scale the features so that they all have very similar ranges; then we don't need to compromise the learning rate because of the features with large ranges. As a result, the cost doesn't diverge and the model learns faster.
Knowing that the learning rate is closely related to the features' ranges, always normalizing your features gives you the advantage of being able to use similar learning rates across problems. Of course I can't guarantee the exact same learning rate, but you will have a good sense of what learning rate to start with. Where does that sense come from? From your understanding of the learning rate, and from your experience. Keep experimenting.
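As a rough illustration, here is a small sketch (synthetic data with deliberately different feature ranges) that reuses @aachandler's 1/L estimate from above as a yardstick for the safe learning rate before and after min-max scaling:
import numpy as np
from numpy import linalg as LA

def estimate_alpha(X):
    # @aachandler's estimate: alpha = 1/L with L = ||Xhat||_2 / sqrt(m)
    m = X.shape[0]
    Xhat = np.concatenate((np.ones((m, 1)), X), axis=1)
    return np.sqrt(m) / LA.norm(Xhat, 2)

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200),       # a small-range feature
                     rng.uniform(0, 5000, 200)])   # a large-range feature

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # min-max scaling

print("estimated alpha before scaling:", estimate_alpha(X))        # tiny, limited by the 0-5000 feature
print("estimated alpha after scaling :", estimate_alpha(X_scaled)) # orders of magnitude larger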
I hope you are still trying things out on this notebook. If you run out of ideas, I would suggest implementing this logistic regression in TensorFlow as a single-layer neural network with a sigmoid activation function. But please do this in a different notebook so we can stay focused on this implementation.
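In case it helps when you get there, a minimal sketch of that single-layer network in Keras (X_train_scaled and y_train are placeholders for your own scaled features and labels):
import tensorflow as tf

# one Dense unit with a sigmoid activation is exactly logistic regression
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train_scaled, y_train, epochs=100, verbose=0)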
Cheers,
Raymond
PS: @aachandler's idea is super interesting indeed. At the very least, it's a good idea to keep the code for estimating a good learning rate and to try it on your next 10 datasets, both before and after feature scaling, keeping a log of your trials. The log can be a table with 2 columns: the estimated learning rate before scaling, and the estimated learning rate after scaling. Of course, it is totally up to you.
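If it helps, such a log could be as simple as this sketch (pandas; the dataset name is a placeholder, fill in your own estimates):
import pandas as pd

# placeholder log entry; replace with the estimates from your own trials
rows = [
    {"dataset": "dataset_1", "alpha_before_scaling": None, "alpha_after_scaling": None},
]
log = pd.DataFrame(rows)
log.to_csv("learning_rate_log.csv", index=False)
print(log)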