I cannot get the expected results for applying regularization to linear regression in Python. The weights get bigger with each loop iteration and not smaller as I expect.
Can anyone check my code and see where I am going wrong?
Comments:
You have a really large lambda value, and a very small learning rate, and only 50 iterations.
Some experimentation seems to be worthwhile.
Rule of thumb: for fixed-rate gradient descent to work well numerically, the features should all be within about an order of magnitude of 1.0 (so roughly between -3.0 and +3.0).
The features in your example (including the polynomial terms) vary from -300 to 33,215 (or thereabouts).
So you might try pre-computing the polynomial terms (this makes X a matrix of size (m x p), where 'p' is the number of polynomial features you're adding), then normalizing X, and then running gradient descent.
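Here is a minimal sketch of that pipeline, assuming a 1-D input array x, a target vector y, and a chosen polynomial degree (these names and the toy data are my own, not taken from your notebook):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Stack x, x**2, ..., x**degree as columns -> X of shape (m, p) with p = degree."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def zscore_normalize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Toy data just for illustration
x = np.array([6.0, 8.0, 10.0, 13.0])
y = np.array([7.0, 9.0, 13.0, 17.0])

X_poly = polynomial_design_matrix(x, degree=4)  # raw polynomial terms can be huge
X_norm, mu, sigma = zscore_normalize(X_poly)    # columns now land roughly in [-3, 3]
```

After this, every column of X_norm is on the same scale, so a single fixed learning rate works for all of the weights.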
You're encountering a common issue in machine learning, especially with polynomial regression: exploding gradients. This happens when your features have very large values, which in turn produce extremely large values in the gradient calculation.
With gradients in the range of 10^6 to 10^9, your current learning rate of 0.001 is far too large. Easy fix: set the learning rate to 1e-9 and increase the number of epochs. The better fix, as the other mentors suggested, is to scale your features.
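To see the scale of the problem, look at the size of a single update step, $\alpha \cdot \text{gradient}$ (order-of-magnitude figures only, not values from your actual run): with a gradient around $10^9$, a learning rate of 0.001 gives a step of roughly $10^{-3} \times 10^{9} = 10^{6}$ per iteration, which immediately blows the weights up, whereas $10^{-9} \times 10^{9} = 1$ keeps each step at a manageable size.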
Another problem is the weight decay factor $1 - \alpha \frac{\lambda}{m}$: here $\alpha \frac{\lambda}{m} = 0.001 \times \frac{10000}{4} = 0.001 \times 2500 = 2.5$.
So the factor becomes $1 - 2.5 = -1.5$, and your weight update rule effectively contains the component $w_j \leftarrow -1.5 \cdot w_j - \text{gradient\_descent\_term}$, which flips the sign of $w_j$ and grows its magnitude by a factor of 1.5 on every iteration. Decreasing the learning rate also fixes this problem, but decreasing $\lambda$ is the better solution.
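For reference, that decay factor comes from rearranging the usual L2-regularized update rule (I'm assuming your code implements the standard regularized mean-squared-error cost; adjust if yours differs):

$w_j \leftarrow w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \big(f_{w,b}(x^{(i)}) - y^{(i)}\big) x_j^{(i)} + \frac{\lambda}{m} w_j \right] = \left(1 - \alpha \frac{\lambda}{m}\right) w_j - \frac{\alpha}{m} \sum_{i=1}^{m} \big(f_{w,b}(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$

For the regularization to shrink the weights, $1 - \alpha \frac{\lambda}{m}$ must stay in $(0, 1)$, which requires $\alpha \frac{\lambda}{m} < 1$; with $\alpha = 0.001$ and $m = 4$ that means $\lambda < 4000$ (and in practice much smaller).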
When you have normalized features, you can use a larger learning rate (since the 'exploding gradient' problem has been avoided), and then you will need fewer epochs.
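Putting the pieces together, here is a minimal sketch of L2-regularized gradient descent run on the normalized features from the earlier snippet (X_norm and y come from there; the learning rate of 0.1, lambda of 1.0, and 1000 iterations are illustrative starting points, not tuned values):

```python
import numpy as np

def gradient_descent_regularized(X, y, alpha=0.1, lam=1.0, num_iters=1000):
    """Batch gradient descent for L2-regularized linear regression.

    X is the normalized (m, p) feature matrix, y the (m,) target vector.
    """
    m, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(num_iters):
        err = X @ w + b - y                       # prediction error, shape (m,)
        grad_w = (X.T @ err) / m + (lam / m) * w  # regularized gradient for w
        grad_b = err.mean()                       # bias term is not regularized
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

w, b = gradient_descent_regularized(X_norm, y)
```

With scaled features the gradients stay small, the weight decay factor $1 - \alpha \frac{\lambda}{m}$ stays close to 1, and the weights shrink gently instead of exploding.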