Problem getting my Python code to perform regularization of linear regression

I cannot get the expected results for applying regularization to linear regression in Python. The weights get bigger with each loop iteration and not smaller as I expect.

Can anyone check my code and see where I am going wrong?

Thanks
main.py (1.1 KB)

Hi @ai_is_cool

You’re using L2 regularization, but it should only apply to the weights w, not to the bias b. Also, your reg_lambda is far too big!

Hope it helps!

I’m not applying Regularization to b.

Is the code correct?

Can you try changing the values of \lambda, \alpha and w to get it to work and reduce the weights with each iteration?

Comments:
You have a really large lambda value, a very small learning rate, and only 50 iterations.
Some experimentation seems to be worthwhile.

Rule of thumb: For fixed-rate gradient descent to work well numerically, the features should all be within an order of magnitude of each other (so roughly between -3.0 and +3.0).

The features in your example (including the polynomial terms) vary from -300 to 33,215 (or thereabouts).

So you might try pre-computing the polynomial terms (making X a matrix of size (m x p), where ‘p’ is the number of polynomial features you’re adding), then normalizing X, then running gradient descent.
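For reference, here is a minimal sketch of that pipeline. The data and variable names are illustrative only, not taken from the attached main.py:

```python
import numpy as np

def polynomial_features(x, degree):
    """Build an (m x p) matrix whose columns are x, x**2, ..., x**degree."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

def zscore_normalize(X):
    """Standardize each column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# toy data spanning a wide range, like the original example
x = np.array([-3.0, 1.0, 5.0, 10.0])
y = np.array([9.0, 1.0, 25.0, 100.0])

X = polynomial_features(x, degree=4)     # shape (4, 4)
X_norm, mu, sigma = zscore_normalize(X)  # every column now roughly in [-3, 3]

# batch gradient descent with L2 regularization on w only (not on b)
m, p = X_norm.shape
w, b = np.zeros(p), 0.0
alpha, lam = 0.1, 1.0
for _ in range(1000):
    err = X_norm @ w + b - y
    dj_dw = (X_norm.T @ err) / m + (lam / m) * w
    dj_db = err.mean()
    w -= alpha * dj_dw
    b -= alpha * dj_db
```

Note that any new input at prediction time must be transformed with the same mu and sigma that were computed from the training set.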

I was setting \lambda= 10000 and \alpha = 0.001 because it was a value Prof. Ng set in his video lesson.

So I should be using z-score standardisation on a matrix of X and its integer powers up to 4?

It is certainly worth a try.

Do you know what particular situations, in terms of the values of X and w, regularization is best suited to combat?

Regularization is not a matter of the w and X values.

It’s a matter of the number of training examples (m) you have compared to the number of features (n).

The closer you are to 1:1, the more likely it is you will have overfitting, and the more regularization you will need to avoid it.
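As a toy numeric illustration of that m:n effect (random made-up data, not from the thread): when m is barely above n the least-squares fit chases noise, and even a modest L2 penalty shrinks the weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 5                      # m:n close to 1 -> prone to overfitting
X = rng.normal(size=(m, n))
y = rng.normal(size=m)           # pure-noise target: any fit here is overfitting

# ordinary least squares vs ridge regression (closed form, bias omitted for brevity)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The ridge weight norm is always smaller than the unregularized one for \lambda > 0, which is exactly the shrinkage effect regularization is meant to provide.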

Hi @ai_is_cool,

You’re encountering a common issue in machine learning, especially with polynomial regression: exploding gradients. This happens if your features have very large values, which then lead to extremely large magnitude values in the gradient calculation:

dj_dw = [ 0.00000000e+00 -1.14980966e+06 -1.60325406e+07 -2.06758237e+08 -2.80586600e+09]
dj_db = -99443.36742287749

With gradients in the range of 10^6 to 10^9, your current learning rate of 0.001 is far too large. The easy fix is to set the learning rate to 1e-9 and increase the number of epochs. The better fix, as other mentors suggested, is to scale your features.
Another problem is with weight decay factor 1 - \alpha \frac{\lambda}{m}:
\alpha \frac{\lambda}{m} = 0.001 \times \frac{10000}{4} = 0.001 \times 2500 = 2.5
So, the factor becomes: 1 - 2.5 = -1.5. Your weight update rule effectively contains the component w_j \leftarrow -1.5 \cdot w_j - \text{gradient_descent_term}, which flips the sign of w_j and grows its magnitude by a factor of 1.5 in each iteration. Decreasing the learning rate also fixes this problem, but decreasing \lambda is the better solution.
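The sign flip is easy to verify numerically, using the same \alpha = 0.001, \lambda = 10000, m = 4 from the thread:

```python
alpha, lam, m = 0.001, 10000, 4

decay = 1 - alpha * lam / m          # 1 - 2.5 = -1.5: sign flips, magnitude grows
print(decay)                         # -1.5

# watch |w| blow up under w <- decay * w (ignoring the gradient term)
w = 1.0
for _ in range(5):
    w = decay * w
print(w)                             # (-1.5)**5 = -7.59375

# a small lambda keeps the factor just under 1, so |w| shrinks as intended
lam_small = 1.0
decay_small = 1 - alpha * lam_small / m   # 0.99975
```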


Ok thanks. I will try that.

It’s working now. I scaled the training example values to between -3 and 3 and set the number of epochs to 5e6.

Is it typical to have to set the number of epochs to such a high value?

When you have normalized features, you can use a larger learning rate (since the “exploding gradient” problem has been avoided), and then you will need fewer epochs.


Ok thanks.

I will try that.