C1_W2 scikit-learn does not work with feature normalization on X_test

Hi,

I ran the cells of C1_W2_Lab05_Sklearn_GD_Soln.ipynb and used a different dataset. I shuffled the data and divided the set into X_train and X_test. I ran sgdr.fit(X_train, y_train) and predicted some values of X_test. I did y_pred_sgd = sgdr.predict(X_test) which all worked fine.

After that I added normalization
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
That just seemed to break the functionality. In particular, I got values for the predicted ys like 4.19e+09 5.92e+09 5.50e+09 (while the correct range would be between 1 and 100). So I suspect that the values w_norm and b_norm, which I took from the scikit-learn attributes, are wrong:
b_norm = sgdr.intercept_
w_norm = sgdr.coef_

I ran the same data through the gradient descent algorithm
w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 80, alpha = 2e-2)
With this code I get predicted ys like 23.38 29.12 23.97, which are perfectly within the correct range.

I even fed the normalized X into the scikit-learn regressor with
sgdr.fit(X_norm, y_train)
which didn't help.

I have no idea how to dig into scikit-learn, and I feel like I am losing confidence that it works just fine.

Does someone have an idea what could be the cause here?

It seems that you are going a different way from the optional lab C1_W2_Lab05_Sklearn_GD_Soln: different steps and a different dataset, so I can only comment in a general way.

This is likely because you trained the model with the unnormalized data but predicted with the normalized data. We usually do it the other way around: (1) train/test split, (2) normalize the training data, (3) train on the normalized data, (4) predict with the normalized data.

Great, those are the correct steps. Normalize first, then train with the normalized data.

I don’t see any problem, you have done it in the correct order.
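
For reference, a minimal sketch of that order. The synthetic X and y below are only placeholders to make the sketch runnable; substitute your own data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Hypothetical data, only so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + 50 + rng.normal(scale=0.5, size=1000)

# (1) train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# (2) normalize the training data; reuse the SAME scaler on the test data
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)   # fit the scaler on the training set only
X_test_norm = scaler.transform(X_test)         # transform (do not re-fit) the test set

# (3) train on the normalized training data
sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_train_norm, y_train)

# (4) predict with normalized data only
y_pred = sgdr.predict(X_test_norm)

The key point is that fitting on one feature scale and predicting on another (for example, fitting on X_train but predicting on X_norm, or vice versa) gives weights that are meaningless for the inputs actually used at prediction time.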

Thank you rmwkwok for your answer!

To narrow down the problem, I am skipping the splitting of the data.

First a working example:

1. Train sgdr with the training data:
sgdr.fit(X_train, y_train)
2. Predict values:
y_pred_sgd = sgdr.predict(X_train)

This worked!

Now I go with feature normalization:

1. Normalize X_train:
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
2. Run the regression on X_norm:
sgdr = SGDRegressor(max_iter=100)
sgdr.fit(X_norm, y_train)
3. Look at the parameters:
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
4. Try to predict values:
y_pred_sgd = sgdr.predict(X_norm)
y_pred_sgd = sgdr.predict(X_train)

The prediction fails on both datasets: I get values for y_pred_sgd that are way too big.
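
For reference, the four steps above consolidated into one runnable sketch. The synthetic X_train and y_train here are only placeholders, so whether the blow-up reproduces depends on the real data; printing the largest coefficient magnitude is a quick way to see whether the fit has diverged.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Hypothetical stand-ins for the real X_train / y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 14))
y_train = rng.uniform(1, 100, size=500)

# 1. normalize X_train
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)

# 2. run the regression on X_norm
sgdr = SGDRegressor(max_iter=100)
sgdr.fit(X_norm, y_train)

# 3. look at the parameters
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
print("largest |w|:", np.abs(w_norm).max())  # values around 1e9 would indicate divergence

# 4. try to predict values (only X_norm matches what the model was trained on)
y_pred_sgd = sgdr.predict(X_norm)
print(y_pred_sgd[:3])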

What are the parameters' values for max_iter = 100, and for max_iter = 200?

For max_iter = 200 they are

w_norm
-8.70e+08 -2.45e+09 -9.21e+08  4.27e+09 -8.51e+09  1.69e+09  1.74e+09
  1.03e+09 -1.74e+08  1.40e+08 -4.12e+08 -4.14e+07  3.48e+08 -4.39e+08
b_norm
-3.39e+08

For max_iter = 100 they are

w_norm
-1.36e+08  1.41e+09 -8.28e+08 -4.77e+09  1.02e+09 -2.16e+06 -1.26e+08
  2.17e+09 -1.86e+08 -4.96e+08  3.85e+08  1.16e+08 -1.27e+08  2.37e+08
b_norm 
2448777.42

They look very much alike, and sgdr reports that it only iterated 20 times.

What would happen to the parameters’ values if you set max_iter = 100, learning_rate='constant', eta0=0.00001, n_iter_no_change=len(X_norm) ?
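
In code, the suggestion reads roughly as below. Only the four keyword arguments come from the suggestion; X_norm and y_train are assumed to already exist.

from sklearn.linear_model import SGDRegressor

# Constant, very small learning rate; n_iter_no_change set so large that
# the tol-based early stopping effectively never triggers within max_iter.
sgdr = SGDRegressor(
    max_iter=100,
    learning_rate='constant',
    eta0=0.00001,
    n_iter_no_change=len(X_norm),
)
sgdr.fit(X_norm, y_train)
print(sgdr.intercept_, sgdr.coef_)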

That worked!

I get a warning from sgdr saying

ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

w_norm
-0.09 -0.03  0.07  0.15  0.02  0.06  0.23  0.29  0.6   0.86  0.75  0.26
  0.16 -0.18 -0.16 -0.44 -0.56 -0.38 -0.33 -0.37 -0.29 -0.35 -0.11 -0.13

When I predict values with this w, I get a reasonable prediction range:

y_pred_sgd: [31.23 30.34 26.56

Can you elaborate on the parameters you passed to the sgdr please?

Actually, the proper way to diagnose this is to look at your data first, but there is no harm in making a few suggestions, so that is what I did.

The weights showed that they were diverging, and one way to control them is to lower the learning rate. I set learning_rate='constant' because I do not want to rely on the default schedule but on my own eta0 value. Setting eta0 = 0.00001 is pretty arbitrary - the key is that it has to be small.

Now that the warning says it hasn't converged yet, you may increase eta0 step by step. Each time, increase eta0 by a factor of 10 (0.0001) and see if it diverges; if it does not diverge, increase it by another factor of 10 (0.001), and so on until it diverges. Use the largest eta0 that does not diverge.
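
A rough sketch of that search, assuming X_norm and y_train already exist; the divergence check (largest weight magnitude above an arbitrary threshold) and the candidate values are only illustrations:

import numpy as np
from sklearn.linear_model import SGDRegressor

best_eta0 = None
for eta0 in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:      # increase eta0 by 10x each step
    sgdr = SGDRegressor(max_iter=100, learning_rate='constant', eta0=eta0,
                        n_iter_no_change=len(X_norm))
    sgdr.fit(X_norm, y_train)
    if np.abs(sgdr.coef_).max() > 1e6:           # crude divergence check
        break                                    # stop at the first eta0 that diverges
    best_eta0 = eta0                             # largest eta0 so far that did not diverge

print("largest non-diverging eta0:", best_eta0)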

Then ask yourself how small you want the errors to be and set an appropriate tol value; after that, increase max_iter until it converges. If you cannot make it converge by increasing max_iter, reduce n_iter_no_change=len(X_norm) step by step, halving it at each step.

Lastly, don't forget to check the sklearn documentation for the meaning of those SGDRegressor parameters, and then try to understand the process I described above. :wink:

Good luck and keep trying!
Raymond

Hi Raymond,

thanks for the clarification. My confidence is restored. Actually, it seems to me that scikit-learn does a good job. It was important for me to experience that it cannot perform magic: if you want the model to converge, you have to choose the learning rate, just like you have to with the gradient descent algorithm.

My data has 200,000 training examples, so I wondered which value of n_iter_no_change you would suggest here. I am not totally confident about how this and the tol value work together.

Anyway, thank you very much for your help. I appreciate it!

Bye

No algorithm is magical. It is not a plug-and-use thing. It is about understanding the data, running experiments, and other hard work.

For n_iter_no_change, I would really suggest you try different values yourself and see how they affect your metrics. You may start with these choices to get a taste of it: 10, 100, 1000, 10000.

For how tol and n_iter_no_change work together, as the sklearn documentation says:

If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs

You may consider tol independently. If you think an improvement in cost of size 0.000001 is unimportant, then your tol can be set to 0.000001 or higher.
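
Putting tol and n_iter_no_change together in code (X_norm and y_train are assumed to exist; the specific numbers are only examples, not recommendations):

from sklearn.linear_model import SGDRegressor

# Training stops when the loss has failed to improve by at least `tol`
# for `n_iter_no_change` consecutive epochs, or when `max_iter` is reached.
sgdr = SGDRegressor(
    max_iter=2000,
    tol=1e-6,              # improvements smaller than this count as "no change"
    n_iter_no_change=100,  # how many consecutive "no change" epochs trigger a stop
    learning_rate='constant',
    eta0=1e-4,
)
sgdr.fit(X_norm, y_train)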

Hi Raymond,

thank you for your answer. Such a pity that it is not magical. But it is still great to achieve success with these algorithms.

Can I picture the tol parameter as the definition of a goal?

If I don't reach the improvement of tol within n_iter_no_change steps, my algorithm is allowed to stop?

Regards,

Holger

Yes, you may. A tol of 0.0001 is like an error of 0.01 (because the cost is a squared error), which can be translated to: you want your averaged error improvement to be at least somewhere around 0.01 (note that this translation is not mathematically rigorous, it's just a rough translation). Since your y values are normally in the range of 1-100, an error improvement of less than 0.01 should be negligible in the normal case.
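
In symbols, that rough (and, as noted, not rigorous) translation is: error improvement ≈ sqrt(cost improvement) = sqrt(tol), and sqrt(0.0001) = 0.01.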

That would be handled by the max_iter parameter. We don't rely on just one stopping criterion :slight_smile:

Since your y values are normally in the range of 1-100, an error improvement of less than 0.01 should be negligible in the normal case.

But does the improvement of my error by the square root of tol have to take place in one iteration or over more iterations?

Within this number of iterations: n_iter_no_change.

At each iteration, the cost changes. For example:

iter 1: 99.1
iter 2: 99.2
iter 3: 98.7
iter 4: 98.5

If we set tol to be 0.1, then from iter 1 to iter 2, tol is not satisfied. In that case, if n_iter_no_change = 1, the SGD would have stopped; if n_iter_no_change > 1, the SGD won't stop and keeps going, but the counter is set to 1, meaning that the threshold of tol has not been met for one consecutive iteration. From iter 2 to iter 3, it finds that the cost drops by more than tol (which is 0.1), so the counter is reset to zero, and it proceeds to iter 4.
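
A small sketch of that bookkeeping, as a simplified illustration of the rule quoted earlier (not sklearn's actual internal code):

# Costs per iteration, taken from the example above.
losses = [99.1, 99.2, 98.7, 98.5]

tol = 0.1
n_iter_no_change = 2           # stop after this many consecutive "no improvement" epochs
best_loss = float('inf')
counter = 0

for i, loss in enumerate(losses, start=1):
    if loss > best_loss - tol:             # did not improve by at least tol
        counter += 1
    else:                                  # improved by at least tol -> reset the counter
        counter = 0
    best_loss = min(best_loss, loss)
    print(f"iter {i}: loss={loss}, no-change counter={counter}")
    if counter >= n_iter_no_change:
        print("stopping early")
        break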

I think I got it now.

Thank you

Regards

Holger

Just to make sure: if I think of tol as a goal for my improvement to be sufficient, then I would think of n_iter_no_change as the number of consecutive iterations over which that goal may go unmet before training stops. I think words are worse than your example for understanding it.

Take care

Holger