C1_W2 scilearn does not work with feature normalization on X_test

Holger_Prokoph · August 2, 2022, 4:37pm

Hi,

I ran the cells of C1_W2_Lab05_Sklearn_GD_Soln.ipynb and used a different dataset. I shuffled the data and divided the set into X_train and X_test. I ran sgdr.fit(X_train, y_train) and predicted some values of X_test. I did y_pred_sgd = sgdr.predict(X_test) which all worked fine.

After that I added normalization
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
That just seemed to break the functionality. In particular, I got values for the predicted ys like 4.19e+09 5.92e+09 5.50e+09 (while the correct range would be between 1 and 100) So I suspect, that the value w_norm and b_norm are wrong which I took from the scikit-Variables
b_norm = sgdr.intercept_
w_norm = sgdr.coef_

I ran the same data through the gradient descent algorithm
w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 80, alpha = 2e-2)
with this code I get predicted ys like 23.38 29.12 23.97 which are perfectly in correct range.

I even feeded the normalized X into the scikit-Learner
sgdr.fit(X_norm, y_train) which didn´t help.

I have no idea how to dig into scikit and feel like loosing confidence, that it works just fine.

Does someone have an idea what could be the cause here?

rmwkwok · August 3, 2022, 1:23am

Seems that you are going a different way from the optional lab of C1_W2_Lab05_Sklearn_GD_Soln: a different steps and a different dataset, so I can only comment on it in a general way.

This is likely because you had trained the model with the unnormalized data, but predicted with the normalized data. We usually do this reversely … (1) train/test split (2) normalize training data (3) train on normalized data (4) predict with normalized data

Great, this is the correct steps. Normalize first, train with normalized data.

I don’t see any problem, you have done it in the correct order.

Holger_Prokoph · August 3, 2022, 6:47am

Than you rmwkwok for your answer!

To narrow down the problem, I am skipping the Splitting of the data.

First a working example:

train sgdr with train data

sgdr.fit(X_train, y_train)

predict values

y_pred_sgd = sgdr.predict(X_train)

This worked!

Now I go with feature normalization:

Normalize X_train

scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)

run regression on X_Norm

sgdr = SGDRegressor(max_iter=100)
sgdr.fit(X_norm, y_train)

3.Look at parameters

b_norm = sgdr.intercept_
w_norm = sgdr.coef_

try to predict values

y_pred_sgd = sgdr.predict(X_norm)
y_pred_sgd = sgdr.predict(X_train)

The prediction fails on both datasets. I get values for y_pred_sgd way too big.

rmwkwok · August 3, 2022, 7:01am

What are the parameters’ values for the case that max_iter = 100, and for the case that max_iter = 200?

Holger_Prokoph · August 3, 2022, 7:19am

For max_iter = 200 they are

w_norm
-8.70e+08 -2.45e+09 -9.21e+08  4.27e+09 -8.51e+09  1.69e+09  1.74e+09
  1.03e+09 -1.74e+08  1.40e+08 -4.12e+08 -4.14e+07  3.48e+08 -4.39e+08
b_norm
-3.39e+08

For max_iter = 100 they are

w_norm
-1.36e+08  1.41e+09 -8.28e+08 -4.77e+09  1.02e+09 -2.16e+06 -1.26e+08
  2.17e+09 -1.86e+08 -4.96e+08  3.85e+08  1.16e+08 -1.27e+08  2.37e+08
b_norm 
2448777.42

They look very much alike, and the sgdr prints that it is only iterating 20 times.

rmwkwok · August 3, 2022, 7:21am

What would happen to the parameters’ values if you set max_iter = 100, learning_rate='constant', eta0=0.00001, n_iter_no_change=len(X_norm) ?

Holger_Prokoph · August 3, 2022, 7:31am

That worked!

I get a warning from sgdr saying

ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

w_norm
-0.09 -0.03  0.07  0.15  0.02  0.06  0.23  0.29  0.6   0.86  0.75  0.26
  0.16 -0.18 -0.16 -0.44 -0.56 -0.38 -0.33 -0.37 -0.29 -0.35 -0.11 -0.13

when I predict values with this w, I get a reasonable prediction range

y_pred_sgd: [31.23 30.34 26.56

Can you elaborate on the parameters you passed to the sgdr please?

rmwkwok · August 3, 2022, 7:41am

Actually the proper way to diagonise is to look at your data first, but no harm to make a few suggestions and so I did it.

The weights showed that they were diverging, and to control them, one way is to lower the learning rate. Setting learning_rate=‘constant’ because i do not want to rely on the default algorithm but my own eta0 value. Setting eta0 = 0.00001 is pretty arbitary - the key is it has to be small.

Now the warning says it doesn’t converge yet, you may increase the eta0 step by step. Each time, you increase the eta0 by 10 times (0.0001) , see if it diverges, if not diverge, increase another 10 times (0.001) until it diverges. Use the largest eta0 that does not diverge.

Then ask yourself how small do you want the errors to be, and set an appropriate tol value, after that, increase the max_iter until it converges. If you cannot make it converge by increasing max_iter , reduce n_iter_no_change=len(X_norm) step by step where at each step you reduce by a half.

Lastly, don’t forget to check the sklearn documentation for the meaning of those SGDRegressor parameters, and then try to understand the process I describe in above.

Good luck and keep trying!
Raymond

Holger_Prokoph · August 3, 2022, 8:56am

Hi Raymond,

thanks for clarification. My confidence is restored. Actually it seems to me, that scilearn does a good job. It was important for me to experience that it cannot perform magic. So if you want to model to converge, you have to choose the learning rate just like you have to do with the gradient_descent algorithm.

My data has a size of 200000 learning examples. So I wondered, which value of n_iter_no_change you would suggest here? Not totally confident how this and the tol-value work together.

Anyway, thank you very much for your help. I appreciate it!

Bye

rmwkwok · August 3, 2022, 11:00am

No algorithm is magical. It is not a plug-and-use thing. It is about understanding of data, experiments, and other hard works.

For n_iter_no_change, I would really suggest you to try different values yourself and see how it affects your metrics. You may start with these choices: 10, 100, 1000, 10000 to get a taste of it.

For how tol and n_iter_no_change works together, as sklearn says,

If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs

You may consider tol independently. If you think an improvement of cost of the size 0.000001 is unimportant, then your tol can be set to 0.000001 or higher.

Holger_Prokoph · August 4, 2022, 7:49am

Hi Raymond,

thank you for your answer. Such a pity that it is not magical. But still it is great to reach successes by these algrithms.

Can I picture the tol-parameter like the definition of a goal?

If I don´t reach the improvement of tol in n_iter_no_change steps, my algorithm is allowed to stop?

Regards,

Holger

rmwkwok · August 4, 2022, 7:58am

Yes, you may. A tol of 0.0001 is like an error of 0.01 (because cost is squared error), which can be translated to you want your averaged error improvement to be somewhere at least 0.01 (Note that this translation is not mathematically rigoureus, it’s just a rough translation). Since your y values are normally in the range of 1-100, an error improvement of less than 0.01 should be negligible in normal case.

That would be by the max_iter parameter. We don’t rely on just one stopping criteria

Holger_Prokoph · August 4, 2022, 8:07am

Since your y values are normally in the range of 1-100, an error improvement of less than 0.01 should be negligible in normal case.

But the improvement of my error by the squareRoot of tol has to take place in one iteration or in more iterations?

rmwkwok · August 4, 2022, 8:17am

in this number of iterations: n_iter_no_change.

At each iteration, the cost changes. For example:

iter 1: 99.1
iter 2: 99.2
iter 3: 98.7
iter 4: 98.5

If we set tol to be 0.1, then from iter 1 to iter 2, tol is not satisfied, and in this case if n_iter_no_change = 1, then the SGD would have been stopped. if n_iter_no_change > 1, then SGD won’t stop and keep going, but the counter is set to 1, meaning that the threshold of tol has not been met for one consecutive time. From iter 2 to iter 3, it finds that the cost drops more than tol which is 0.1, the counter is reset to zero, and proceed to iter 4.

Holger_Prokoph · August 4, 2022, 8:22am

I think I got it now.

Thank you

Regards

Holger

Holger_Prokoph · August 5, 2022, 6:05am

Just to make sure. If I think of tol as a goal for my improvement to be sufficient, I would think of the n_iter_no_change as the number of iterations where the average is taken of to compute if my goal is met. I think words are worse than your example to understand it.

Take care

Holger

Topic		Replies	Views
Scikit Regression comparison of train vs normalized X Supervised ML: Regression and Classification week-module-2	2	410	July 5, 2023
C1_W2_Lab05_Sklearn how to predict a house price? Supervised ML: Regression and Classification week-module-2	3	533	July 5, 2022
How to use SGDRegressor to get prediction for a specific input? Supervised ML: Regression and Classification week-module-2	3	513	January 5, 2023
# Optional Lab: Linear Regression using Scikit-Learn Supervised ML: Regression and Classification week-module-2	4	642	December 15, 2022
MLS C1 W2 Lab 5 question about normalization (moderator edit) Supervised ML: Regression and Classification	5	287	December 28, 2023

C1_W2 scilearn does not work with feature normalization on X_test

Related topics