I ran the cells of C1_W2_Lab05_Sklearn_GD_Soln.ipynb with a different dataset. I shuffled the data and split it into X_train and X_test, ran sgdr.fit(X_train, y_train), and predicted values for X_test with y_pred_sgd = sgdr.predict(X_test). That all worked fine.

After that I added normalization
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
That seemed to break things. In particular, I got predicted y values like 4.19e+09, 5.92e+09, 5.50e+09 (the correct range would be between 1 and 100). So I suspect that the values of w_norm and b_norm, which I took from the scikit-learn attributes, are wrong:
b_norm = sgdr.intercept_
w_norm = sgdr.coef_

I ran the same data through the gradient descent algorithm
w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 80, alpha = 2e-2)
With this code I get predicted y values like 23.38, 29.12, 23.97, which are well within the correct range.

I even fed the normalized X into the scikit-learn model with
sgdr.fit(X_norm, y_train), which didn't help.

I have no idea how to dig into scikit-learn, and I feel like I'm losing confidence that it works correctly.

Does someone have an idea what could be the cause here?

It seems you are going a different way from the optional lab C1_W2_Lab05_Sklearn_GD_Soln, with different steps and a different dataset, so I can only comment in a general way.

This is likely because you trained the model on the unnormalized data but predicted with the normalized data. We usually do it the other way around: (1) train/test split, (2) normalize the training data, (3) train on the normalized data, (4) predict with normalized data.
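The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the poster's dataset), and the name X_test_norm is made up here. The point it shows is that the scaler is fitted on the training data only and then reused, unchanged, to transform the test data before predicting.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): three features on very different
# scales, targets roughly in the range 10-100.
rng = np.random.default_rng(0)
X = rng.uniform([0, 0, 0], [3000, 5, 50], size=(1000, 3))
y = 10 + 0.02 * X[:, 0] + 3 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, 1000)

# (1) train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# (2) normalize the training data; the scaler learns mean/std from X_train only
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)

# (3) train on the normalized data
sgdr = SGDRegressor(max_iter=1000, random_state=0)
sgdr.fit(X_norm, y_train)

# (4) predict with test data normalized by the SAME fitted scaler
X_test_norm = scaler.transform(X_test)  # hypothetical name, not from the lab
y_pred_sgd = sgdr.predict(X_test_norm)

print(y_pred_sgd[:3])
print(y_test[:3])
```

Mixing the two (fitting on X_train but predicting on scaler.transform(X_test), or the reverse) feeds the model inputs on a scale it never saw, which is one way to end up with predictions that are off by many orders of magnitude.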

Great, these are the correct steps. Normalize first, then train with the normalized data.

I don’t see any problem; you have done it in the correct order.

Actually, the proper way to diagnose this is to look at your data first, but there is no harm in making a few suggestions, so I did.

The weights showed that training was diverging, and one way to control that is to lower the learning rate. I set learning_rate='constant' because I do not want to rely on the default schedule but on my own eta0 value. Setting eta0 = 0.00001 is pretty arbitrary; the key is that it has to be small.

Now the warning says it has not converged yet, so you may increase eta0 step by step. Each time, multiply eta0 by 10 (0.0001) and see if it diverges; if it does not, multiply by 10 again (0.001), and so on until it diverges. Use the largest eta0 that does not diverge.
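A rough sketch of that search, assuming synthetic data and a short max_iter chosen only to keep the demo fast. The loop tries eta0 values a factor of 10 apart with learning_rate='constant' and records the training mean squared error for each. The try/except is a precaution: as far as I know, a badly diverging fit can raise a ValueError about floating-point overflow, but treat that as an assumption and also watch for very large or non-finite errors.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption), normalized before training.
rng = np.random.default_rng(0)
X = rng.uniform([0, 0, 0], [3000, 5, 50], size=(1000, 3))
y = 10 + 0.02 * X[:, 0] + 3 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, 1000)
X_norm = StandardScaler().fit_transform(X)

results = {}
for eta0 in [1e-5, 1e-4, 1e-3, 1e-2]:
    sgdr = SGDRegressor(
        learning_rate="constant", eta0=eta0, max_iter=50, random_state=0
    )
    try:
        with warnings.catch_warnings():
            # Small eta0 values are expected to hit max_iter without converging.
            warnings.simplefilter("ignore", ConvergenceWarning)
            sgdr.fit(X_norm, y)
        mse = float(np.mean((sgdr.predict(X_norm) - y) ** 2))
    except ValueError:
        # Assumed divergence signal: record it as infinite error.
        mse = float("inf")
    results[eta0] = mse
    print(f"eta0={eta0:g}  epochs={sgdr.n_iter_ if np.isfinite(mse) else '-'}  mse={mse:.3f}")
```

On data like this, the smallest eta0 is still far from the minimum after 50 epochs, while the larger values get much closer; the eta0 to keep is the largest one whose error is still finite and not growing.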

Then ask yourself how small you want the errors to be, and set an appropriate tol value. After that, increase max_iter until it converges. If you cannot make it converge by increasing max_iter, reduce n_iter_no_change from len(X_norm) step by step, halving it each time.

Lastly, don’t forget to check the sklearn documentation for the meaning of those SGDRegressor parameters, and then try to understand the process I described above.

Thanks for the clarification. My confidence is restored. Actually, it seems to me that scikit-learn does a good job. It was important for me to experience that it cannot perform magic. So if you want the model to converge, you have to choose the learning rate, just as you have to with the gradient_descent algorithm.

My data has 200000 training examples. So I wondered which value of n_iter_no_change you would suggest here? I am not totally confident about how this and the tol value work together.

Anyway, thank you very much for your help. I appreciate it!

No algorithm is magical. It is not a plug-and-play thing. It takes an understanding of the data, experiments, and other hard work.

For n_iter_no_change, I would really suggest that you try different values yourself and see how they affect your metrics. You may start with these choices: 10, 100, 1000, 10000 to get a feel for it.
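One way to run such an experiment, sketched on small synthetic data. The values 1, 5, 20 and the settings eta0=0.001, tol=1e-3, max_iter=200 are assumptions picked so the demo finishes quickly; they are not recommendations for a 200000-example dataset. The fitted attribute n_iter_ reports how many epochs were actually run before stopping.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption), normalized before training.
rng = np.random.default_rng(0)
X = rng.uniform([0, 0, 0], [3000, 5, 50], size=(1000, 3))
y = 10 + 0.02 * X[:, 0] + 3 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, 1000)
X_norm = StandardScaler().fit_transform(X)

patience_values = [1, 5, 20]  # demo-sized choices, not the 10..10000 above
epochs_used = []
for patience in patience_values:
    sgdr = SGDRegressor(
        learning_rate="constant",
        eta0=0.001,
        tol=1e-3,
        n_iter_no_change=patience,
        max_iter=200,
        random_state=0,  # same seed, so only the stopping rule differs
    )
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        sgdr.fit(X_norm, y)
    mse = float(np.mean((sgdr.predict(X_norm) - y) ** 2))
    epochs_used.append(sgdr.n_iter_)
    print(f"n_iter_no_change={patience:3d}  epochs={sgdr.n_iter_:3d}  mse={mse:.4f}")
```

Comparing the epochs and the error side by side shows what the extra patience buys: if the error barely changes while the epoch count keeps growing, the larger value is mostly costing training time.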

As for how tol and n_iter_no_change work together, as sklearn says,

If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs

You may consider tol independently. If you think a cost improvement of 0.000001 is unimportant, then your tol can be set to 0.000001 or higher.

Yes, you may. A tol of 0.0001 is like an error of 0.01 (because the cost is a squared error), which roughly means you want your average error to improve by at least 0.01 (note that this translation is not mathematically rigorous; it is only a rough one). Since your y values are normally in the range 1-100, an error improvement of less than 0.01 should be negligible in normal cases.

That would be the max_iter parameter. We don’t rely on just one stopping criterion.

iter 1: 99.1
iter 2: 99.2
iter 3: 98.7
iter 4: 98.5

If we set tol to 0.1, then from iter 1 to iter 2 the cost does not improve by at least tol. In that case, if n_iter_no_change = 1, SGD stops. If n_iter_no_change > 1, SGD keeps going, but the counter is set to 1, meaning the tol threshold has been missed one consecutive time. From iter 2 to iter 3, the cost drops by more than tol (0.1), so the counter is reset to zero and SGD proceeds to iter 4.
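The walk-through above can be replayed in a few lines of plain Python. This is a simplified sketch of the documented rule (stop when loss > best_loss - tol for n_iter_no_change consecutive epochs), not scikit-learn's actual source; the function name epochs_run is made up for illustration.

```python
def epochs_run(losses, tol, n_iter_no_change):
    """Return the 1-based epoch at which training would stop, or None
    if the stopping rule never triggers on the given losses."""
    best_loss = float("inf")
    no_improvement = 0  # consecutive epochs that failed to beat best_loss by tol
    for epoch, loss in enumerate(losses, start=1):
        if loss > best_loss - tol:
            no_improvement += 1
        else:
            no_improvement = 0  # enough improvement: reset the counter
        best_loss = min(best_loss, loss)
        if no_improvement >= n_iter_no_change:
            return epoch
    return None


costs = [99.1, 99.2, 98.7, 98.5]  # the four costs from the example

print(epochs_run(costs, tol=0.1, n_iter_no_change=1))  # 2
print(epochs_run(costs, tol=0.1, n_iter_no_change=2))  # None
```

With n_iter_no_change = 1 the run stops at iter 2, because 99.2 is not at least 0.1 below the best cost so far (99.1). With n_iter_no_change = 2 the counter reaches 1 at iter 2, is reset at iter 3, and the rule never triggers within these four iterations. Note that each epoch is compared against the best cost seen so far; no average is taken.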

Just to make sure: if I think of tol as the target that an improvement must reach to count as sufficient, I would think of n_iter_no_change as the number of iterations over which the average is taken to decide whether my target is met. I think words are worse than your example for understanding it.