I understand the need for scaling and how it is carried out. But I assumed the target values were also being scaled, which they were not. Is it a matter of choice, or are there other factors to be considered when it involves target variables?
Also, based on this I have 2 other questions which might be out of scope.
Is feature scaling applicable to other ML problems like forecasting based on time series?
If feature scaling is applicable in the above question, do we then only scale the inputs/features (without scaling target values)?
Hello @Axleblz, they are great questions. Let me give them a try and everyone please correct me if you have different views!
Some factors to consider when using a neural network:
matching between the activation function in the output layer and the range of the target variable - e.g. ReLU produces only non-negative numbers; Sigmoid only values between 0 and 1.
in the case of using a linear activation, whether you have a bias term enabled to shift the output into the range of the target values. E.g. if your target is distributed between 1000 and 3000 with a mean of 2000, you probably want to enable the bias term and train it to somewhere near 2000. However, if you rescale your target range to have a mean of 0, you might be able to disable the bias term and train one less parameter.
if your target values are too large, computationally we can encounter a numerical overflow issue during the training process when the prediction overestimates the targets.
I think it’s generally not harmful to scale the target values, though as you can see, the considerations above are neither common nor devastating problems. Not scaling the features, however, can make it very hard to converge to the optimal solution.
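For illustration, here is a minimal numpy sketch (with made-up numbers) of z-score scaling the targets, along the lines of the bias-term point above, and of mapping a prediction back to the original range if the original units are needed:

```python
import numpy as np

# Made-up target values in the 1000-3000 range, as in the example above
y = np.array([1200.0, 1800.0, 2100.0, 2500.0, 2900.0])

# z-score scaling: shift to mean 0, divide by the standard deviation
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std            # now centred around 0

# After training on y_scaled, a (placeholder) model output can be mapped
# back to the original 1000-3000 range:
y_pred_scaled = 0.3                         # pretend this came from the model
y_pred = y_pred_scaled * y_std + y_mean
print(y_scaled, y_pred)
```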
I will consider this from the angle of which modeling assumptions/methods/techniques we use. If we do time series forecasting using decision tree-based methods (e.g. xgboost), we don’t need to do feature scaling; but using a neural network (including linear/logistic regression), we do.
Distance-based methods (especially unsupervised ones) need feature scaling, e.g. k-means and k-NN, so that a feature can’t dominate the distance simply by having a much larger range. PCA also needs it because it compares features’ variances, which depend on how large the ranges are.
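To make the distance point concrete, a tiny made-up example: the feature with the much larger range swamps the Euclidean distance until both features are standardized.

```python
import numpy as np

# Two made-up samples: feature 0 is "rooms" (range ~1-10),
# feature 1 is "price" (range ~100000-900000)
a = np.array([3.0, 250000.0])
b = np.array([8.0, 260000.0])

# Unscaled distance is dominated by the price feature
print(np.linalg.norm(a - b))               # ~10000; the 5-room gap is invisible

# Standardize each feature with (made-up) per-feature means and stds
means = np.array([5.0, 400000.0])
stds  = np.array([2.0, 200000.0])
a_s, b_s = (a - means) / stds, (b - means) / stds
print(np.linalg.norm(a_s - b_s))           # both features now contribute comparably
```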
Again, it depends on the modeling method. If it is a neural network, regardless of whether it is a time series forecast or not, we might want to go through my considerations above and any others that I may have missed. If it is decision tree-based, we don’t need any scaling.
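If you want a quick sanity check that tree-based methods don’t care about feature scales, a rough sketch like this (made-up data; rescaling by a power of 2 so the comparison is exact) should print True:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))            # made-up features
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=100)   # made-up target

# Fit one tree on the raw features and another on rescaled features
tree_raw    = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X * 1024, y)

# The split thresholds simply scale along with the features,
# so the two trees make the same predictions
print(np.allclose(tree_raw.predict(X), tree_scaled.predict(X * 1024)))
```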
You are welcome @Axleblz. I seldom see discussion of rescaling the target variables, so those are what I can think of. If you find some other interesting or opposing views in your learning journey, please share them with us.
Hi @rmwkwok, taking advantage of this question I would like to ask: in linear regression do we only use scaling for the multi-variable case, or can we also use it on the univariate case? Today I was working on what I learned about univariate linear regression and tried to apply scaling to the input data and well… I got different prediction values.
I think there are 2 reasons we want to scale features:
to keep features in similar scales - comparing scales of any 2 features used in the current model
to keep features in familiar scales - comparing scales used in training any model
I think it’s almost necessary to do scaling for multiple linear regression, mostly for reason number 1, because this way all the weights will be moved (updated / learnt) towards their optimal values at a similar pace, which is good for learning speed and efficiency.
Although it is much less important to do scaling for univariate linear regression, I would still recommend it for reason number 2, because problems of similar scales can be dealt with in a similar way; for example, we can use a similar learning rate for all problems sharing similar scales. This helps us reuse our experience in hyperparameter (e.g. learning rate) tuning for future problems.
In short, I recommend rescaling in any problem.
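To illustrate with the univariate case, a rough sketch (made-up numbers) of z-score scaling the single feature before gradient descent could look like the following; note that a new input has to be scaled with the same mean and standard deviation before predicting.

```python
import numpy as np

# Made-up univariate data: house size in sq. ft vs price in $1000s
x = np.array([800.0, 1500.0, 2300.0, 3000.0])
y = np.array([160.0, 290.0, 450.0, 580.0])

# z-score scale the single feature; keep the mean/std for later use
x_mean, x_std = x.mean(), x.std()
x_scaled = (x - x_mean) / x_std

# With the scaled feature, a learning rate like 0.1 converges comfortably;
# on the raw feature a much smaller rate would be needed to avoid divergence.
w, b, alpha = 0.0, 0.0, 0.1
for _ in range(1000):
    err = w * x_scaled + b - y
    w -= alpha * (err * x_scaled).mean()
    b -= alpha * err.mean()

# Predict for a new size: scale it with the SAME mean and std
x_new_scaled = (2000.0 - x_mean) / x_std
print(w * x_new_scaled + b)
```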
I can’t tell you why the predictions are different just from your description, but I would check at least 2 things that I can think of at this moment:
They converged - if you use a gradient descent based algorithm, you need to make sure the number of iterations is large enough for the weights (or better, the loss) to converge. Scaling or not makes a difference to the number of iterations required for a given learning rate.
You apply the same feature scaling for testing samples - once we scale features for training samples, we remember those scaling factors and use them to scale features of testing samples, so that both training and testing samples receive the same treatment.
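In code, point 2 might look like this (made-up numbers); the key is that mu and sigma come from the training data only and are reused for the test data:

```python
import numpy as np

# Made-up training and test features (2 features each)
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test  = np.array([[1.5, 200.0]])

# Compute the scaling factors from the TRAINING data only
mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
# Reuse the SAME mu and sigma for the test data; do not recompute them
X_test_scaled  = (X_test - mu) / sigma
print(X_train_scaled, X_test_scaled)
```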
Thank you for giving additional reasons behind feature scaling. I had some hardships when I wrote a program in Python using Selenium to predict house prices from information gathered on the internet. The problem was that I could not reach convergence at first, but then I got it by applying scaling. However, when I apply mean normalization feature scaling, my target values differ from what they actually are. How can I rescale the target values with the mean normalization or z-score method? Hope this question is appropriate. Better to ask rather than leaving it as it is. Thank you in advance.
If you have not scaled your target values, how would your target values become different? They won’t change by themselves. You must have done something to them. What have you done?
Got it, if there is no need to rescale, I will check my code. And I guess I have to practice on the lab; I have found a practical example there. Thank you very much for the quick response!
OK. @Otabek_Nurmatov, here is a general workflow for your reference:
Given a training dataset, we have the features X and the targets (also called labels) y. Then we apply feature normalization to X only, and y is untouched.
Then we fit the model with the normalized X and the y.
When making a prediction for a new sample X_new, we normalize X_new with the same method, the same mean, and the same standard deviation parameters that we used to normalize X.
Then we feed the normalized X_new into the trained model, and we get our predictions y_pred. y_pred does not need to be scaled back.
So, if we follow this pretty standard workflow, we only normalize and never un-normalize; also, we only normalize X, and we do not normalize y.
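If it helps, here is a rough sketch of that workflow with made-up numbers, using sklearn's StandardScaler and LinearRegression:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Made-up training data: X has 2 features, y is left untouched
X = np.array([[800.0, 2.0], [1500.0, 3.0], [2300.0, 4.0], [3000.0, 5.0]])
y = np.array([160.0, 290.0, 450.0, 580.0])

# 1. Normalize X only
scaler = StandardScaler().fit(X)
X_norm = scaler.transform(X)

# 2. Fit the model with the normalized X and the untouched y
model = LinearRegression().fit(X_norm, y)

# 3. Normalize a new sample with the SAME scaler, then predict
X_new = np.array([[2000.0, 3.0]])
y_pred = model.predict(scaler.transform(X_new))   # no un-scaling needed
print(y_pred)
```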
Let me know if you have any questions after practicing with the lab.
Aha, so the variables that we feed to the model should also be normalized. That really makes sense. I had used the parameters beta0 and beta1 with non-normalized new variables, so my dependent variable deviated wildly from the actual price. Thank you Raymond!!!
Hello! Two years after the post, I would like to ask a question. Is it wrong to also scale the y train data (targets for training)? What I have understood so far is that we fit the scaler on the training data and then we use this scaler to transform both training and testing data. Then by using this scaler, we also apply inverse_transform for our predictions to unscale them. Is this a wrong approach?
As Tom said, it is usually not necessary. Take linear regression as an example: the bias term will take care of the mean of y_train, and the weights will take care of the variance. Unless you don’t want the bias and the weights to become too large because of the labels’ scale, there is no need to scale the labels.
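That said, if you do choose to scale the labels as you described, the steps you outlined are consistent. A rough sketch (made-up numbers), with the predictions mapped back via inverse_transform:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X_train = np.array([[800.0], [1500.0], [2300.0], [3000.0]])
y_train = np.array([[160.0], [290.0], [450.0], [580.0]])   # 2-D for the scaler

# Separate scalers for X and y, both fitted on the training data only
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

model = LinearRegression().fit(
    x_scaler.transform(X_train), y_scaler.transform(y_train)
)

# Predictions come out in the scaled space, so map them back
X_new = np.array([[2000.0]])
y_pred_scaled = model.predict(x_scaler.transform(X_new))
y_pred = y_scaler.inverse_transform(y_pred_scaled)
print(y_pred)
```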