Can someone help explain this line?

This line is mentioned in Week 2’s optional lab for feature scaling and learning rate: “when generating the plot, the normalized features were used. Any predictions using the parameters learned from a normalized training set must also be normalized.”

Does this mean that the value our model predicts will need to be de-normalized? Because right after this quote, the lab has us predict a price for a house, and it does not show any de-normalization.

No, I think the point is that if the model is trained on normalized data, then when you want to make a prediction with new data, that input data needs to be normalized in the same way. The model only "understands" normalized data, so that's all it can handle as input. For example, if you are doing "mean normalization" and predicting on a single sample, it doesn't make sense to use the "mean" of a single sample. You have to save the normalization parameters (\mu and \sigma) computed from the training set and reuse them at prediction time.
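A minimal sketch of what I mean (the feature values here are made up, not from the lab): compute \mu and \sigma once from the training set, then apply those same saved values to any new input before predicting.

```python
import numpy as np

# Hypothetical training features: column 0 = square feet, column 1 = bedrooms
X_train = np.array([[2104, 5],
                    [1416, 3],
                    [852,  2],
                    [1534, 4]], dtype=float)

# Save the normalization parameters from the TRAINING set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma

# A single new house to predict on: normalize it with the SAVED mu and sigma,
# not with statistics computed from this one sample (which would be meaningless).
x_new = np.array([1650, 3], dtype=float)
x_new_norm = (x_new - mu) / sigma
```

The model would then be trained on `X_train_norm` and fed `x_new_norm`; the prediction it produces is still a price in ordinary dollars.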

I should make the disclaimer that I have not taken MLS, so I’m not sure what Prof Ng says in the lectures there. I would hope that he said something about this and that it would be along the lines of what I said above.


There wasn’t much mention of it in the actual lectures, but as stated earlier it was mentioned in a lab. My current interpretation is that if we normalize the training set as a whole (not just the features), then we would need to de-normalize the prediction, because the model was trained with normalized y-values.

No, the predictions are the predictions, right? What would it mean to “denormalize” them? You don’t normalize the y values, right? Those are just prices in dollars in this case.

The point of normalization is that it affects only the input feature data, and its purpose is to make training work better (converge faster). Notice that the different features have wildly divergent scales: number of bedrooms is a number between 1 and 6, whereas square feet is in the range of hundreds to thousands, right? That makes the solution surface have very steep slopes in some dimensions and shallow slopes in others, which makes it hard for Gradient Descent to converge efficiently. Normalizing all the input features to have (say) \mu = 0 and \sigma = 1 gives a much better behaved surface and easier (in most cases) convergence.
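To make the scale disparity concrete, here is a small sketch (again with made-up values): the raw per-feature ranges differ by orders of magnitude, while after z-score normalization every column has mean 0 and standard deviation 1.

```python
import numpy as np

# Made-up feature matrix: column 0 = square feet, column 1 = bedrooms
X = np.array([[2104, 5],
              [1416, 3],
              [852,  2],
              [1534, 4]], dtype=float)

# Raw ranges are wildly different, so gradient descent sees a stretched surface
ranges = np.ptp(X, axis=0)  # e.g. ~[1252, 3]

# Z-score normalization puts every feature on the same scale
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
# X_norm now has per-column mean ~0 and standard deviation 1
```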


Yes, I agree with you. What I meant is that the statement refers to general cases where people may have normalized the target data as well before fitting a model. The statement mentioned does not refer to the specific example they demonstrated in the lab.

So, if we were to normalize the training set as a whole, only then would we need to de-normalize the prediction. Otherwise, if we only normalized the features and not the targets, we would not need to de-normalize the output.
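For the hypothetical case where someone did normalize the targets too (not what the lab does), de-normalizing the model's output is just the inverse of the z-score transform. A sketch with invented numbers:

```python
import numpy as np

# Hypothetical: suppose the target prices were also z-score normalized
y_train = np.array([460.0, 232.0, 178.0, 315.0])  # prices in $1000s (made up)
mu_y = y_train.mean()
sigma_y = y_train.std()
y_train_norm = (y_train - mu_y) / sigma_y

# A model trained on y_train_norm would output a normalized prediction,
# which must be mapped back to the original units:
y_pred_norm = 0.5                       # stand-in for a model's raw output
y_pred = y_pred_norm * sigma_y + mu_y   # de-normalized price in $1000s
```

In the lab's actual setup, only the features are normalized, so this inverse step never comes up.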

For all intents and purposes, I agree that we need not normalize the target values in the first place (and thus, not need to de-normalize the output) as demonstrated in the lab.

I have never seen a case in which the target (label) values or the output of a network was normalized. You apply an output activation function (e.g. sigmoid in the case of a binary classification) and then a cost function (cross entropy for a classification or MSE for a regression problem). But you never “normalize” the output. At least I’ve never seen an instance of that. If you have seen references to that, please give us a link so that I can investigate further.

No, I have not seen any such case. I just have this statement to go on, which says "Any predictions using the parameters learned from a normalized training set must also be normalized."

If this hadn’t been mentioned, I would have assumed the result was the actual price of the house (since we didn’t normalize the target value, as you previously mentioned).

Maybe it meant that if we were to plot price against a normalized feature, we would need to normalize the price in order to make a good plot?

I think you are just misinterpreting that statement; I explained what they really meant in my first response on this thread. You only need to normalize the features in order to make the prediction, because the model is only trained on normalized feature data.


Okay, I think I now understand it perfectly. Thank you for the help and for keeping up with this query! :slight_smile: