Order of applying feature engineering and normalization

Hi, from the example in the course, it seems that feature engineering should be performed before normalization.
But what happens if I change the order of operations: first normalization, and then feature engineering?
For example, suppose there is a feature X(i) with values in the range [-10^9, 10^9]. If I add a new feature by squaring the existing one, the range of the squared values will already be [0, 10^18].

Hi Mikhail,

Very interesting question!

  1. It is true that a floating-point number has a limited range, and we can’t go beyond it:
    a. np.float32: -3.40e+38 to 3.40e+38
    b. np.float64: -1.79e+308 to 1.79e+308

  2. However, switching the order doesn’t produce the same outcome:
    a. square first, then normalize: \frac{x^2 - mean(x^2)}{std(x^2)}
    b. normalize first, then square: \left(\frac{x - mean(x)}{std(x)}\right)^2 = \frac{x^2 - 2x \cdot mean(x) + (mean(x))^2}{(std(x))^2}

  3. Also, even if we do (2b), we will still need to normalize the result afterwards, and in general that won’t bring us back to (2a) either.

  4. Given that they don’t produce the same outcome, we need to decide which outcome we need. In linear regression, for example, we want the engineered feature to have a linear relationship with the label we are predicting, and depending on the data, we might use (2a), (2b), or neither of them.

  5. If you just want to make sure that the feature does not overflow the range limit while being engineered, you can simply divide it by a constant first, e.g. \frac{x}{10^{18}}, so that it becomes safe to work with.
    a. square first, then normalize: \frac{x^2 - mean(x^2)}{std(x^2)}
    b. scale first, then square, then normalize: \frac{\left(\frac{x}{10^{18}}\right)^2 - mean\left(\left(\frac{x}{10^{18}}\right)^2\right)}{std\left(\left(\frac{x}{10^{18}}\right)^2\right)} = \frac{x^2 - mean(x^2)}{std(x^2)}
    So such scaling makes no difference.
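
To make points 2, 3 and 5 concrete, here is a minimal NumPy sketch. The data, the seed, and the helper `standardize` are made up for illustration and are not from the course. The feature is deliberately not centered at zero: if mean(x) were exactly 0, re-normalizing (2b) would coincide with (2a).

```python
import numpy as np

def standardize(v):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    return (v - v.mean()) / v.std()

# Point 1: the largest finite values of each float type.
print(np.finfo(np.float32).max, np.finfo(np.float64).max)

# Hypothetical feature inside [-10^9, 10^9], with a clearly nonzero mean.
rng = np.random.default_rng(0)
x = rng.uniform(-2e8, 1e9, size=1000)

square_then_normalize = standardize(x ** 2)            # (2a)
normalize_then_square = standardize(x) ** 2            # (2b)
renormalized_2b = standardize(normalize_then_square)   # point 3
scale_square_normalize = standardize((x / 1e18) ** 2)  # (5b)

# (2a) vs (2b): different outcomes.
print(np.allclose(square_then_normalize, normalize_then_square))   # False
# Normalizing (2b) again still does not recover (2a).
print(np.allclose(square_then_normalize, renormalized_2b))         # False
# Dividing by a constant before squaring changes nothing after normalization.
print(np.allclose(square_then_normalize, scale_square_normalize))  # True
```

The last comparison works because z-score normalization cancels any constant factor applied to the whole feature.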



Thank you for your reply!

Could you please clarify point 4?

"In Linear regression for example, we want to make sure the engineered feature to be linear with the predicting label, and with that, we might use (2a), (2b) or neither of them."

In which circumstances might using (2b) be appropriate? I tried this approach in practice for a previous version of this course, and it led to a result similar to (2a).

Basically, the question is: is (2b) ever used in practice, or is it better to forget about it, since it might lead to a less precise or incorrect model? Even though the answer seems obvious, it’s better to ask :) Thanks!

Hi @Mikhail_Linkov,

Very good question! I am happy that you asked this! Let’s consider the following 3 models:

  6. y = w_1 x + b
  7. y = w_2 x^2 + b
  8. y = w_1 x + w_2 x^2 + b

We usually start from (6), but then we find that it is not good enough, so we want to engineer a new feature. The question is, do we want to change from (6) to (7), or from (6) to (8)? This is a decision we need to make, and it affects how we engineer features.

If we want to change from (6) to (7), then we can use (2a) but we can’t use (2b). Why? Because only (2a) gives you just an x^2 term (and a constant term), with no x term. Rearranging (2b) shows more clearly that it also gives you an x term:

({\frac{x - mean(x)}{std(x)}})^2 = \frac{1}{{(std(x))}^2} x^2- 2\frac{mean(x)}{{(std(x))}^2}x + \frac{({mean(x)})^2}{{(std(x))}^2} = A x^2 + B x + C
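
The rearrangement above can be checked numerically. This is only a sketch on made-up data (the distribution and seed are assumptions): it fits a degree-2 polynomial in x to the (2b) feature and compares the fitted coefficients with A, B and C from the expansion.

```python
import numpy as np

# Hypothetical feature with a nonzero mean, so the x term is clearly visible.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)

m, s = x.mean(), x.std()
x2b = ((x - m) / s) ** 2  # feature (2b)

# Coefficients predicted by the expansion A x^2 + B x + C.
A = 1 / s**2
B = -2 * m / s**2
C = m**2 / s**2

# np.polyfit returns coefficients from the highest power down: [A, B, C].
fitted = np.polyfit(x, x2b, deg=2)

print(fitted)
print(A, B, C)
```

The fitted x coefficient matches B, which is nonzero whenever mean(x) is nonzero, so (2b) really does carry an x term.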

However, if we want to change from (6) to (8), we can use either (2a) or (2b)!
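
One way to see this is that, once x and an intercept are in the model, (2a) and (2b) span the same set of functions (constant, x, x^2), so a least-squares fit gives the same predictions either way. A sketch on synthetic data follows; the helper names, the data-generating formula and the seed are all assumptions for illustration.

```python
import numpy as np

def standardize(v):
    """Z-score normalization."""
    return (v - v.mean()) / v.std()

def fit_predict(features, y):
    """Least-squares fit of y on the given feature columns plus an intercept."""
    design = np.column_stack([*features, np.ones_like(y)])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ weights

# Synthetic data: a quadratic in x plus noise.
rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=200)
y = 3.0 * x**2 - 4.0 * x + 1.0 + rng.normal(0.0, 1.0, size=200)

x2a = standardize(x**2)   # (2a)
x2b = standardize(x)**2   # (2b)

# Model (8): x plus the engineered feature. Both choices fit identically.
print(np.allclose(fit_predict([x, x2a], y), fit_predict([x, x2b], y)))  # True

# Engineered feature alone: (2a) gives model (7), (2b) gives something else.
print(np.allclose(fit_predict([x2a], y), fit_predict([x2b], y)))        # False
```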

So, back to your question:

(2b) is very likely not used in practice, because (2a) alone is sufficient for us to achieve both (7) and (8).

However, please bear with me for also mentioning (2b) in my previous reply, because I don’t want to completely rule out the possibility of using it.

As you pointed out, in your experience (2a) and (2b) can give very similar results (I believe that was under the model assumption of (8) rather than (7)), so if I had just said that we don’t need (2b), you might have been confused, because (2b) worked just as well. I am glad to hear that you compared (2a) and (2b) yourself and questioned the need for (2b).



Hi @Mikhail_Linkov,

I forgot to mention one point. Given that (2b) gives you an x term, is it sufficient to use just (2b) without explicitly including x itself? In other words, are the following two models equivalent?

9a. y=w_2x_2+b
9b. y=w_1x + w_2x_2 + b

Note that x_2 = ({\frac{x - mean(x)}{std(x)}})^2, which is (2b), and x is just the original x.

The answer is no, they are not equivalent. In (9b) there are two weights we can tune, but in (9a) there is only one, so the x^2 and x terms hidden inside x_2 are always scaled together and cannot be adjusted independently. (9b) therefore has more freedom to fit itself to the data, and (9a) has less. So while both might perform similarly, (9b) should be able to do a better job of fitting the data.
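
As a rough illustration of the difference in freedom, the sketch below fits (9a) and (9b) by least squares on synthetic data and compares the sum of squared errors. The data-generating formula, the seed and the helper `sse_of_fit` are assumptions, not part of the course material.

```python
import numpy as np

def sse_of_fit(features, y):
    """Fit y on the feature columns plus an intercept; return the sum of squared errors."""
    design = np.column_stack([*features, np.ones_like(y)])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ weights
    return float(residuals @ residuals)

# Synthetic data: a quadratic in x plus noise.
rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=200)
y = 3.0 * x**2 - 4.0 * x + 1.0 + rng.normal(0.0, 1.0, size=200)

x_2 = ((x - x.mean()) / x.std()) ** 2  # (2b)

sse_9a = sse_of_fit([x_2], y)     # (9a): one weight plus intercept
sse_9b = sse_of_fit([x, x_2], y)  # (9b): two weights plus intercept

print(sse_9a, sse_9b)
```

Because (9a) is the special case of (9b) with w_1 = 0, the training error of (9b) can never be larger than that of (9a); how much smaller it is depends on the data.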


Thank you very much! Now it’s crystal clear :) Cheers!