Order of applying feature engineering and normalization

Mikhail_Linkov · July 12, 2022, 4:15pm

Hi, from the example in the course it seems that feature engineering should be performed before normalization.
But what happens if I change the order of operations: normalization first, and engineering after that?
For example, take a feature X(i) with values in the range [-10^9, 10^9]. If I add a new feature based on the existing one, the range of the squared value will already be [0, 10^18].

rmwkwok · July 13, 2022, 3:07am

Hi Mikhail,

Very interesting question!

1. It is true that a floating-point number is bounded by a limited range, and we can’t go beyond that:
   a. np.float32: -3.40e+38 to 3.40e+38
   b. np.float64: -1.79e+308 to 1.79e+308
2. However, switching the order doesn’t produce the same outcome:
   a. square first, then normalize: \frac{x^2 - mean(x^2)}{std(x^2)}
   b. normalize first, then square: \left(\frac{x - mean(x)}{std(x)}\right)^2 = \frac{x^2 - 2x\,mean(x) + mean(x)^2}{std(x)^2}
3. Also, even if we do (2b), we will still need to normalize the result afterwards, and that won’t bring us back to (2a) either.
4. Given that they don’t produce the same outcome, we need to decide which outcome is needed. In linear regression, for example, we want the engineered feature to be linear with the label we are predicting, and with that in mind we might use (2a), (2b), or neither of them.
5. If you just want to make sure that the feature will not overflow while being engineered because of the range limitation, you may simply divide it by a constant first, e.g. \frac{x}{10^{18}}, so that it becomes safe to work with:
   a. square first, then normalize: \frac{x^2 - mean(x^2)}{std(x^2)}
   b. scale first, then square, then normalize: \frac{\left(\frac{x}{10^{18}}\right)^2 - mean\left(\left(\frac{x}{10^{18}}\right)^2\right)}{std\left(\left(\frac{x}{10^{18}}\right)^2\right)} = \frac{x^2 - mean(x^2)}{std(x^2)}
   So such scaling makes no difference.
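Points 2 and 5 can be checked numerically. The following is only a minimal NumPy sketch: the random data, the seed, and the helper name `zscore` are assumptions made for illustration and do not come from the course material.

```python
import numpy as np

def zscore(v):
    """Standardize v to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
x = rng.uniform(-1e9, 1e9, size=1000)    # same range as in the question

# Largest finite values of the two float types listed in point 1.
print(np.finfo(np.float32).max, np.finfo(np.float64).max)

square_then_norm = zscore(x ** 2)            # (2a)
norm_then_square = zscore(x) ** 2            # (2b)
scaled_variant = zscore((x / 1e18) ** 2)     # (5b)

# Does the order of squaring and normalizing matter?
print(np.allclose(square_then_norm, norm_then_square))
# Does dividing by a constant before squaring change the normalized feature?
print(np.allclose(square_then_norm, scaled_variant))
```

On this sample the first comparison should report a mismatch and the second a match, in line with points 2 and 5.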

Cheers,
Raymond

Mikhail_Linkov · July 14, 2022, 4:33pm

Thank you for your reply!

Could you please clarify point 4:

In linear regression, for example, we want the engineered feature to be linear with the label we are predicting, and with that in mind we might use (2a), (2b), or neither of them.

In which circumstances might using (2b) be appropriate? I tried this approach in practice for a previous version of this course, and it led me to results similar to those with (2a).

The question is basically: is (2b) ever used in practice, or is it better to forget about it, since it might lead to a less precise or incorrect model? Even though the answer seems obvious, it is better to ask. Thanks!

rmwkwok · July 15, 2022, 1:58am

Hi @Mikhail_Linkov,

Very good question! I am happy that you asked this! Let’s consider the following 3 models:

6. y = w_1 x + b
7. y = w_2 x^2 + b
8. y = w_1 x + w_2 x^2 + b

We usually start from (6), but then we find that this is not good enough, so we want to engineer a new feature. The question is, do we want to change from (6) to (7), or from (6) to (8)? This is a decision we need to make, and it affects how we engineer features.

If we want to change from (6) to (7), then we can use (2a) but we can’t use (2b). Why? Because only (2a) gives you just an x^2 term (and a constant term) without an x term. We can rearrange (2b) to see more clearly that it also gives you an x term:

\left(\frac{x - mean(x)}{std(x)}\right)^2 = \frac{1}{std(x)^2} x^2 - \frac{2\,mean(x)}{std(x)^2} x + \frac{mean(x)^2}{std(x)^2} = A x^2 + B x + C
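As a quick sanity check of this rearrangement, here is a small NumPy sketch. The sample data and seed are arbitrary assumptions; the names A, B and C follow the equation above.

```python
import numpy as np

rng = np.random.default_rng(1)                  # arbitrary seed (assumption)
x = rng.normal(loc=5.0, scale=2.0, size=500)    # non-zero mean, so B is non-zero

mu, sigma = x.mean(), x.std()
A = 1.0 / sigma**2
B = -2.0 * mu / sigma**2
C = mu**2 / sigma**2

lhs = ((x - mu) / sigma) ** 2       # (2b): normalize first, then square
rhs = A * x**2 + B * x + C          # expanded form

print(np.allclose(lhs, rhs))        # the two forms agree up to rounding
print(A, B, C)                      # B is non-zero whenever mean(x) is non-zero
```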

However, if we want to change from (6) to (8), we can use either (2a) or (2b)!
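The difference between the two cases can be seen with an ordinary least-squares fit. This is a rough sketch only: the synthetic data, the seed, and the helpers `zscore` and `fit_predict` are hypothetical, and x is standardized before it is used as the linear feature in model (8).

```python
import numpy as np

def zscore(v):
    """Standardize v to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

def fit_predict(features, y):
    """Least-squares fit of y on the feature columns plus a bias column."""
    X = np.column_stack(features + [np.ones_like(y)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ w

rng = np.random.default_rng(2)                  # arbitrary seed (assumption)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x**2 + 1.0 + rng.normal(scale=0.5, size=x.size)   # made-up data

f_2a = zscore(x**2)       # square first, then normalize
f_2b = zscore(x) ** 2     # normalize first, then square

# Model (7): a single engineered feature plus bias.
pred7_2a = fit_predict([f_2a], y)
pred7_2b = fit_predict([f_2b], y)

# Model (8): x plus the engineered feature plus bias.
pred8_2a = fit_predict([zscore(x), f_2a], y)
pred8_2b = fit_predict([zscore(x), f_2b], y)

print(np.allclose(pred7_2a, pred7_2b))   # model (7): the choice matters
print(np.allclose(pred8_2a, pred8_2b))   # model (8): same fitted predictions
```

Under model (8) both feature sets span the same combinations of 1, x and x^2, which is why the fitted predictions coincide; the learned weights themselves will still differ.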

So, back to your question:

(2b) is very likely not used in practice, because (2a) is sufficient for us to achieve both (7) and (8).

However, please bear with me for also mentioning (2b) in my previous reply; I didn’t want to completely rule out the possibility of using (2b).

As you pointed out, in your experience (2a) and (2b) can give very similar results (I believe that was under the model assumption of (8) rather than (7)). If I had just said that we don’t need (2b), you might have been confused, because (2b) worked just as well. I am glad to hear that you compared (2a) and (2b) yourself and questioned the need for (2b).

Cheers,
Raymond

rmwkwok · July 15, 2022, 3:51am

Hi @Mikhail_Linkov,

I forgot to mention one point. Given that (2b) will give you an x term, does that mean it is sufficient to use just (2b), without explicitly including x itself? In other words, are the following two models equivalent?

9a. y=w_2x_2+b
9b. y=w_1x + w_2x_2 + b

Note that x_2 = \left(\frac{x - mean(x)}{std(x)}\right)^2, which is (2b), and x is just the original x.

The answer is no, they are not equivalent. In (9b) there are two weights we can tune, but in (9a) there is only one, so (9b) has more freedom to fit itself to the data and (9a) has less. In (9a) the x^2 and x terms coming from (2b) are tied together through the single weight w_2, whereas in (9b) the weight w_1 can adjust the x term independently. So while both might perform similarly, (9b) should fit the training data at least as well as (9a), and usually better.
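A small least-squares comparison illustrates this. It is a sketch under assumptions: the synthetic data, the seed, and the helper `sse` are made up for illustration.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals of the least-squares fit of y on X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ w
    return float(r @ r)

rng = np.random.default_rng(3)                  # arbitrary seed (assumption)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)   # made-up data

x_2 = ((x - x.mean()) / x.std()) ** 2           # the (2b) feature
ones = np.ones_like(x)                          # bias column

sse_9a = sse(np.column_stack([x_2, ones]), y)       # y = w_2 x_2 + b
sse_9b = sse(np.column_stack([x, x_2, ones]), y)    # y = w_1 x + w_2 x_2 + b

print(sse_9a, sse_9b)
```

Because (9a) is the special case of (9b) with w_1 = 0, the training error of (9b) can never be larger than that of (9a). How much smaller it is depends on the data.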

Cheers,
Raymond

Mikhail_Linkov · July 15, 2022, 11:31am

Thank you very much! Now it’s crystal clear Cheers!
