Consequences of unbalanced input variable types

For the sake of simplicity, let's say we have the following input variables (X):

  • 5 continuous variables normalized to be in the range between 0 and 1
  • 10 binary variables, coded either 0 or 1

What are the consequences of such an imbalance?

  • The values of the 5 normalized continuous variables are much more fine grained, taking values such as 0.2, 0.5, or 0.7 and only exceptionally reaching the maximal value of 1
  • On the other hand, the 10 binary variables, each outputting either the maximum value of 1 or the minimum value of 0, might strongly distort or drown out the fine-grained signal brought by the 5 normalized continuous variables (see the sketch after this list)
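
Here is a minimal sketch of the setup, just to compare the spread of the two blocks of features. The distributions (uniform continuous values, balanced 0/1 binaries) are assumptions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# 5 continuous features scaled to [0, 1] (assumed uniform, for illustration only)
continuous = rng.uniform(0.0, 1.0, size=(n, 5))

# 10 binary features (assumed balanced 0/1, for illustration only)
binary = rng.integers(0, 2, size=(n, 10)).astype(float)

X = np.hstack([continuous, binary])

# Per-feature spread: a balanced binary column has std ~0.5, a uniform [0, 1]
# column ~0.29, so the two types are not wildly out of scale, but the binary
# block dominates in sheer number of columns (10 vs 5).
print("continuous std:", continuous.std(axis=0).round(2))
print("binary std:    ", binary.std(axis=0).round(2))
```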

Any solutions or thoughts about this?


The results depend on the problem at hand. There's nothing inherently wrong with having more continuous or categorical variables.
Please read this link to get a better idea of how to choose the features for your model.


Thank you for this very interesting link @balaji.ambresh.

My concern was not really about their number but about a potential imbalance in their type. I tend to apply the same principles as in classical statistics: the inclusion of independent variables (features/predictors for us) should be theoretically grounded to avoid any spurious relationships, so I guess I should be fine on that front.

After some thought, my concern might not be justified: each feature enters the network through its own weights, and the complexity of connections in later layers should be able to compensate for such a relative imbalance.
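
For instance, here is a minimal sketch of that point (the layer sizes and the use of PyTorch are my own assumptions, not anything prescribed): in the first linear layer, every input feature gets its own column of weights, so training is free to shrink or enlarge the weights on the binary columns relative to the continuous ones.

```python
import torch
import torch.nn as nn

# Hypothetical tiny network: 5 continuous + 10 binary inputs, hidden size 8.
model = nn.Sequential(
    nn.Linear(15, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.rand(32, 5)                      # batch of continuous features in [0, 1]
b = torch.randint(0, 2, (32, 10)).float()  # batch of binary features
inputs = torch.cat([x, b], dim=1)

out = model(inputs)
print(out.shape)  # torch.Size([32, 1])

# Each of the 15 columns of the first layer's weight matrix corresponds to one
# input feature, which is why a relative scale difference between feature types
# can be absorbed by the learned weights during training.
print(model[0].weight.shape)  # torch.Size([8, 15])
```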