C2_W1_Assignment: Interaction terms and distribution assessment

It seems the interaction terms were introduced without a follow-up assessment of their distribution. While the original features were standardized, their products can still exhibit non-zero means, inflated variances, and skewed, non-Gaussian shapes—especially when the inputs are correlated or retain residual kurtosis.
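To make this concrete, here is a quick NumPy sketch on synthetic data (not the assignment's features): the product of two standardized but correlated variables ends up with a non-zero mean and inflated variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated features (synthetic), each standardized to mean 0, std 1.
x = rng.standard_normal(100_000)
y = 0.8 * x + 0.6 * rng.standard_normal(100_000)
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

xy = x * y  # interaction term

# For standardized inputs, E[xy] equals their correlation (~0.8 here),
# so the interaction term is not zero-mean even though each factor is,
# and its variance (1 + rho^2 for Gaussian inputs) exceeds 1.
print(round(float(xy.mean()), 2))
print(round(float(xy.std()), 2))
```

So an interaction term built from clean, standardized inputs can still need its own centering and scaling.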

Given that, I believe it would have been appropriate to evaluate the interaction terms for skewness and re-standardize them accordingly. Please feel free to correct me if I’ve misunderstood.

Additionally, I’m curious why the transformation strategy defaulted to log. Wouldn’t it be more robust to explore a range of options—such as log1p, square root, cube root, Box-Cox, or Yeo-Johnson—and select the best transformation for each feature based on a metric like skewness? That way, each variable gets the treatment most suited to its distribution.
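As a sketch of what I have in mind (the candidate list, the skewness criterion, and the toy feature are all illustrative, not from the assignment):

```python
import numpy as np
from scipy import stats

def best_transform(x):
    """Try several transforms and keep the one with the smallest
    absolute skewness. Candidates are gated by each transform's domain."""
    candidates = {"identity": x}
    if np.all(x > -1):
        candidates["log1p"] = np.log1p(x)
    if np.all(x >= 0):
        candidates["sqrt"] = np.sqrt(x)
    candidates["cbrt"] = np.cbrt(x)
    if np.all(x > 0):
        candidates["box-cox"] = stats.boxcox(x)[0]
    candidates["yeo-johnson"] = stats.yeojohnson(x)[0]
    name = min(candidates, key=lambda k: abs(stats.skew(candidates[k])))
    return name, candidates[name]

rng = np.random.default_rng(42)
x = rng.lognormal(0.0, 1.0, size=5_000)  # strongly right-skewed toy feature
name, xt = best_transform(x)
print(name, round(float(stats.skew(xt)), 2))
```

Applied per feature, this would give each variable the transform best suited to its own distribution rather than a blanket log.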

Looking forward to your thoughts.

Thanks, Franck

Hi @francktchafa

The transformation techniques you mentioned are indeed well-known methods for highly skewed data, but Box-Cox cannot handle zero or negative values, whereas Yeo-Johnson accepts both, though it is quite sensitive to small sample sizes. Also, keep in mind that the assignment you are working on uses a linear model, and it clearly states that the data's skewness is removed first and the distribution is then standardized.

Transformations like Box-Cox and Yeo-Johnson introduce a power parameter (lambda) that optimizes the data’s distribution towards normality. This parameter can be non-intuitive and difficult to explain to non-technical stakeholders, especially when the goal of the model is not just prediction but also understanding the underlying relationships.

Parameter Estimation: For Box-Cox and Yeo-Johnson, the optimal lambda parameter is typically estimated from the training data. If the distribution of unseen data differs significantly from the training data, applying the same lambda might not achieve the desired transformation or could even introduce new biases.
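To illustrate the point about lambda, here is a hedged sketch with scikit-learn's `PowerTransformer` (synthetic data, distributions chosen only for illustration): the lambda is estimated from the training set alone and then reused, as-is, on unseen data drawn from a different distribution.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
train = rng.lognormal(0.0, 0.5, size=(500, 1))
test = rng.lognormal(2.0, 1.5, size=(500, 1))  # unseen data, shifted distribution

pt = PowerTransformer(method="yeo-johnson", standardize=True)
pt.fit(train)  # lambda is estimated from the training data only

# The same lambda is reused at inference time; it was tuned to normalize
# the training distribution, not whatever arrives later.
train_skew = float(skew(pt.transform(train).ravel()))
test_skew = float(skew(pt.transform(test).ravel()))
print(round(train_skew, 2), round(test_skew, 2))
```

The training-set skew lands near zero by construction; the test-set skew is whatever the frozen lambda happens to produce.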

Handling New Values: Log and square root transformations have restricted domains (for example, log is undefined for non-positive values and square root for negative values). While log1p addresses the zero issue for log, if unseen data contains values outside the expected range or with different characteristics than the training data, these transformations might fail or produce unexpected results.
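A small SciPy/NumPy demo of those domain restrictions (toy arrays, chosen only to trigger each case):

```python
import numpy as np
from scipy import stats

x_with_zeros = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
x_with_negatives = np.array([-3.0, -1.0, 0.0, 2.0, 4.0])

# Box-Cox requires strictly positive input, so zeros raise an error.
try:
    stats.boxcox(x_with_zeros)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

# Yeo-Johnson accepts zeros and negatives.
yt, lam = stats.yeojohnson(x_with_negatives)

# log1p extends log's domain to x > -1, but values below that still fail.
log1p_ok = np.log1p(x_with_zeros)
with np.errstate(invalid="ignore"):
    log1p_bad = np.log1p(np.array([-2.0]))  # produces nan: outside the domain

print(boxcox_ok, bool(np.isnan(log1p_bad[0])))
```

If unseen data can cross one of these boundaries, the transform silently breaks at inference time, which is exactly the risk described above.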

Lastly, Maintaining Model Assumptions: While transformations aim to satisfy model assumptions (such as normality of residuals), if the unseen data's distribution is vastly different, the transformation might not effectively normalize it, potentially violating the assumptions of the trained model and leading to less accurate predictions.

My overall understanding is that the assignment model is a much simpler one: skewness was removed and the data standardized before fitting a binary model to detect whether a patient has retinopathy, based on the selected features.

Regards

DP

@Deepti_Prasad Thanks for your insight.

Would it be reasonable to assess interaction terms separately for skewness and then re-standardize them post-transformation, given that their distributions can diverge from those of the original features—even in a simple linear model?
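Something like this scikit-learn pipeline is what I have in mind (a sketch on synthetic data; the feature count and transformer choices are mine, not the assignment's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer

rng = np.random.default_rng(1)
X = rng.lognormal(size=(200, 3))  # three skewed toy features

# PolynomialFeatures appends the pairwise products; PowerTransformer then
# reduces skew column by column and re-standardizes, so each interaction
# term is treated according to its own distribution.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    PowerTransformer(method="yeo-johnson", standardize=True),
)
Xt = pipe.fit_transform(X)
print(Xt.shape)  # (200, 6): 3 originals + 3 pairwise interactions
print(np.round(Xt.mean(axis=0), 2))  # each column re-centered near 0
```

Ordering the steps this way means the interaction terms are transformed and standardized after they are created, rather than inheriting whatever was done to the original features.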

As I said earlier, of course this can be done when you have a larger sample size, but remember that the assignment we are working on has a very small sample size. These transformation techniques are best used on highly skewed data with a larger sample size.