My doubt is: why aren’t we using the well-known z-score normalisation or mean normalisation? Is there any special reason why we only subtract the mean and don’t also divide by the standard deviation or the range?
Regards
PCA doesn’t care about the magnitude of the features - only that they have a zero mean.
You can use any scaling you want.
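A minimal sketch of that point, assuming a small made-up array with two features on very different scales: PCA only needs the columns to be mean-centered, and dividing by the standard deviation is a separate, optional choice that changes how much each feature contributes.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 220.0],
              [3.0, 290.0],
              [4.0, 410.0],
              [5.0, 500.0]])

# Mean-centering is the part PCA requires
X_centered = X - X.mean(axis=0)

# Optional extra step: also divide by the standard deviation (z-score).
# This changes how much each feature contributes to the components.
X_scaled = X_centered / X.std(axis=0)

# PCA via SVD on the centered (or centered-and-scaled) data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
print("Principal directions (centered only):\n", Vt)

U2, S2, Vt2 = np.linalg.svd(X_scaled, full_matrices=False)
print("Principal directions (z-scored):\n", Vt2)
```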
In the context of recommender systems, especially collaborative filtering, the normalization approach is usually chosen to fit the problem’s requirements and the characteristics of the data. Here we subtract the mean $\mu_i$ from each rating $y(i, j)$, which corresponds to mean normalization. The decision to use simple mean subtraction, rather than more complex methods such as z-score or min-max scaling, is driven by practical considerations. It is particularly well suited to sparse data: in collaborative filtering most users rate only a few items, and mean subtraction still works when many values are missing, as long as there are enough observed ratings to compute the mean for each user/item.
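As a rough sketch, with a made-up ratings matrix (rows as items, columns as users, matching the $y(i, j)$ / $\mu_i$ notation above, and `np.nan` marking unrated entries), the per-item mean is computed only over the ratings that exist:

```python
import numpy as np

# Hypothetical ratings matrix: rows = items, columns = users, np.nan = not rated
Y = np.array([[5.0, 4.0, np.nan, 1.0],
              [np.nan, 3.0, 3.0, np.nan],
              [1.0, np.nan, np.nan, 5.0]])

# mu_i: mean of each item's observed ratings only (missing entries are ignored)
mu = np.nanmean(Y, axis=1, keepdims=True)

# Mean normalization: subtract mu_i from every observed rating y(i, j)
Y_norm = Y - mu   # missing entries stay NaN and are simply skipped during training

print("Per-item means:", mu.ravel())
print("Mean-normalized ratings:\n", Y_norm)

# At prediction time the mean is added back: y_hat(i, j) = w_j . x_i + b_j + mu_i
```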
Robustness is another consideration: dividing by the standard deviation (as in z-score normalization) can cause issues when there are outliers or when the variance is low for a particular user/item. If a user has rated only a few items and those ratings are close together, the standard deviation is very small, which inflates the normalized values. Mean normalization avoids this because it applies no scaling factor at all; it only subtracts the mean.
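A quick numeric illustration of that failure mode, with invented ratings: a user who rated only a few items, all with nearly the same score, has a tiny standard deviation, so z-scoring exaggerates a negligible difference while plain mean subtraction stays well-behaved.

```python
import numpy as np

# Hypothetical user who rated only three items, almost identically
ratings = np.array([4.0, 4.0, 4.1])

mu = ratings.mean()
sigma = ratings.std()

print("std:", sigma)                         # ~0.047: very small
print("mean-subtracted:", ratings - mu)      # differences stay on the rating scale
print("z-scored:", (ratings - mu) / sigma)   # a 0.1-point gap becomes ~2 z-score units

# And if every rating were identical, sigma would be 0 and the division undefined.
```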
In addition, many recommender systems focus on learning a model of user preferences relative to the mean rating, which can be more stable and interpretable. Subtracting the mean makes it easier to decompose the ratings into latent factors (user preferences and item characteristics) without adding the complexity of scaling by standard deviation or range.
As @TMosh mentioned, PCA doesn’t inherently care about magnitude but does require zero mean data. The scaling decision (e.g. z-score or min-max) depends on your data and the problem you’re trying to solve, as it affects how much each feature contributes to the principal components.
@nadtriana I know you are good at this, so I am willing to offer myself up as a sacrifice* -- I still haven’t wrapped my mind around when you’d choose L1 vs. L2, or, say, Ridge or not.
If you can edify us on this point, I think it would help inform someone besides just me.
I guess I am not seeing/understanding.
*Sacrifice because I ask a ‘dumb question’
Let me say that there are no “dumb questions”, especially when discussing complex topics like L1/L2 regularization. First, L1 regularization (Lasso) tends to shrink some weights to exactly zero. If you have many features, Lasso can automatically select the most important ones by eliminating the less important (zero-weighted) ones. Second, L2 regularization (Ridge) shrinks the weights but doesn’t necessarily make them zero. Instead, it tends to reduce all the weights, pulling them closer to zero, while still keeping all the features in the model.
Think of Ridge (L2) as someone trimming the edges of a bush - gradually reducing all the parts, but still keeping all the leaves. On the other hand, Lasso (L1) is more like a pruning tool that cuts off entire branches, leaving only a sparse, minimalist version of the bush.
Sometimes, both penalties (L1 and L2) are useful, which gives us the Elastic Net. This method combines L1 and L2 penalties, giving you a balance between Ridge’s continuous shrinkage and Lasso’s feature selection. If you’re not sure which to use, Elastic Net might be a good middle ground that balances their strengths.
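If it helps to see the difference concretely, here is a small scikit-learn sketch on synthetic data (the alpha values are arbitrary, just for illustration) comparing the learned coefficients: Lasso drives some of them to exactly zero, Ridge only shrinks them, and Elastic Net sits in between.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic regression problem: 10 features, only 3 of them actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

np.set_printoptions(precision=2, suppress=True)
print("Ridge coefficients:      ", ridge.coef_)   # all shrunk, none exactly zero
print("Lasso coefficients:      ", lasso.coef_)   # typically several exactly zero
print("Elastic Net coefficients:", enet.coef_)    # typically a mix of shrinkage and zeros
```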
@nadtriana I feel this suffices to expand my understanding. I honestly did not know. Thank you. Now I get to go back to painting the house.