K-means distortion cost function

I’ve lived a lie, all my life I knew this thing as variance, as an estimator of a set of data points, now it ends up having a much fancier name.

Am I getting something wrong or what it tries to minimize is the variance of the closest data points? Thanks in advance

Hello @tgolvano,

No, it is not a lie. Here we are going to discuss a thing with three of their properties: name, appearance, and purpose.

Variance and the K-means’ cost has a different name, as you pointed out; very similar appearance, which perhaps is why you have posted this; but a very different purpose, which we should also consider in order for us to conclude that it is not a lie.

We can skip the names for now.

For the appearance, generally speaking, x^{(i)} is a vector, and the variance of a vector (actually called the variance-covariance matrix of a vector) is NOT in the form that is shown there. I do not want to get to that details here but if you search for it, you will see something very different. At least that it is matrix for the real variance of a vector; BUT it is a scalar for the cost for K-means. The matrix is only reduced to also a scalar only if x^{(i)} is a scalar, and in this single, very special scenario might you refer the cost as the sum of variance of different groups. BUT it is a very special scenario which means it is generally NOT TRUE that the variance of x^{(i)} is mathematically equal to the cost for K-means.

Now comes the purposes. They have different names which reflect their different purposes. When x^{(i)} is a vector, the (I am going to use the actual name from now on) variance-covariance matrix speaks about the variance of each dimension of the vector, and the covariance between any two dimensions of the vector. The cost function for K-means, however, measure the distances between each x^{(i)} to its group center \mu_c^{(i)} by taking the L2-Norm (the \|..\| thing) which effectively ALWAYS reduces all dimensions into a scalar. We use the scalar to determine collectively how close the data points are to their respective centers. This is the different purpose.

All in all, they are different.

It might take you reading my response for a few times because I am trying to discuss the differences without getting too in-depth of them, so you might need to “close some loops” yourself.

Let me know if you have any follow-ups.