How do different initializations of K-means centroids result in drastically different clusters? They all share a common cost function

In the lecture, we studied that there are local optima and a global optimum, and that one initialization can lead to a local optimum while another can lead to the global one.

I do not understand how that happens. Is there a mathematical proof of that?

Hi, @Tanvi

It has to do with the data itself. If your clusters have well-defined boundaries between them, you can be confident that you will find (almost) the same centroids every time, and the result will depend only on the parameter k.

But if your data does not have clear boundaries between the clusters, initializing the centroids randomly will give different results depending on the initialization.

Recall that you initialize the centroids randomly and keep updating them until they settle at a good point.

My suggestion is to use the random_state parameter with a fixed value to guarantee that you get the same centers each time you fit your model:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
...
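To illustrate the point about initialization, here is a minimal sketch (assuming scikit-learn is installed; the dataset is synthetic, generated with make_blobs purely for illustration). With n_init=1 each run keeps a single random initialization, so the final inertia (the K-means cost) can differ from seed to seed, i.e. different runs settle in different local optima of the same cost function:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# overlapping blobs: no clear boundaries between the clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.5, random_state=0)

# one random initialization per run (n_init=1), a different seed each time
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")

If the printed inertias differ, those runs converged to different local optima; increasing n_init makes K-means try several initializations and keep the one with the lowest inertia.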

@Tanvi, is there any mechanism for us to move away from a local optimum?

Hi there,

Please bear in mind that a training process is in general not deterministic but rather stochastic, e.g. due to:

  • sampling from your training data (mini-batches), i.e. stochastic gradient descent
  • dropout (where neurons are randomly "shut off") for regularization
  • initialization (if you sample from a distribution)
  • etc.

So, in general, there is some randomness in your training process.

Depending on your actual problem and the mechanics behind your loss function, you might encounter several local minima. Here your solver can get "stuck" due to the randomness described above [1]. If you run the training several times, you might end up in different optima.

Note: you can prevent this by choosing a pseudo-random seed to ensure repeatability [2]. (At least you will get the identical result from your optimization routine. But it's not clear whether or not this is the global optimum. Frankly speaking, in reality this might not even be required, as long as your model is sufficient in fulfilling your business needs and is robust enough [does not overfit].)
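For illustration, a minimal sketch of what fixing a pseudo-random seed can look like in practice (assuming NumPy is installed; the frameworks in the comments are optional, and the value 42 is arbitrary):

import random
import numpy as np

random.seed(42)      # Python's built-in RNG
np.random.seed(42)   # NumPy's global RNG (also used by scikit-learn when no random_state is passed)

# deep learning frameworks keep their own RNGs, which must be seeded separately, e.g.:
# import tensorflow as tf; tf.random.set_seed(42)
# import torch; torch.manual_seed(42)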

Useful links:

Best
Christian

In addition, solvers like Adam (Adaptive Moment Estimation) utilize the "momentum" of the optimization trajectory to overcome local minima and, ideally, reach the global optimum eventually.
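As a toy sketch of that momentum idea (plain momentum rather than Adam itself, on an invented 1-D loss): the accumulated velocity can carry the iterate across a shallow local minimum that plain gradient descent, started from the same point with the same learning rate, would settle in.

def grad(w):
    # gradient of the non-convex toy loss f(w) = w**4 - 4*w**2 + w,
    # which has a shallow minimum near w = 1.37 and a deeper one near w = -1.47
    return 4 * w**3 - 8 * w + 1

w, v = 2.5, 0.0           # start in the basin of the shallow minimum
lr, beta = 0.01, 0.9      # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v - lr * grad(w)   # accumulate a velocity ("momentum")
    w = w + v                     # step along the velocity, not just the current gradient
print(w)                  # should print a value close to -1.47, the deeper minimum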


Best
Christian


Hi, Raymond

Recently I started taking a course titled Convex Optimization for Machine Learning, where "convex" describes functions that have only a single minimum, which implies that it is the global minimum. In the lectures I learned that not every function is convex by nature, but there is a way to convert a function that is not convex on one set into a function that can be considered convex on another set. One of the methods uses the convex hull, defined as the smallest convex set that contains the function. If we can find such a hull, we can then convert our function to a convex one and find its global minimum easily.
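A purely formal sketch of that idea, using the standard definition of the convex envelope (which I assume is what the course calls the convex hull of a function):

\[
\operatorname{conv} f \;=\; \sup\{\, g : g \text{ is convex},\; g \le f \,\},
\]

i.e. the largest convex function lying below \(f\); its epigraph is (essentially) the convex hull of \(\operatorname{epi} f\). Minimizing \(\operatorname{conv} f\) gives a lower bound on \(\min f\), and the two coincide when \(f\) is already convex.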

Hello @Moaz_Elesawey,

Convex relaxation? Does the course give a simple / practical example that shows, before and after applying the technique, what difference it makes? Can you share the example?

Raymond

Hi Raymond

For now, we have only just scratched the surface of Convex Optimization; all of what we have covered is theoretical and very mathematical (which I am not very good at :sweat_smile:).

If you want the materials and notes, I have them in this GitHub Repo

I really don't have any practical example in mind right now, but the ultimate goal of this technique is to enable us to find the minimum more easily because, as I said, convex means the function has only one minimum, which means it is the global one.

Hello Moaz,

Thank you for sharing the repo with me. I read the notes and it seems the best is yet to come; so far those chapters only seem to be setting the stage. The Lagrange multiplier is very interesting stuff :wink: and I learnt it when I studied Lagrangian Mechanics.

Raymond

Hi, Raymond

You're welcome. You're totally right, we have not yet gone very deep into the subject; we barely touched it in the last lecture.

I hope it will be easy to study, as my mathematical foundation is not that great.

Best regards,
Moaz

Moaz, let me know if there is any maths stuff in those chapters that you want to discuss. :wink:

Raymond


Thank you very much Raymond, I really appreciate it. I will let you know if I need any help.

Best regards,
Moaz

Hello @Moaz_Elesawey,

I came across this video explaining the Lagrange multiplier visually.

Raymond


Hi, @rmwkwok

I am very thankful to you and I really appreciate your help. I will certainly watch it.
Thank you again.

Sincerely,
Moaz


Hello @Moaz_Elesawey,

Me again. I have recently been reading about the Lagrange multiplier again and suddenly realized that regularization terms can be seen as an example of using a Lagrange multiplier. I found this discussion on Google.
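Roughly, as a sketch of the standard argument (writing \(J(w)\) for the unregularized cost, which is just my notation here): constrain the weights, e.g. \(\lVert w \rVert_2^2 \le t\), and form the Lagrangian

\[
\mathcal{L}(w, \lambda) \;=\; J(w) + \lambda \left( \lVert w \rVert_2^2 - t \right).
\]

For a fixed multiplier \(\lambda \ge 0\), minimizing this over \(w\) is the same as minimizing the familiar penalized cost \(J(w) + \lambda \lVert w \rVert_2^2\), since the constant term \(-\lambda t\) does not change the minimizer.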

I think it's nice to connect two topics that we have learnt.

Cheers,
Raymond
