I just completed the K-Means lab assignment. I noticed that when I set the number of colors for the image compression from the video to a high value, the program cannot get above about 240 colors. I tried it as-is in the Jupyter notebook and also wrote the same program in plain Python, with (approximately) the same result: in the plain Python program I could do 244 colors but not 245. When I set the number of colors (the K value) too high, I get two non-fatal errors; the program runs, but the compressed image does not display properly.
Also, the cost was discussed in the lecture, but we do not monitor it in the algorithm. What do you use as a termination criterion if you are not monitoring the cost? How many iterations are enough?
For the part about monitoring the cost: you can use a stopping criterion such as the model reaching a certain percentage of accuracy. This is not something you implement by hand in Python; there are libraries that do it for you. Search for the callback functions in TensorFlow, which do what you asked for.
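For example, a minimal sketch of such a callback in TensorFlow/Keras (this is for training a neural network, so it is separate from the K-Means lab itself):

```python
import tensorflow as tf

# Stop training once the monitored metric has not improved for 3 epochs.
stop_early = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=3)

# It is passed to fit() like this (model, x_train, y_train are placeholders):
# model.fit(x_train, y_train, epochs=100, callbacks=[stop_early])
```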
What changes did you make to set 245 colors? Can you share the lines of code you changed, in some screenshots?
```python
# Run your K-Means algorithm on this data
# You should try different values of K and max_iters here
K = 246          # <-- this is the value I changed
max_iters = 10

# Using the function you have implemented above.
initial_centroids = kMeans_init_centroids(X_img, K)
# print(initial_centroids)

# Run K-Means - this takes a couple of minutes
centroids, idx = run_kMeans(X_img, initial_centroids, max_iters)
```
What are those errors? Can you share the error messages in some screenshots?
Please see uploaded screenshot
I noticed also that the idx array has all 1’s when this happens…
I see. Dave, the first thing I would do to debug is to print the output of the K-Means algorithm, namely centroids and idx. It is likely that you will see many np.nan values, which indicate that something anomalous happened in the algorithm.
The next check I would do is to insert some print lines inside run_kMeans to print the contents of centroids and idx again, but this time observing how they change from iteration to iteration. That way you will be able to see at which iteration np.nan starts to appear (there is a small sketch of this after the list below). I recommend that you do these checks and think them through; you should come to the following conclusions:
- a np.nan appears in the variable centroids as soon as one of the candidate centroids turns out not to be the closest centroid to any pixel. The chance of this happening goes up with a larger K (more centroids), which is why you may hit it when K = 246. I tried two runs, one successful and one failed: a large K is not destined to fail, but it has a much higher chance of failing than a small K.
- once a np.nan appears, in the next iteration almost all the other entries of centroids become np.nan too, because of the use of np.argmin in find_closest_centroids.

To handle this situation better, we need a way to avoid the appearance of the first np.nan, which means deciding what to do when a centroid is found not to be closest to any pixel. This is an open question, and although I can share two possible solutions with you, they may not make sense until you have investigated the problem and understood both the underlying cause and the algorithm itself. The two possible solutions are:

- replace the centroid that is not closest to any pixel with another randomized one;
- keep that centroid unchanged instead of letting it become np.nan.
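Once you have done that investigation yourself, here is a minimal sketch of both the per-iteration check and the two options. It assumes the X, idx, and centroids variables from the lab; compute_centroids_safe and its extra centroids_prev argument are my own names, not part of the assignment:

```python
import numpy as np

# 1) Debugging: inside the for-loop of run_kMeans, right after compute_centroids,
#    a couple of lines like these show at which iteration np.nan first appears:
#        n_nan = np.isnan(centroids).any(axis=1).sum()
#        print(f"iteration {i}: {n_nan} centroid(s) contain np.nan")

# 2) Handling the empty-cluster case (the two options above):
def compute_centroids_safe(X, idx, K, centroids_prev):
    n = X.shape[1]
    centroids = np.zeros((K, n))
    for k in range(K):
        points = X[idx == k]                    # pixels assigned to centroid k
        if len(points) == 0:
            # option 2: keep the old centroid instead of producing np.nan
            centroids[k] = centroids_prev[k]
            # option 1 (alternative): replace it with a random pixel
            # centroids[k] = X[np.random.randint(len(X))]
        else:
            centroids[k] = points.mean(axis=0)  # usual K-Means update
    return centroids
```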
======================================
To print the cost, I think you only need to implement the cost formula, which is covered in the video “Optimization objective” in Course 3 Week 1, and then evaluate it inside the for-loop of run_kMeans so that it prints the cost once per iteration and you can monitor it. I will leave the implementation to you.
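In case it is useful as a starting point, a minimal sketch of that distortion cost using the X, centroids, and idx variables from the lab (wiring the print into run_kMeans is still left to you):

```python
import numpy as np

def compute_cost(X, centroids, idx):
    """Distortion J = (1/m) * sum_i ||x^(i) - mu_(c^(i))||^2,
    where idx[i] is the index of the centroid assigned to example i."""
    m = X.shape[0]
    diffs = X - centroids[idx]        # each example minus its assigned centroid
    return np.sum(diffs ** 2) / m

# Inside the for-loop of run_kMeans, after idx and centroids are updated:
# print(f"iteration {i}: cost = {compute_cost(X, centroids, idx):.6f}")
```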
As a follow-up, I ran 10 iterations and saw many (20 or so) incidents of ‘np.isnan’ on the first iteration, 1 incident on iterations 2-5, and 0 on iterations 6-10, so the routine appears to be solid.
Dave @Dave_Bostain, that is an interesting observation. I have not investigated it in detail, but it looks reasonable to me.
If I have to guess, it is easier to see np.nan in the first iteration than afterwards because:

- the initial set of centroids is chosen by randomly picking K pixels, and the chance that some of them share the same color is higher when K is larger;
- if centroid number 1 and centroid number 2 share the same color, find_closest_centroids always selects centroid number 1 instead of 2, so centroid number 2 ends up with no cluster members and becomes np.nan in compute_centroids.

Therefore, “same-color centroids” is one mechanism for generating np.nan.
The key problem is that the centroid initialization process favors this mechanism, so we see more np.nan in the first iteration, as you reported.
After the first iteration, however, the centroids are re-computed by taking means of pixel colors, and mean values are much less likely to coincide, so the mechanism is no longer favored and we should see less np.nan afterwards.
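A tiny, made-up illustration of that tie-breaking behaviour: when two centroids are identical, np.argmin always returns the first index, so the duplicate never collects any members.

```python
import numpy as np

pixel = np.array([0.2, 0.4, 0.6])
centroids = np.array([
    [0.2, 0.4, 0.6],   # centroid 0
    [0.2, 0.4, 0.6],   # centroid 1: same color as centroid 0
    [0.9, 0.9, 0.9],   # centroid 2
])

# np.argmin breaks ties by returning the first minimum, so centroid 1
# is never chosen as the closest centroid for this pixel.
distances = np.linalg.norm(centroids - pixel, axis=1)
print(np.argmin(distances))   # -> 0
```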
You said “many (20 or so) incidents of ‘np.isnan’ on the first iteration”: considering that you are picking 2000 centroids from 128 × 128 pixels, which is roughly 1 in 8, and that adjacent pixels tend to have the same color, “many incidents” is not surprising given this mechanism.
You said “1 incident on iterations 2-5”: it is again reasonable that the mechanism plays a much less significant role there, and that single incident may be caused by this mechanism or by some other, minor one.
Actually, if my guess is supported, it would suggest changing how we initialize the centroids: we should avoid redundant centroids in the first place, because it is more reasonable to ask for K distinct centroids than simply K centroids (see the sketch below).
However, even if we avoid redundant centroids, that only removes one mechanism, and only in the first iteration. Being able to handle np.nan is the more important fix, so the change you have made to your algorithm should still be kept.
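If you want to experiment with that idea, here is a rough sketch of a “K distinct centroids” initialization. The function name is my own, and it assumes X_img is the (num_pixels, 3) array used in the lab:

```python
import numpy as np

def kMeans_init_distinct_centroids(X, K):
    """Pick K centroids from the unique pixel colors, so no two are identical."""
    unique_colors = np.unique(X, axis=0)             # drop duplicate colors
    if len(unique_colors) < K:
        raise ValueError(f"only {len(unique_colors)} distinct colors available")
    picks = np.random.choice(len(unique_colors), size=K, replace=False)
    return unique_colors[picks]
```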