Great question! Here is how the reduction works. Initially, we can use up to 256 * 256 * 256 = 16,777,216 colors to represent the whole color space, since each of R, G, and B takes one of 256 values (0 to 255). Of course, the photo does not use each and every one of these colors, so the actual number of colors used is smaller.
Although it is smaller, it is still a very large number, so what if we could reduce it to something like 16, meaning we pick the 16 colors that best represent the photo among all 16,777,216 possible choices?
Achieving this would be awesome, because afterwards we only need to remember 16 colors (instead of 16,777,216), and at each pixel we no longer need three numbers in the range 0-255 to describe a color by its RGB values; instead we need only one number in the range 0-15. For example, 0 can represent red, 1 can represent white, 3 can represent brown, and so on. The numbers are no longer a continuous representation of how the intensity of Red, Green, or Blue changes; each number from 0 to 15 stands for one discrete color that is representative of the photo in question. Note that it takes only 4 bits (not 16 bits) to remember 16 distinct numbers, since 2^4 = 16.
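To make the savings concrete, here is a minimal, runnable sketch of the arithmetic. The 128 x 128 photo size is an assumption I am making just so the pixel count comes out to 16384, and the random pixel values are placeholders, not real image data:

```python
import numpy as np

# Hypothetical 128 x 128 RGB photo (128 * 128 = 16384 pixels), values 0-255.
photo = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)

# Flatten to one row per pixel: shape (16384, 3).
pixels = photo.reshape(-1, 3)

# Before compression: 3 channels x 8 bits = 24 bits per pixel.
bits_before = pixels.shape[0] * 24

# After compression: one 4-bit index per pixel, plus the 16-color
# palette itself (16 colors x 24 bits).
bits_after = pixels.shape[0] * 4 + 16 * 24

print(bits_before, bits_after)  # 393216 vs 65920, roughly a 6x reduction
```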
The shape of centroids is (16, 3) because there are 16 centroids, and each centroid is a color represented by its R, G, and B values.
The shape of idx is (16384,) because the flattened photo has 16384 pixels, and idx contains, for each pixel, the ID (between 0 and 15) of the closest centroid. In other words, it records which of the 16 colors will be used to represent each pixel.
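In case it helps, here is a small sketch of how idx relates to centroids. The values here are random placeholders standing in for real K-means output (in the assignment, centroids would come out of the K-means loop), and the pixel values are assumed to be scaled to [0, 1]:

```python
import numpy as np

pixels = np.random.rand(16384, 3)   # flattened photo, one RGB row per pixel
centroids = np.random.rand(16, 3)   # the 16 chosen colors

# Distance from every pixel to every centroid, via broadcasting:
# (16384, 1, 3) - (1, 16, 3) -> (16384, 16, 3), then norm over the last axis.
dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)

# For each pixel, the ID (0-15) of the closest centroid.
idx = np.argmin(dists, axis=1)

print(centroids.shape)  # (16, 3)
print(idx.shape)        # (16384,)
```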
centroids[idx, :]

will return an array of shape (16384, 3), because for each pixel it takes the centroid whose ID is stored in idx out of the centroids array. (This is NumPy's integer, or "fancy", indexing: idx is simply used as a list of row indices.)
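Here is a tiny demonstration of that lookup, again with placeholder data (and the same assumed 128 x 128 photo size):

```python
import numpy as np

centroids = np.random.rand(16, 3)            # 16 colors
idx = np.random.randint(0, 16, size=16384)   # one color ID per pixel

# Row idx[i] of centroids is selected for pixel i, so the result
# has one RGB row per pixel.
recovered = centroids[idx, :]
print(recovered.shape)  # (16384, 3)

# Reshape back to the original image layout to display it.
photo_compressed = recovered.reshape(128, 128, 3)
```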
Raymond