K-means question

In the k-means formula:

Why is the formula expressed in terms of a vector when the point is represented as a matrix? point = [x1, x2]

Sorry, I’m not sure what the issue (or, in a sense, the difference) is in representing a point as a vector?

Or, strictly speaking, [x1, x2] is a vector, not a matrix, because, at least in those terms, it has only one ‘dimension’ (a single axis).

Secondly, I believe it is more technically correct to refer to it as a vector, given the linear algebra methods we apply to it. At its most basic, a vector can be seen as a reference to a point.
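To make the distinction concrete, here is a minimal sketch in plain Python (no libraries). The variable names and the sample numbers are made up for illustration; the only thing it is meant to show is that a single point is a vector, a whole dataset is a matrix with one row per point, and the distance used in k-means is computed between two vectors.

```python
# A single point with two features is a vector: one axis of length 2.
point = [1.0, 2.0]
centroid = [4.0, 6.0]

# A dataset of n points is a matrix: n rows, one row (vector) per point.
X = [
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


print(squared_distance(point, centroid))  # 25.0, i.e. (1-4)^2 + (2-6)^2
print(len(X), len(X[0]))                  # 3 2, i.e. 3 points, 2 features each
```

So the formula works on one row of the matrix at a time, and each row is a vector.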

Hi @gmazzaglia

Good question. The screenshot you shared describes K-means as a vector quantization method: an iterative process that assigns each data point to a group, so that the data points gradually get clustered based on similar features.

Also, a vector is an array of numerical values that expresses the location of a point along several dimensions.

  1. idx = kmeans(X,k) performs k-means clustering to partition the observations of the n-by-p data matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.

By default, kmeans uses the squared Euclidean distance metric and the k-means++ algorithm for cluster center initialization.

Cluster indices, returned as a numeric column vector. idx has as many rows as X, and each row indicates the cluster assignment of the corresponding observation.

  2. idx = kmeans(X,k,Name,Value) returns the cluster indices with additional options specified by one or more Name,Value pair arguments.

For example, specify the cosine distance, the number of times to repeat the clustering using new initial values, or to use parallel computing.

  3. [idx,C] = kmeans(___) returns the k cluster centroid locations in the k-by-p matrix C.

Cluster centroid locations, returned as a numeric matrix. C is a k-by-p matrix, where row j is the centroid of cluster j. The location of a centroid depends on the distance metric specified by the Distance name-value argument.

  4. [idx,C,sumd] = kmeans(___) returns the within-cluster sums of point-to-centroid distances in the k-by-1 vector sumd.

Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. sumd is a k-by-1 vector, where element j is the sum of point-to-centroid distances within cluster j. By default, kmeans uses the squared Euclidean distance (see 'Distance' metrics).

  5. [idx,C,sumd,D] = kmeans(___) returns distances from each point to every centroid in the n-by-k matrix D.

Distances from each point to every centroid, returned as a numeric matrix. D is an n-by-k matrix, where element (j,m) is the distance from observation j to centroid m. By default, kmeans uses the squared Euclidean distance (see 'Distance' metrics).
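To see how the shapes of idx, C, sumd and D relate to the n-by-p matrix X, here is a rough sketch in plain Python. It is not the kmeans function described above: `kmeans_sketch` and `sq_dist` are hypothetical names, the centroids are seeded with the first k rows of X instead of k-means++, and cluster indices start at 0 rather than 1. It only illustrates the assign-then-update loop and the four outputs.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_sketch(X, k, n_iter=100):
    """Basic assign/update loop (assumption: first k rows of X seed the centroids)."""
    C = [list(row) for row in X[:k]]  # k-by-p centroid matrix
    idx = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        D = [[sq_dist(x, c) for c in C] for x in X]
        new_idx = [row.index(min(row)) for row in D]
        if new_idx == idx:  # assignments stopped changing
            break
        idx = new_idx
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [x for x, i in zip(X, idx) if i == j]
            if members:  # leave an empty cluster's centroid unchanged
                C[j] = [sum(col) / len(members) for col in zip(*members)]
    # n-by-k matrix of distances from every point to every final centroid.
    D = [[sq_dist(x, c) for c in C] for x in X]
    # k-vector of within-cluster sums of point-to-centroid distances.
    sumd = [sum(D[i][j] for i in range(len(X)) if idx[i] == j) for j in range(k)]
    return idx, C, sumd, D


# Made-up data: n = 4 points (rows), p = 2 variables (columns).
X = [[1, 1], [1, 2], [8, 8], [9, 8]]
idx, C, sumd, D = kmeans_sketch(X, 2)
print(idx)   # [0, 0, 1, 1]
print(C)     # [[1.0, 1.5], [8.5, 8.0]]
print(sumd)  # [0.5, 0.5]
print(len(D), len(D[0]))  # 4 2
```

Each row of X is one point (a vector), idx has one entry per row of X, and C has one row (again a vector) per cluster.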

Regards
DP

Thanks @Deepti_Prasad , very clear.

This is what I wanted to understand:

idx = kmeans(X,k) performs k-means clustering to partition the observations of the n-by-p data matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.

Regards.
Gus