In the k-means formula:

Why is the formula expressed in terms of vectors when the point is represented as a matrix? point = [x1, x2]

Sorry, I’m not sure what the issue (or, in a sense, the difference) is in representing a point as a vector?

Or, strictly speaking, [x1, x2] *is* a vector, not a matrix, because, at least in those terms, it has only one ‘dimension’ (a single row of entries).

Secondly, I believe it is more technically correct to refer to it as a vector, given the linear algebra methods we apply to it. At a minimum, a vector can be seen as a reference to a point.
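To make the distinction concrete, here is a minimal Python sketch. The values and the names `point`, `X`, and `squared_euclidean` are made up for illustration. A single point is a vector, a dataset is a matrix with one point per row, and a vector formula such as the squared Euclidean distance is applied to one row at a time.

```python
# A single point with p = 2 features is a vector (a 1-D sequence of numbers).
point = [1.0, 2.0]

# A dataset of n = 3 such points is an n-by-p matrix: one row per point.
X = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

n, p = len(X), len(X[0])

# Each row of X is itself a vector, so the vector formula applies row by row.
distances_to_point = [squared_euclidean(row, point) for row in X]
```

So the matrix is just a container for many vectors; the formula itself only ever sees one vector at a time.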

Hi @gmazzaglia

A good question. The screenshot you shared explains that K-means is a vector quantization method: an iterative process assigns each data point to a group, and the data points gradually become clustered based on similar features.

Also, a vector is **an array of numerical values that expresses the location of a point along several dimensions**.

`idx = kmeans(X,k)` performs *k*-means clustering to partition the observations of the *n*-by-*p* data matrix `X` into `k` clusters, and returns an *n*-by-1 vector (`idx`) containing cluster indices of each observation. Rows of `X` correspond to points and columns correspond to variables.

By default, `kmeans` uses the squared Euclidean distance metric and the *k*-means++ algorithm for cluster center initialization.

Cluster indices, returned as a numeric column vector. `idx` has as many rows as `X`, and each row indicates the cluster assignment of the corresponding observation.
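As a rough illustration of how a data matrix goes in and a vector of cluster indices comes out, here is a minimal Python sketch of plain Lloyd iterations. It is not MATLAB's `kmeans`: the function name `kmeans_sketch` is hypothetical, the centroids are seeded with the first `k` rows instead of *k*-means++, and the indices are 0-based, whereas MATLAB's `idx` is 1-based.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_sketch(X, k, max_iter=100):
    """Plain Lloyd iterations on an n-by-p list of rows (illustrative only)."""
    # Assumption: seed the centroids with the first k rows of X.
    C = [list(row) for row in X[:k]]
    idx = None
    for _ in range(max_iter):
        # Assignment step: each row gets the index of its nearest centroid.
        new_idx = [
            min(range(k), key=lambda j: squared_euclidean(row, C[j]))
            for row in X
        ]
        if new_idx == idx:
            break  # assignments stopped changing
        idx = new_idx
        # Update step: each centroid becomes the mean of its assigned rows.
        for j in range(k):
            members = [row for row, c in zip(X, idx) if c == j]
            if members:
                C[j] = [sum(col) / len(members) for col in zip(*members)]
    return idx, C

X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]  # n = 4 points, p = 2
idx, C = kmeans_sketch(X, 2)
```

Here `idx` has one entry per row of `X`, which mirrors the *n*-by-1 shape described above.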

`idx = kmeans(X,k,Name,Value)` returns the cluster indices with additional options specified by one or more `Name,Value` pair arguments.

For example, specify the cosine distance, the number of times to repeat the clustering using new initial values, or to use parallel computing.

`[idx,C] = kmeans(___)` returns the `k` cluster centroid locations in the `k`-by-*p* matrix `C`.

Cluster centroid locations, returned as a numeric matrix. `C` is a `k`-by-*p* matrix, where row *j* is the centroid of cluster *j*. The location of a centroid depends on the distance metric specified by the `Distance` name-value argument.
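To show why the centroid location depends on the distance metric, here is a small Python sketch with made-up points. As far as I know, with squared Euclidean distance the centroid is the mean of the points in the cluster, while with city block distance it is the component-wise median; the variable names below are only illustrative.

```python
from statistics import mean, median

# Three points (rows) that belong to the same cluster, p = 2 variables each.
cluster_points = [[1.0, 2.0], [2.0, 4.0], [9.0, 3.0]]

# zip(*rows) walks over the columns, so each centroid is again a p-vector.
centroid_mean = [mean(col) for col in zip(*cluster_points)]
centroid_median = [median(col) for col in zip(*cluster_points)]
```

The outlying first coordinate (9.0) pulls the mean-based centroid towards it, while the median-based one stays near the other two points.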

`[idx,C,sumd] = kmeans(___)` returns the within-cluster sums of point-to-centroid distances in the `k`-by-1 vector `sumd`.

Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. `sumd` is a `k`-by-1 vector, where element *j* is the sum of point-to-centroid distances within cluster *j*. By default, `kmeans` uses the squared Euclidean distance (see `'Distance'` metrics).
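A within-cluster sum can be sketched in a few lines of Python. The data, the 0-based cluster indices, and the centroids below are made-up example values, not output from MATLAB.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]  # n-by-p data
idx = [0, 0, 1, 1]                   # 0-based cluster index for each row of X
C = [[1.25, 1.9], [8.5, 8.75]]       # k-by-p centroids, one per row

k = len(C)
sumd = [0.0] * k
for row, j in zip(X, idx):
    # Add this point's distance to the total of its own cluster only.
    sumd[j] += squared_euclidean(row, C[j])
```

The result has one entry per cluster, so it has length `k` rather than `n`.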

`[idx,C,sumd,D] = kmeans(___)` returns distances from each point to every centroid in the *n*-by-`k` matrix `D`.

Distances from each point to every centroid, returned as a numeric matrix. `D` is an *n*-by-`k` matrix, where element (*j*, *m*) is the distance from observation *j* to centroid *m*. By default, `kmeans` uses the squared Euclidean distance (see `'Distance'` metrics).
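The point-to-centroid distance matrix can be illustrated the same way. This is a Python sketch with made-up values and 0-based indices, and `nearest` is a hypothetical name: each row of `D` belongs to one observation, each column to one centroid, and the position of the smallest entry in a row is that observation's cluster.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]  # n = 4 observations
C = [[1.25, 1.9], [8.5, 8.75]]                         # k = 2 centroids

# D[j][m] is the distance from observation j to centroid m (n-by-k).
D = [[squared_euclidean(row, c) for c in C] for row in X]

# The column holding the smallest value in each row is the nearest centroid.
nearest = [min(range(len(C)), key=lambda m: d_row[m]) for d_row in D]
```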

Regards

DP

Thanks @Deepti_Prasad , very clear.

This is what I wanted to understand:

idx = kmeans(X,k) performs k-means clustering to partition the observations of the **n-by-p data matrix X** into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.

Regards.

Gus