K-means Cluster Analysis

Basic Algorithm

The objective of this algorithm is to partition a data set S consisting of n-tuples of real numbers into k clusters C1, …, Ck in an efficient way. For each cluster Cj, one element cj, called its centroid, is chosen from that cluster.

Definition 1: The basic k-means clustering algorithm is defined as follows:

  • Step 1: Choose the number of clusters k
  • Step 2: Make an initial selection of k centroids
  • Step 3: Assign each data element to its nearest centroid (in this way, k clusters are formed, one for each centroid, where each cluster consists of all the data elements assigned to that centroid)
  • Step 4: For each cluster make a new selection of its centroid
  • Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)

There are various choices available for each step in the process.
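To make the procedure concrete, here is a minimal sketch of the basic algorithm in Python. It assumes the Euclidean distance and the mean-based centroid update described in the sections below, picks the initial centroids at random from the data set, and stops when the assignment no longer changes; the function name kmeans and these particular choices are illustrative, not part of the definition.

    import random

    def kmeans(data, k, max_iter=100):
        """Basic k-means: data is a list of equal-length tuples of numbers."""
        centroids = random.sample(data, k)        # Step 2: initial centroids chosen from the data
        assignment = None
        for _ in range(max_iter):
            # Step 3: assign each element to its nearest centroid (squared Euclidean distance)
            new_assignment = [
                min(range(k),
                    key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j])))
                for x in data
            ]
            if new_assignment == assignment:      # Step 5: stop when nothing changes
                break
            assignment = new_assignment
            # Step 4: new centroid of each cluster = mean of its members
            for j in range(k):
                members = [x for x, a in zip(data, assignment) if a == j]
                if members:                       # keep the old centroid if a cluster becomes empty
                    centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        return assignment, centroids

Clusters are numbered 0 to k-1 in this sketch; in the worksheets below they are numbered 1 to k.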

An alternative version of the algorithm is as follows:

  • Step 1: Choose the number of clusters k
  • Step 2: Make an initial assignment of the data elements to the k clusters
  • Step 3: For each cluster select its centroid
  • Step 4: Based on these centroids, make a new assignment of the data elements to the k clusters
  • Step 5: Go back to step 3, repeating the process until the centroids don’t change (or some other convergence criterion is met)
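A sketch of this alternative version, under the same assumptions as the sketch above; the alternating initial assignment mirrors the one used in Example 1 below and is just one possible choice.

    def kmeans_alt(data, k, max_iter=100):
        """Alternative version: start from an initial assignment of elements to clusters."""
        assignment = [i % k for i in range(len(data))]   # Step 2: alternate 0, 1, ..., k-1, 0, 1, ...
        centroids = None
        for _ in range(max_iter):
            # Step 3: centroid of each cluster = mean of its members (assumes no cluster is empty)
            centroids = []
            for j in range(k):
                members = [x for x, a in zip(data, assignment) if a == j]
                centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            # Step 4: reassign each element to its nearest centroid
            new_assignment = [
                min(range(k),
                    key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j])))
                for x in data
            ]
            if new_assignment == assignment:             # Step 5: stop when nothing changes
                break
            assignment = new_assignment
        return assignment, centroids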

Distance

There are a number of ways to define the distance between two n-tuples in the data set S, but we will focus on the Euclidean measure, namely, if x = (x1, …, xn) and y = (y1, …, yn) then the distance between x and y is defined by

    \text{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Since minimizing the distance is equivalent to minimizing the square of the distance, we will instead look at dist²(x, y) = (dist(x, y))². If there are k clusters C1, …, Ck with corresponding centroids c1, …, ck, then for each data element x in S, step 3 of the k-means algorithm consists of finding the value j which minimizes dist²(x, cj); i.e.

    j = \underset{1 \le i \le k}{\operatorname{argmin}} \; \text{dist}^2(x, c_i)
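In code, the squared distance and the nearest-centroid choice used in step 3 can be written as follows (a small sketch; dist2 and nearest_centroid are illustrative names):

    def dist2(x, y):
        """Squared Euclidean distance between two n-tuples x and y."""
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

    def nearest_centroid(x, centroids):
        """Index j that minimizes dist2(x, centroids[j])."""
        return min(range(len(centroids)), key=lambda j: dist2(x, centroids[j]))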

If we don’t require that the centroids belong to the data set S, then we typically define the new centroid cj for cluster Cj in step 4 to be the mean of all the elements in that cluster, i.e.

    c_j = \frac{1}{m_j} \sum_{x \in C_j} x

where mj is the number of data elements in Cj.
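For example, this mean-based centroid update can be written as the following sketch (which assumes the cluster is non-empty):

    def mean_centroid(cluster):
        """Coordinate-wise mean of a non-empty list of n-tuples."""
        m = len(cluster)
        return tuple(sum(coords) / m for coords in zip(*cluster))

    # e.g. mean_centroid([(1.0, 2.0), (3.0, 4.0)]) returns (2.0, 3.0)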

If we think of the squared distance between any data element in S and its nearest centroid as an error value (between a data element and its prototype), then across the whole system we are trying to minimize the sum of squared errors (SSE)

    \text{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \text{dist}^2(x, c_j)
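A sketch of this computation, assuming assignment[i] holds the (0-based) cluster index of data[i]:

    def sse(data, assignment, centroids):
        """Sum of squared distances from each data element to its cluster's centroid."""
        return sum(
            sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j]))
            for x, j in zip(data, assignment)
        )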

Initial Choice

There is no guarantee that we will find centroids c1, …, ck that minimize SSE, and a lot depends on our initial choices for the centroids in step 2.

Property 1: The best choice for the centroids c1, …, ck is the set of n-tuples consisting of the means of the clusters C1, …, Ck. By best choice, we mean the choice that minimizes SSE.

Click here for a proof of this property (using calculus).
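In outline (this is only a sketch of the calculus argument, not the full proof linked above), holding the clusters fixed and setting the derivative of SSE with respect to the centroid cj equal to zero gives

    \frac{\partial}{\partial c_j} \sum_{x \in C_j} \lVert x - c_j \rVert^2
        = -2 \sum_{x \in C_j} (x - c_j) = 0
    \quad\Longrightarrow\quad
    m_j c_j = \sum_{x \in C_j} x
    \quad\Longrightarrow\quad
    c_j = \frac{1}{m_j} \sum_{x \in C_j} x

so the mean of the cluster is the unique stationary point, and since SSE is a convex function of cj, this is the minimum.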

Example

Example 1: Apply the second version of the k-means clustering algorithm to the data in range B3:C13 of Figure 1 with k = 2.


Figure 1 – K-means cluster analysis (part 1)

The data consists of 10 data elements which can be viewed as two-dimensional points (see Figure 3 for a graphical representation). Since there are two clusters, we start by assigning the first element to cluster 1, the second to cluster 2, the third to cluster 1, etc. (step 2), as shown in range E3:E13.

We now set the centroid of each cluster to be the mean of all the elements in that cluster. The centroid of the first cluster is (2.6, 1.4), where the X value (in cell H4) is calculated by the formula =AVERAGEIF(E4:E13,1,B4:B13) and the Y value (in cell H5) is calculated by the worksheet formula =AVERAGEIF(E4:E13,1,C4:C13). The centroid for the second cluster, (3.2, 3.0), is calculated in a similar way.

We next calculate the squared distance from each of the ten data elements to each centroid. E.g., the squared distance from the first data element to the first centroid is 7.72 (cell L4), as calculated by =(B4-H4)^2+(C4-H5)^2 or, equivalently, =SUMXMY2($B4:$C4,H$4:H$5). Since the squared distance to the second centroid, 12.24 (cell M4), is higher, we see that the first data element is closer to the first centroid, and so we keep that point in cluster 1 (cell O4). Here cell K4 contains the formula =MIN(L4:M4) and cell O4 contains the formula =IF(L4<=M4,1,2).
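The same row-level arithmetic can be checked outside Excel. The following sketch mirrors the worksheet formulas for one row, using the two centroids computed above; the point p stands for the values in B4:C4, which are not reproduced here.

    def assign_row(p, c1=(2.6, 1.4), c2=(3.2, 3.0)):
        """Mirror of one worksheet row: squared distances to the two centroids (cells L4 and M4),
        their minimum (cell K4), and the chosen cluster (cell O4)."""
        d1 = sum((pi - ci) ** 2 for pi, ci in zip(p, c1))    # like =SUMXMY2($B4:$C4,H$4:H$5)
        d2 = sum((pi - ci) ** 2 for pi, ci in zip(p, c2))    # squared distance to the second centroid
        return min(d1, d2), (1 if d1 <= d2 else 2)           # like =MIN(L4:M4) and =IF(L4<=M4,1,2)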

Convergence

We proceed in this way to determine a new assignment of clusters for each of the 10 data elements, as shown in range O4:O13. The value of SSE for this assignment is 36.92 (cell H7). Since the original cluster assignment (range E4:E13) differs from the new cluster assignment, the algorithm has not yet converged, and so we continue: we simply copy the latest cluster assignment into range E16:E25 and repeat the same steps.

After four iterations we get convergence, as shown in Figure 2 (range E40:E49 contains the same values as O40:O49). The final assignment of data elements to clusters is shown in range E40:E49. We also see that SSE = 22.083 and note that each iteration of the algorithm has reduced the value of SSE.


Figure 2 – K-means cluster analysis (part 2)

Graph of Assignments

Figure 3 graphically shows the assignment of data elements to the two clusters. This chart is created by highlighting the range B40:C49 and selecting Insert > Charts|Scatter.


Figure 3 – Cluster Assignment
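If you want to reproduce a chart like Figure 3 outside Excel, a labeled scatter plot of the points gives the same picture. Here is a sketch using matplotlib; the points and cluster labels would come from ranges B40:C49 and E40:E49, which are not reproduced here.

    import matplotlib.pyplot as plt

    def plot_clusters(points, labels):
        """Scatter plot of 2-D points, each annotated with its cluster number."""
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        plt.scatter(xs, ys, c=labels)             # color the points by cluster
        for (x, y), lab in zip(points, labels):
            plt.annotate(str(lab), (x, y))        # add the cluster number next to each point
        plt.xlabel("X")
        plt.ylabel("Y")
        plt.title("Cluster Assignment")
        plt.show()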

You can add the labels (1 and 2) to the points on the chart shown in Figure 3 as follows. First, right-click on any of the points in the chart. Next, in the dialog box that appears, click on the Y Value option, as shown in Figure 4.


Figure 4 – Adding labels containing cluster assignment

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

Comments

  1. Hello.
    Thank you for providing Add-in program for free.
    I have a question about setting the numbers for k-means clustering. What does "Number of replications" stand for, and how can I set a proper value?
    Also, could you briefly explain "Weight range", please?

    Thank you.

    • Hello,
      Excellent questions and I apologize for not being clearer.
      I have now modified the webpage by adding two final sections called Initializing Clusters and Minkowski Distance. These sections will address the questions you have raised.
      Thank you very much for bringing this issue to my attention.
      Charles

  2. Sorry, I did not make myself clear. In this cluster analysis, the final groups (1 and 2) have the minimum possible variance within groups and the maximum possible variance between groups (as I understand it). I would like to perform a clustering in which the variance between the groups is the minimum.

    • When you speak about the variance between groups, are you referring to the pooled variance (as for the two independent sample t-test)?
      My first reaction is to randomly assign the data to the two clusters, but this may or may not be the correct way to minimize the variance.
      Charles

  3. Hi Charles,

    Is it possible to perform a clustering in which all the final groups have means that are as similar as possible on each variable (X and Y)?

    Thank you!

  4. Thank you for the great program.
    K-means cluster analysis, 220 lines of data, 4 clusters, 20 iterations: how long should the analysis take? I have been waiting for 10 minutes and I'm not sure whether Excel has crashed or not 🙂

    • Hello Vincent,
      I suggest that you try again with iterations set to 1 and see whether you have the same problem. If not, increase the number of iterations a bit to see what happens.
      Charles

      • Hi Charles, I let it run the whole night and I think it had crashed. Tried again with 100 iterations and 4 clusters and it worked. Tried with 20 clusters and 10 iterations and it worked too. Strange. Thank you for your reply!

  5. Hello Charles,
    thank you for this important resource.

    I am new to advanced statistics and I have a question:
    let’s say I have 3 variables and I want to split those instances into 6 clusters.
    Using your tool, the result gives me every x, y, z point labeled with the calculated cluster.
    Is there an easy way to know the boundaries of those clusters? I mean the range of x, y, z that characterizes each cluster?
    I don’t know if I’ve been clear enough.
    Thank you again.

  6. Hi Charles
    Can you explain to me how I can run a clustering algorithm with 3 centroids, and whether that is possible at all (I have 10 points, A to J, each with 2 coordinates x and y)?

    • This is explained on this webpage. You can perform this cluster analysis with 10 points, each with 2 coordinates, using 3 centroids.
      You can also use the Real Statistics software to perform most of the work for you.
      Charles

  7. Hello Charles,
    When you assign items to different groups or clusters, you then calculate the new centroids. How do you deal with a group that has been assigned no items at all?
    Apart from the trivial issue of division by zero, how do you choose new centroids?
    Many thanks!
    Fred

  8. Hi Charles,

    I just wanted to let you know that I found a small typo in your text.

    “Figure 3 graphically shows the assignment of data elements to the two clusters. This chart is created by highlighting the range B40:B49 and selecting Insert > Charts|Scatter.”

    I think you mean B40:C49. Please correct me if I am wrong.

    • Anders,
      Thanks for identifying this typo. I have now made the change that you have suggested.
      I really appreciate your help in improving the website. Thanks to you and others like you, the website is constantly reviewed and errors are identified and corrected.
      Charles

  9. Good day Dr. Charles,
    I'm using k-means clustering in WEKA and I verified the result using Excel. The results did not match. Is this possible? Please help.
    My data sample is this:

    Z  Grade1  Grade2  Grade3  Final  Remark
    1  78      85      82      80     Passed
    0  80      94      90      88     Passed
    etc.

    • By the way, if I omit the REMARK attribute in WEKA, the result is similar to Excel's, but if it is included in the data set, the result is not the same. Thank you in advance for your help.

  10. Dear Charles,
    What limitations are built in for using the k-means cluster calculation?
    I have a time series with 7707 elements (temperature, opening degree of a valve).
    I have been testing with 2, 3, 4 and 5 clusters and the calculations were working fine, but suddenly at 6 clusters I only got errors back from Excel!
    In the calculated cluster column it was #VALUE! and in the other tables it was #DIV/0!.

    Can you give me an indication of how to solve the problem?

    • Dear Thiery,
      The number of calculations grows very fast as the number of clusters increases. You should expect the calculations to start to slow down a lot at 6 clusters. I am surprised that an error was generated.
      If you send me an Excel file, I will try to figure out what is going on. You can find my email address at Contact Us.
      Charles

  11. Thank you so much.

    Could you please use the same example and show us how to calculate the
    matrix SSB and also SSW and the pseudo-F test?

    I would appreciate it very much.
    Sanja

  12. Please advise me. If I have a big data set, for example the iris data, and I would like to divide it into four clusters, how am I going to choose or calculate the centroids?

    • Dipak,
      The formula to calculate the SSE value in cell H7 is =SUM(K4:K13). The mathematical formula is given just before Property 1 on the referenced webpage.
      Charles

