Cluster Analysis

Given a data set S, there are many situations where we would like to partition the data set into subsets (called clusters) where the data elements in each cluster are more similar to other data elements in that cluster and less similar to data elements in other clusters. Here “similar” can mean many things. In biology, it might mean that the organisms are genetically similar. In marketing, it might mean that customers are in the same market segment.

Types of clustering

Clustering can also be hierarchical, where clustering is done at multiple levels. Here the data set is divided into clusters and these clusters are in turn further divided into more finely granular clusters. The biological classification system (kingdoms, phylum, class, order, family, group, genus, species) is an example of hierarchical clustering.

In this part of the website, we will describe a form of prototype clustering, called k-means clustering, where a prototype member of each cluster is identified (called a centroid) which somehow represents that cluster. The approach we take is that each data element belongs to the cluster whose centroid is nearest to it; i.e. which minimizes the distance between that data element and that cluster’s centroid.

Typically our data elements will be n-tuples. These can be thought of as points in n-space or as n-dimensional vectors. Each dimension can represent some characteristic of the data elements under consideration. Distance will typically be the Euclidean distance (or a weighted version of this distance) as described below. Thus data elements that most share these characteristics will belong to the same cluster.

We can also use k-means clustering for one-dimensional data. But we also describe a simpler, one-dimensional clustering algorithm called Jenks Natural Breaks.

Topics

Also see Gaussian Mixture Models.

References

Johnson, R. A., Wichern, D. W. (2007) Applied multivariate statistical analysis. 6th Ed. Pearson
https://www.webpages.uidaho.edu/~stevel/519/Applied%20Multivariate%20Statistical%20Analysis%20by%20Johnson%20and%20Wichern.pdf

Rencher, A.C., Christensen, W. F. (2012) Methods of multivariate analysis (3nd Ed). Wiley

27 thoughts on “Cluster Analysis”

  1. Hello Charles,
    Is it possible to include data mining methods such as decision trees/apriori in your awesome Add-in?
    We also need to know the optimal number of clusters
    Boris

    Reply
  2. Hi Charles,

    I was expecting the number of clusters to converge. It doesn’t seem to be the case for my dataset, i.e. the optimal number of clusters is the number of data rows.

    Is this always the case? How to determine the optimal number of clusters then?

    Reply
    • Hello Vincent,
      The optimal number of clusters probably depends on how you will use the information. How would you define the optimal number of clusters?
      Charles

      Reply
  3. Hi Charles
    Congratulations for developing this tool!

    Just wondering if there is simple way to scale data using your package before running k-means? If not could be future upgrade..?

    Reply
    • Hello Felipe,
      Real Statistics doesn’t have this support, but I will consider it in the future.
      If you know the correct scale parameter this is quite easy.
      Charles

      Reply
  4. Hi Charles,
    For finding the optimal value of ‘k’ based on elbow method [ using SST = SSW+SSB]; we can look at the graph of SSW vs k.
    Would you please share the formulas for computing SST, SSW & SSB? somewhere, I saw a formula like (n-1)*sum of squared differences between obs & global mean.
    I wonder what’s the significance of (n-1) here?

    Reply
  5. Hello, downloaded and installed your add in for excel. Wondering if you have or can provide links to any examples of others leveraging this tool for kmeans cluster analysis? I am a somewhat knew and thought a tutorial/video might help me leverage the technical capabilities of your add in. Thanks!

    Reply
    • Alicia,
      Sorry, but I don’t know of a tutorial/video for this. I plan to add some youtube videos on various subjects as soon as I finish the next software release and publish a book on statistics.
      Charles

      Reply
    • Ron,
      I believe that Ward clustering is a type of hierarchical clustering. I don’t support this as of yet. I plan to eventually add this type of support, but have no specific date.
      Charles

      Reply
  6. Hi,
    How can I find the main variable that that make the difference between the clusters.
    i.e., if I’ll ask someone only these question I’ll know to associate him to cluster .
    Thank you

    Reply
    • Sarah,
      Sorry, but I don’t understand your question. Are you asking “once I have built the model, “how can I assign the appropriate cluster to a new value?”
      Charles

      Reply
      • Hi,
        Yes, after I have the clusters based on 25 questions I want to assign new questionnaires to the clusters by using less questions – only the questions that make the main difference between the clusters.
        Thanx

        Reply
        • Sarah,
          Cluster analysis can be used to reduce the number of variables, not necessarily by the number of questions. I guess you can use cluster analysis to determine groupings of questions. You can then try to use this information to reduce the number of questions.
          Charles

          Reply
  7. Respected Sir,
    Can I fit a double term exponential model in excel? I wish to have the equation
    Y = a*exp(-bx)+c*exp(-dx).
    If not could you please suggest me some free software to fit the model?

    Reply
    • Sam,
      Sorry, but the current Mac version of the software doesn-t include the cluster analysis tool. THis is currently only available on the Windows version.
      Charles

      Reply
  8. Hi Charles

    Thanks so much for these resources!

    I am using the Excel for Mac 2011 plugin and I see that the Cluster Analysis option doesn’t appear on the dialogue box – the only one that is missing 🙁

    Reply
    • Llew,
      Yes, Cluster Analysis is not yet in the latest Mac release of the Real Statistics software, although it is in the Windows releases of the software. It will be part of the next Mac release of the software.
      Charles

      Reply
      • I have Excel 2013 and I installed all versions of real statistics (2003, 2007, 2013). I am on windows and neither of these had the “cluster analysis” option.

        Reply

Leave a Comment