Given a data set S, there are many situations where we would like to partition the data set into subsets (called clusters) where the data elements in each cluster are more similar to other data elements in that cluster and less similar to data elements in other clusters. Here “similar” can mean many things. In biology, it might mean that the organisms are genetically similar. In marketing, it might mean that customers are in the same market segment.
Clustering can also be hierarchical, where clustering is done at multiple levels. Here the data set is divided into clusters and these clusters are in turn further divided into more finely granular clusters. The biological classification system (kingdoms, phylum, class, order, family, group, genus, species) is an example of hierarchical clustering.
In this part of the website, we will describe a form of prototype clustering, called k-means clustering, where a prototype member of each cluster is identified (called a centroid) which somehow represents that cluster. The approach we take is that each data element belongs to the cluster whose centroid is nearest to it; i.e. which minimizes the distance between that data element and that cluster’s centroid.
Typically our data elements will be n-tuples. These can be thought of as points in n-space or as n-dimensional vectors. Each dimension can represent some characteristic of the data elements under consideration. Distance will typically be the Euclidean distance (or a weighted version of this distance) as described below. Thus data elements that most share these characteristics will belong to the same cluster.
We also describe a simpler, one-dimensional clustering algorithm called Jenks Natural Breaks.
Topics
- K-means clustering Basic Concepts
- Real Statistics Capabilities
- Choosing the initial clusters (k-means++ algorithm)
- Jenks Natural Breaks
References
Johnson, R. A., Wichern, D. W. (2007) Applied multivariate statistical analysis. 6th Ed. Pearson
https://www.webpages.uidaho.edu/~stevel/519/Applied%20Multivariate%20Statistical%20Analysis%20by%20Johnson%20and%20Wichern.pdf
Rencher, A.C., Christensen, W. F. (2012) Methods of multivariate analysis (3nd Ed). Wiley
Hello Charles,
Is it possible to include data mining methods such as decision trees/apriori in your awesome Add-in?
We also need to know the optimal number of clusters
Boris
Hello Boris,
Thanks for your kind words about the Real Statistics add-in.
I will add these items to the list of potential future enhancements. Regarding your last item, see
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
Charles
Hi Charles,
I was expecting the number of clusters to converge. It doesn’t seem to be the case for my dataset, i.e. the optimal number of clusters is the number of data rows.
Is this always the case? How to determine the optimal number of clusters then?
Hello Vincent,
The optimal number of clusters probably depends on how you will use the information. How would you define the optimal number of clusters?
Charles
Hi Charles
Congratulations for developing this tool!
Just wondering if there is simple way to scale data using your package before running k-means? If not could be future upgrade..?
Hello Felipe,
Real Statistics doesn’t have this support, but I will consider it in the future.
If you know the correct scale parameter this is quite easy.
Charles
Hi Charles,
For finding the optimal value of ‘k’ based on elbow method [ using SST = SSW+SSB]; we can look at the graph of SSW vs k.
Would you please share the formulas for computing SST, SSW & SSB? somewhere, I saw a formula like (n-1)*sum of squared differences between obs & global mean.
I wonder what’s the significance of (n-1) here?
I am familiar with SST, SSW and SSB for ANOVA, but I don’t know what these would mean for Cluster Analysis, although I have used SSE on the following webpage: https://real-statistics.com/multivariate-statistics/cluster-analysis/k-means-cluster-analysis/
Charles
Hello, downloaded and installed your add in for excel. Wondering if you have or can provide links to any examples of others leveraging this tool for kmeans cluster analysis? I am a somewhat knew and thought a tutorial/video might help me leverage the technical capabilities of your add in. Thanks!
Alicia,
Sorry, but I don’t know of a tutorial/video for this. I plan to add some youtube videos on various subjects as soon as I finish the next software release and publish a book on statistics.
Charles
Do you support the Ward clustering method?
Ron,
I believe that Ward clustering is a type of hierarchical clustering. I don’t support this as of yet. I plan to eventually add this type of support, but have no specific date.
Charles
Hi,
How can I find the main variable that that make the difference between the clusters.
i.e., if I’ll ask someone only these question I’ll know to associate him to cluster .
Thank you
Sarah,
Sorry, but I don’t understand your question. Are you asking “once I have built the model, “how can I assign the appropriate cluster to a new value?”
Charles
Hi,
Yes, after I have the clusters based on 25 questions I want to assign new questionnaires to the clusters by using less questions – only the questions that make the main difference between the clusters.
Thanx
Hi Charles,
Is there a problem with my question or with my clarification?
Thank you for your great tool and your help.
Sarah
Sarah,
Cluster analysis can be used to reduce the number of variables, not necessarily by the number of questions. I guess you can use cluster analysis to determine groupings of questions. You can then try to use this information to reduce the number of questions.
Charles
Respected Sir,
Can I fit a double term exponential model in excel? I wish to have the equation
Y = a*exp(-bx)+c*exp(-dx).
If not could you please suggest me some free software to fit the model?
Avijit,
You can use Excel’s Solver to find the parameters a, b, c and d which create a fit for your data. The approach is illustrated on the following webpage
Exponential Regression using Solver
Charles
hey I have excel 2016 for mac where can I find the cluster analysis tool?
Sam,
Sorry, but the current Mac version of the software doesn-t include the cluster analysis tool. THis is currently only available on the Windows version.
Charles
Hi Charles
Thanks so much for these resources!
I am using the Excel for Mac 2011 plugin and I see that the Cluster Analysis option doesn’t appear on the dialogue box – the only one that is missing 🙁
Llew,
Yes, Cluster Analysis is not yet in the latest Mac release of the Real Statistics software, although it is in the Windows releases of the software. It will be part of the next Mac release of the software.
Charles
I have Excel 2013 and I installed all versions of real statistics (2003, 2007, 2013). I am on windows and neither of these had the “cluster analysis” option.
First choose the Multivariate Analysis option. Cluster Analysis will appear on the dialog box that is displayed.
Charles