Real Statistics Multivariate GMM Support

This webpage describes the various worksheet functions and data analysis tools provided by the Real Statistics Resource Pack that support the creation and utilization of multivariate Gaussian Mixture Models (GMM) in Excel. Examples are also provided.

Worksheet functions

The Real Statistics Resource Pack provides the following worksheet functions to support GMM for data consisting of numeric vectors of dimension d > 1 (multivariate case) based on the sample in the n × d array R1.

MultiGMM(R1, k, lab, iter, prec, R0, kiter, nruns): outputs a GMM model which for each cluster 1, 2, …, k, estimates the multivariate normal distribution for that cluster (consisting of the mean vector and covariance matrix) as well as the probability that a data vector best fits that particular cluster.

If lab = TRUE a column of labels is appended to the output (default FALSE). iter = the maximum number of iterations of the EM algorithm (default 20). This algorithm also terminates if the difference in LL value from the previous iteration is less than prec. If prec = 0 (default), then iter iterations are made.

The GMM model is initialized with the cluster values in the × 1 column array R0 which only contains values 1, 2, …, k. If R0 is omitted then the cluster values from a K-means++ cluster algorithm on R1 with kiter iterations (default 10) and nruns runs (default 5) are used.

This function returns a (d+2)k+1 × d array (or one more column if lab = TRUE) with d+2 rows for each cluster (consisting of the weight, mean vector, and covariance matrix). The last row consists of the LL value and the # of iterations (if no convergence after iter iterations, then iter+1 is returned).

Cluster Assignments

MultiGMMClust(R1, R2, sorted): outputs an n+1 × d+k+1 array whose first d columns consists of the data in R1, and whose next k columns show the probability that each data vector in R1 belongs to each one of the k clusters based on the GMM model described in R2. Finally, the last column of the output shows the cluster with the highest probability for each data vector in R1. If sorted = TRUE then the output is sorted by the cluster number in the last column; otherwise (default) the order is the same as in R1. 

The last row of the output consists of the fraction of entries in R1 assigned to each cluster plus the total number of rows in R1.

Data Analysis Tool

The Real Statistics Resource Pack also provides a GMM Cluster Analysis data analysis tool which uses the above two worksheet functions. This tool is described in Example 1.

Example

Example 1: Create a GMM model with 3 components based on the 150 samples of flowers whose length and width are described in range A1:B151 of Figure 1 (only the first 12 of 150 samples is listed). This example is taken from the GeeksforGeeks reference listed below.

GMM data

Figure 1 – Sample data

We do this by using the above-mentioned data analysis tool. First, we press Ctrl-m to bring up the main menu, and select GMM Cluster Analysis from the Multivar tab. Next, we fill in the dialog box box that is displayed as shown in Figure 2.

GMM dialog box

Figure 2 – GMM Cluster Analysis dialog box

Since we have left the Initial Clusters field blank, the clusters will be initialized with the clusters determined by k-means clustering as described in Real Statistics Support for K-means Cluster Analysis.

If the Initial Clusters field is not blank, then it must contain a column range with the same number of rows as the Input Range not counting headings. Each entry must be an integer value between 1 and the value on the Number of Clusters field.

After pressing the OK button the output shown in Figures 3 and 4 are displayed.

Cluster Analysis Summary

A summary of the results is displayed in Figure 3. The values in range D3:F15 are calculated using the array formula =MultiGMM(A2:B151,3,TRUE,40).

GMM analysis part 1

Figure 3 – GMM Cluster Analysis (part 1)

This figure is broken into 4 parts. The first 3 parts represent the 3 clusters. The 4th part shows the log-likelihood LL value for the model and the number of iterations. Since 40 iterations were requested and no precision value was set, this is 41, 1 more than 40.

For example, the first cluster has mean vector (6.09685, 2.895404) and covariance matrix shown in range E5:F6. This is the highest weighted cluster with weight = .426598.

Cluster allocation for sample data

The data analysis tool also displays the cluster allocation for all 150 vectors in the sample, as shown in Figure 4 (only 17 of these are displayed in the figure). The values in range H3:M153 are calculated using the array formula =MultiGMMClust(A2:B151,D3:F15).

GMM cluster allocation

Figure 4 – GMM Cluster Analysis (part 2)

For example, the data vector (5.2, 2.7) in row 62 is allocated to cluster 2 since that is the cluster with the highest probability (72.8% vs. 21.4% for cluster 1 or 0.3% for cluster 3).

Of the 150 vectors in the sample, 19.33% have been allocated to cluster 1, 48% to cluster 2, and 32.67% to cluster 3 (see row 153).

Cluster allocation for new data

As described in GMM Example, the MultiGMMClust function can also be used for cluster allocation of data that is not in the sample.

An example of this is shown in Figure 5 where range R3:W4 contains the array formula =MultiGMMClust(O3:P4,D3:F15). Here, (5.1, 3.5) is in the sample, but (6.0, 3.0) is not in the sample. The first is allocated to cluster 2 and the second is allocated to cluster 3.

GMM cluster allocations

Figure 5 – GMM Cluster Allocations

Compound scatter plot

When using two-dimensional data, you can also create a scatter plot of the data, assigning each point to its cluster. For Example 1, this is done by pressing Ctrl-m and selecting the Compound Scatter Chart data analysis tool from the Desc tab. You then fill in the dialog box that appears as shown in Figure 6.

Compound Scatter Chart dialog

Figure 6 – Compound Scatter Chart dialog box

The Input Range is the range from Figure 1 and the Group Range is the cluster allocation range shown in Figure 4. After pressing the OK button, the chart shown in Figure 7 is displayed.

GMM scatter plot

Figure 7 – Compound Scatter Plot

This chart paints a clearer picture of the allocation of the sample points into 3 clusters (subpopulations).

Note that if the points in the chart are all bunched together, you can use the Change Axes Min/Max data analysis tool to resolve this (see Figure 7 of Confidence Ellipse, for example).

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

References

McGonagle, J. et al (2024) Gaussian mixture model
https://brilliant.org/wiki/gaussian-mixture-model/

Carrasco, O. C. and Whitfield, B. (2024) Gaussian mixture models explained
https://builtin.com/articles/gaussian-mixture-model

GeeksforGeeks (2023) Gaussian mixture model
https://www.geeksforgeeks.org/gaussian-mixture-model/

Apgar, V. (2023) 3 use-cases for Gaussian Mixture Models (GMM)
https://towardsdatascience.com/3-use-cases-for-gaussian-mixture-model-gmm-72951fcf8363

Leave a Comment