Real Statistics Resource Pack Release 9.3

I am pleased to announce Release 9.3 of the Real Statistics Resource Pack. The new release is now available for free download at Download Resource Pack for Excel 2010, 2013, 2016, 2019, 2021, and 365 Windows and Mac environments. 

In the next few days the website will be updated for compatibility with the new release.

I want to thank everyone who has made suggestions or has identified errors in the website or software. Your help has improved the utility and accuracy of Real Statistics.

I also want to express my appreciation to all of you who have donated to Real Statistics. These donations help to offset the costs of maintaining the website. If you are getting value from the Real Statistics website or software, I would appreciate your donation by going to Please Donate.

The following is an overview of the new features in Release 9.3. 

Gaussian mixture models (GMM)

GMM can be viewed as a model that represents normally distributed subpopulations where each population may have different parameters (mean and variance) and for each data value in the population we don’t know a priori which subpopulation that data element belongs to.

We can view GMM as a machine learning method that uses unsupervised learning. 

Data analysis tool

Rel 9.3 adds a GMM Cluster Analysis data analysis tool to the Multivar tab which uses GMM to identify k clusters (the subpopulations) that best fit data consisting of vectors of equal dimension.

Worksheet functions (multivariate models)

In addition, the following new worksheet functions have been added to support data consisting of vectors of dimension d > 1 (multivariate case) based on the sample in the n × d array R1.

MultiGMM(R1, k, …): outputs a GMM model which for each cluster 1, 2, …, k, estimates the multivariate normal distribution for that cluster (consisting of the mean vector and covariance matrix) as well as the probability that any data vector best fits into that particular cluster.

MultiGMMClust(R1, R2, sorted): outputs an n × d+k+1 array whose first d columns consists of the data in R1, and whose next k columns show the probability that each data vector in R1 belongs to each one of the k clusters based on the GMM model described in R2. Finally, the last column of the output shows the cluster with the highest probability for each data vector in R1. If sorted = TRUE then the output is sorted by the cluster number in the last column; otherwise (default) the order is the same as in R1.

The GMM model is built using a EM approach, and is initialized by user-supplied cluster assignments or via clusters assigned by K-means clustering.

Worksheet functions (univariate models)

In the case where the dimension d = 1, the new release provides univariate versions of the above two worksheet functions, namely UniGMM and UniGMMClust.

This version of the model provides additional initialization options.

Effect size for t test with unequal variances

The new release provides the new worksheet functions T_EFFECT3 and TT_EFFECT3 for calculating Cohen’s d* and Hedges g* effect sizes for a two independent sample t-test where the variances are not assumed to be equal.

Combined scatter plot

A new Combined Scatter Plot data analysis tool has been added to the Desc tab that displays multiple scatter plots on the same chart. The points from each plot have different colors.

This data analysis tool can be used with the existing K-means Cluster Analysis tool as well as the new GMM Cluster Analysis tool.

New count worksheet functions

The following new worksheet functions have been added where R1 is a column array or range.

COUNT_IF(R1, s) = # of times s appears in R1

COUNTS_IF(R1): returns an k × 2 array where the first column consists of all the unique values in R1 in sorted order and the second column consists of the values =COUNT_IF(R1, s) where s is the corresponding value in the first column

When R1 is a range, =COUNT_IF(R1,s) is equivalent to the standard Excel formula =COUNTIF(R1, s). COUNTIF requires that R1 be a range, whereas COUNT_IF does not.

New sum and average worksheet functions

The following new worksheet functions have been added where R1 is an m × 1 column array or range, and R2 is an m × d array or range.

SUM_IF(R1, s, R2): returns a 1 × d row array (or numerical element if d = 1), whose jth element is equal to the sum of all values in the jth column of R2 whose corresponding row in R1 contains s.

SUMS_IF(R1, R2): returns an k × d+1 array where the first column consists of all the unique values in R1 in sorted order and the next d columns consist of the values =SUM_IF(R1, s, R2) where s is the corresponding value in the first column.

AVG_IF(R1, s, R2): returns a 1 × d row array (or numerical element if d = 1), whose jth element is equal to the average of all values in the jth column of R2 whose corresponding row in R1 contains s.

AVGS_IF(R1, R2): returns an k × d+1 array where the first column consists of all the unique values in R1 in sorted order and the next d columns consist of the values =AVG_IF(R1, s, R2) where s is the corresponding value in the first column.

New covariance worksheet function

The following new worksheet function has been added where R1 is an m × 1 column array or range, and R2 is an m × d array or range.

COV_IF(R1, s, R2): returns a d × d row array with the sample covariance matrix for the data in the matrix consisting of the rows in R2 where the corresponding element in R1 is s.

Note that if d = 1, then the output from =COV_IF(R1, s, R2) is the variance of the data in the matrix consisting of the elements in R2 whose corresponding element in R1 is s.

Minor enhancements

  • The existing Cluster Analysis data analysis tool based on K++ means now also handles univariate data
  • The existing MNORMDIST worksheet function now also handles univariate normal distributions
  • A confidence interval for g effect size has now been added to the output from the existing T_EFFECT1,T_EFFECT1, T_EFFECT1, T_EFFECT1,T_EFFECT1, and TT_EFFECT3 worksheet functions.