I am pleased to announce Release 9.4 of the Real Statistics Resource Pack. The new release is now available for free download at Download Resource Pack for Excel 2010, 2013, 2016, 2019, 2021, and 365 in both Windows and Mac environments.
Over the next few days the website will be updated for compatibility with the new release (now done).
I want to thank everyone who made suggestions or identified errors in the website or software. Your help has improved the utility and accuracy of Real Statistics.
I also want to express my appreciation to all of you who have donated to Real Statistics. These donations help offset the costs of maintaining the website. If you are getting value from the Real Statistics website or software, I would appreciate your donation by going to Please Donate.
The following is an overview of the new features in Release 9.4.
Matching
To determine whether there is a significant difference between the effect on two groups (e.g. a treatment group and a control group), we randomly select a sample from the population being studied and then randomly assign subjects in the sample to the two groups. This is called an experimental study.
Sometimes it is impossible or unethical to randomly assign subjects to the two groups. When there is no random assignment of the sample into groups, we have an observational study. We then need to take confounding variables explicitly into account. Two approaches for doing this are coarsened exact matching (CEM) and propensity score matching (PSM): we identify all important confounding variables, match subjects in the sample that have similar confounding variable values, and prune the non-matched subjects.
Rel 9.4 provides support for these two approaches.
Propensity Score Matching (PSM)
Rel 9.4 adds a Propensity Score Matching data analysis tool.
This tool works by first performing logistic regression to determine the probability that each sample element belongs to the treatment group. This is done by using the formula =LogitPred(R1, R2, TRUE).
Here R1 is an array where each column contains sample data for one confounding variable. The data should be sorted so that the rows corresponding to the control group occur before the rows containing data from the treatment group. R2 is a column array containing 1 if the corresponding row in R1 comes from the treatment group and 0 if the corresponding row in R1 comes from the control group.
The output from this formula is a column array with the same number of rows as R1 (or R2) containing the probability that each row in R1 came from the treatment group based on the logistic regression model (whether or not it actually came from the treatment group).
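To see what this step does outside of Excel, here is a minimal Python sketch of computing propensity scores. The function name logit_pred and the plain gradient-descent fit are my own simplifications for illustration, not the LogitPred implementation:

```python
import math

def logit_pred(X, y, iters=5000, lr=0.1):
    """Fit a logistic regression of the treatment indicator y (0/1) on the
    confounders X by plain gradient descent, then return the predicted
    probability (propensity score) for every row.  A minimal stand-in for
    the LogitPred worksheet formula; real use would rely on a statistics
    library."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)              # intercept + one weight per confounder
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    scores = []
    for xi in X:
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores
```

As with the worksheet formula, the rows are assumed sorted with the control group first, and y marks treatment membership.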
PSM Matching
To support this data analysis tool, the following worksheet function is used:
PSM_Matching(R0, tindex, cutoff): R0 is an array of probabilities (i.e. the output from the logistic regression formula described above). tindex is the number of the first row in R0 that came from data in the treatment group.
PSM_Matching matches all elements in the treatment part of R0 (i.e. rows starting with tindex) with the nearest element in the control part of R0 (i.e. rows before tindex), provided that distance is not greater than cutoff.
PSM_Matching then returns a column array of zeros and ones with the same number of rows as R0 where 1 = match and 0 = no match.
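The matching step can be sketched in Python as follows. The release notes do not say whether a control element can be matched more than once, so matching without replacement (each control used at most once) is an assumption here, as is the greedy order:

```python
def psm_matching(probs, tindex, cutoff):
    """Sketch of PSM_Matching: greedily match each treatment propensity
    score (rows tindex onward, 0-based here rather than Excel's 1-based
    rows) to the nearest unmatched control score, provided the distance
    is within cutoff.  Returns a 0/1 flag per row: 1 = match, 0 = no
    match."""
    control = list(range(tindex))          # indices of control rows
    flags = [0] * len(probs)
    for t in range(tindex, len(probs)):
        best, best_d = None, None
        for c in control:
            d = abs(probs[t] - probs[c])
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= cutoff:
            flags[t] = flags[best] = 1
            control.remove(best)           # each control used at most once
    return flags
```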
Pruning
The output from PSM_Matching is now used to eliminate non-matched rows from the original data. This is done by using the following new worksheet function:
Pruning(R1, R2, b): R1 is an array containing data (for confounding variables and the outcome variable) and R2 is the output from PSM_Matching (column array consisting of zeros and ones).
The output consists of the data in R1 where any row with a corresponding value of zero in R2 is deleted. If b = TRUE (default FALSE) then the non-zero values in R2 are appended to the output.
Note that after pruning, the number of retained treatment samples equals the number of retained control samples.
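The pruning step itself is simple row filtering; a Python sketch (function name assumed from the worksheet function):

```python
def pruning(data, flags, b=False):
    """Sketch of Pruning: keep only the rows of data whose matching flag
    is 1; if b is True, append the retained (non-zero) flag values as an
    extra column, as the worksheet function does."""
    return [list(row) + [f] if b else list(row)
            for row, f in zip(data, flags) if f == 1]
```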
Matching Quality
The following new function is used to determine the efficacy of the PSM matching:
MatchQuality(R1, b): R1 is an array containing data for the treatment variable (with values 0 or 1), any confounding variables and the outcome variable. The first column of R1 contains the data for the treatment variable.
The output consists of statistics for the treatment and control groups, one column for each of the variables in R1 except for the treatment variable. The statistics reported are the mean, variance, min, 25th percentile, median, 75th percentile, and max.
In addition, when b = TRUE (the default), the following four statistics are reported for the residuals when comparing the treatment values with the control values for each of the variables in R1 (except for the treatment variable): the p-value of the Shapiro-Wilk test for normality, the p-value of a two independent sample t-test (assuming unequal variances), the p-value of a Mann-Whitney test, and the p-value of a Brunner-Munzel test.
If b = FALSE, then the last two rows are not output.
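The descriptive part of the output (mean, variance, min, quartiles, median, max per group and variable) can be sketched in Python. The exact percentile convention MatchQuality uses is not stated above, so simple linear interpolation is an assumption and results may differ slightly:

```python
def describe(values):
    """Descriptive statistics of the kind MatchQuality reports for one
    variable in one group: mean, sample variance, min, 25th percentile,
    median, 75th percentile, and max."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance

    def pct(p):
        # linear interpolation between order statistics (an assumption)
        h = (n - 1) * p
        lo = int(h)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

    return {"mean": mean, "var": var, "min": xs[0],
            "q1": pct(0.25), "median": pct(0.5), "q3": pct(0.75),
            "max": xs[-1]}
```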
Coarsened Exact Matching
Rel 9.4 adds a Coarsened Exact Matching data analysis tool.
This tool works by first creating bins (as for a histogram) for each of the confounding variables. This is done by using the following new worksheet function:
CEM_Coding(R1, R2): R1 is an array whose first column contains zeros and ones, where 1 corresponds to the treatment group and 0 corresponds to the control group. Each of the other columns contains sample data for one confounding variable. The data should be sorted so that the rows corresponding to the control group occur before the rows containing data from the treatment group. R2 is an array that contains codes that fix the bins.
The output contains the coding for all the values in R1, excluding the first column, based on the coding in R2. This output is then sorted eliminating duplicates, using the existing array function SortsRowsUnique, to obtain a set of patterns (rows in the output) called bin signatures.
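A Python sketch of the coarsening and bin-signature steps may help. The exact coding scheme that R2 specifies is not documented above, so fixed per-column bin edges are an assumption:

```python
def cem_coding(data, edges):
    """Sketch of the coarsening step: replace each confounder value by the
    index of the histogram bin it falls into, using per-column bin edges
    (an assumed stand-in for the codes in R2).  data excludes the
    treatment column."""
    return [[sum(1 for e in col_edges if v >= e)      # bin index per value
             for v, col_edges in zip(row, edges)]
            for row in data]

def bin_signatures(coded):
    """Unique coarsened rows, sorted - the analogue of the
    SortsRowsUnique step that produces the bin signatures."""
    return sorted({tuple(r) for r in coded})
```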
Rows in the array of coarsened data are now matched against the array of bin signatures using the following new worksheet function.
RowMatch(R1, R2): R1 is a numeric array (for CEM, this is the array of bin signatures). R2 is a numeric row array (for CEM this is a potential bin signature).
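The description above does not spell out RowMatch's return value; a plausible sketch in Python returns the position of the matching row, which is what the CEM workflow needs (the -1-for-no-match convention is my assumption):

```python
def row_match(rows, target):
    """Plausible sketch of RowMatch: return the (0-based) index of the
    first row of rows equal to target, or -1 if there is no match."""
    for i, r in enumerate(rows):
        if list(r) == list(target):
            return i
    return -1
```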
Weights are then calculated for each bin signature, and these are used to provide weights for each row of the original data R1, forming a column array R2. Rows in R1 with a zero weight are then pruned using the =Pruning(R1, R2, TRUE) formula, as described above. Finally, the quality of the matching is ascertained using the =MatchQuality(R0, FALSE) formula, where R0 is the output from pruning.
The number of control samples won’t usually be equal to the number of treatment samples. This difference is captured via the weights.
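The release notes do not give the weighting formula; the sketch below uses the standard CEM weighting scheme, which is an assumption about what the tool computes:

```python
from collections import Counter

def cem_weights(signatures, treat):
    """Standard CEM weighting (an assumption): treated units in matched
    strata get weight 1; a control unit in stratum s gets
    (mC/mT) * (mT_s/mC_s), where mT, mC count matched treated/control
    units overall and mT_s, mC_s count them in stratum s.  Units in
    strata lacking either group get weight 0 and are pruned."""
    t_count = Counter(s for s, t in zip(signatures, treat) if t == 1)
    c_count = Counter(s for s, t in zip(signatures, treat) if t == 0)
    matched = {s for s in t_count if s in c_count}
    mT = sum(t_count[s] for s in matched)
    mC = sum(c_count[s] for s in matched)
    weights = []
    for s, t in zip(signatures, treat):
        if s not in matched:
            weights.append(0.0)
        elif t == 1:
            weights.append(1.0)
        else:
            weights.append((mC / mT) * (t_count[s] / c_count[s]))
    return weights
```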
1 to 1 pairing of treatment and control elements
The Coarsened Exact Matching data analysis tool has a 1 to 1 pairing of treatment and control elements option. If this option is selected, then the number of matched treatment samples is equal to the number of matched control samples. When there is an imbalance for any bin signature, then samples from the larger group for that bin signature are removed. This is accomplished using the following worksheet function:
CEM_Pairing(R1, R2): R1 is a column array containing the bin signature numbers for the rows of coarsened data, and R2 is an array with two columns, where the first column contains the number of treatment data samples for the corresponding bin signature and the second column contains the number of control data samples for the corresponding bin signature.
The output is a column array with the same number of rows as the output from the CEM_Coding formula where a value of 0 corresponds to dropping this sample from the matched samples and 1 corresponds to keeping this sample.
In this case, the pruning formula uses FALSE as the third argument.
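The pairing logic can be sketched in Python. CEM_Pairing takes precomputed per-signature counts; the sketch derives them from a per-row treatment flag instead, and the choice of which excess rows to drop (the last ones encountered) is an assumption:

```python
from collections import Counter

def cem_pairing(sig_index, treat):
    """Sketch of the 1-to-1 pairing step: within each bin signature keep
    min(#treatment, #control) rows from each group, dropping the excess
    from the larger group.  Returns 0/1 keep flags, one per row."""
    t_cnt = Counter(s for s, t in zip(sig_index, treat) if t == 1)
    c_cnt = Counter(s for s, t in zip(sig_index, treat) if t == 0)
    quota = {s: {0: min(t_cnt.get(s, 0), c_cnt.get(s, 0)),
                 1: min(t_cnt.get(s, 0), c_cnt.get(s, 0))}
             for s in set(sig_index)}
    flags = []
    for s, t in zip(sig_index, treat):
        if quota[s][t] > 0:
            flags.append(1)
            quota[s][t] -= 1
        else:
            flags.append(0)
    return flags
```

After this step the kept treatment and control counts are equal, so the subsequent pruning needs no weight column, which is why FALSE is used as the third argument.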
Distribution Fitting Enhancements
Options for the following distributions have been added to the Distribution Fitting data analysis tool: Poisson, PERT, Triangular, Pareto, GEV, Cauchy, Binomial, and Negative Binomial.
An error in the calculation of the MLE for the Log Normal distribution has also been fixed.
The following worksheet functions have also been added to output the parameters that provide the best fit for the implied distribution using maximum likelihood estimation (MLE):
NORM_FIT(R1, lab, pure)
POISSON_FIT(R1, lab)
BINOM_FIT(R1, lab, ntrials)
The following worksheet function supports distribution fitting using the method of moments approach:
LOGNORM_FITM(R1, lab, pure)
Errors in the GEOM_FIT and LOGNORM_FIT worksheet functions have been fixed, and a revised version of the PARETO_FITM worksheet function has also been provided.
Order Statistic Enhancements
The following function provides an estimate of the variance of the kth order statistic for a sample of size n from a distribution specified by dist. It plays a similar role to ORDER_MEAN which calculates the mean of the kth order statistic.
ORDER_VAR(k, n, iter, dist, param1, param2, param3)
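The idea behind such a function can be illustrated by Monte Carlo simulation in Python. ORDER_VAR itself takes a distribution name and parameters; passing a sampler callable here is a simplification, and the use of simulation (suggested by the iter argument) is an assumption:

```python
import random
import statistics

def order_var(k, n, iters, sampler):
    """Monte Carlo sketch of the variance of the kth order statistic
    (1-based) in samples of size n, for a distribution given by a
    sampler function: draw many samples, sort each, record the kth
    smallest value, and return the sample variance of those draws."""
    draws = []
    for _ in range(iters):
        sample = sorted(sampler() for _ in range(n))
        draws.append(sample[k - 1])
    return statistics.variance(draws)
```

For a uniform(0, 1) population the kth order statistic follows a Beta(k, n - k + 1) distribution, so for example the median of three uniforms has variance 0.05, which the simulation approximates.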
The existing ORDER_INV worksheet function has been improved and made faster.
New Geometric Distribution Functions
The following worksheet function has been added, along with its representation in terms of standard Excel functions:
GEOM_DIST(x, p, cum) = NEGBINOM.DIST(x, 1, p, cum)
Also added is
GEOM_INV(p, pp) = NEGBINOM_INV(p, 1, pp)
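In this parameterization, x counts the failures before the first success, so the pmf is p(1-p)^x and the cdf is 1-(1-p)^(x+1). A Python sketch of both functions (the inverse as the smallest x whose cdf reaches the target probability, matching the usual discrete-inverse convention):

```python
def geom_dist(x, p, cum):
    """Geometric distribution on x = 0, 1, 2, ... failures before the
    first success, matching NEGBINOM.DIST(x, 1, p, cum):
    pmf = p*(1-p)^x, cdf = 1 - (1-p)^(x+1)."""
    if cum:
        return 1 - (1 - p) ** (x + 1)
    return p * (1 - p) ** x

def geom_inv(prob, p):
    """Smallest x with geom_dist(x, p, True) >= prob, mirroring
    GEOM_INV(prob, p)."""
    x = 0
    while geom_dist(x, p, True) < prob:
        x += 1
    return x
```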
New Reformatting Capabilities
The following worksheet function has been added:
DELCOL(R1, ncol): returns an array just like the R1 array but with the ncol-th column removed.
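In Python terms, DELCOL amounts to slicing out one column (the worksheet function's ncol is 1-based, as below):

```python
def delcol(rows, ncol):
    """Sketch of DELCOL: return a copy of the array with the ncol-th
    column (1-based, as in the worksheet function) removed."""
    return [row[:ncol - 1] + row[ncol:] for row in rows]
```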
The existing COUNTS_IF(R1), SUMS_IF(R1, R2), and AVGS_IF(R1, R2) can now be used when R1 contains non-numeric values.
Bug Fix
Fixed a bug in the POISSON_INV function that sometimes returned a value one unit too high.
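For context, a discrete inverse like POISSON_INV conventionally returns the smallest integer k whose cumulative probability reaches the target; stepping one term past that point produces exactly the off-by-one behavior described. A Python sketch of the correct computation (the summation approach is my illustration, not the Resource Pack's code):

```python
import math

def poisson_inv(p, mean):
    """Poisson inverse CDF: the smallest integer k with CDF(k) >= p,
    computed by accumulating pmf terms until the running CDF first
    reaches p."""
    term = math.exp(-mean)   # pmf at k = 0
    cdf = term
    k = 0
    while cdf < p:
        k += 1
        term *= mean / k     # pmf(k) = pmf(k-1) * mean / k
        cdf += term
    return k
```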
Dr, muchas gracias, many thanks. I am promoting a lot of videos using Real Statistics.
Thank you, Gerardo. I appreciate it.
Charles