Real Statistics Release 6.2

I am pleased to announce Release 6.2 of the Real Statistics Resource Pack. The new release is now available for free download at Download Resource Pack for Excel 2010, 2013, 2016 Windows users

The version for Excel 2013, 2019 and 365 Mac users will be available shortly.

The Basic, Distribution, Non-parameter and Time Series examples workbooks have also been revised for compatibility with the new release. These are available for free download at Download Examples Workbooks. In the next few days the Multivariate and Regression 1 examples workbooks will also be revised.

Over the course of the next few days, the website will be updated for compatibility with the new releases.

If you are getting value from the Real Statistics website or software, I would appreciate your donations to help offset the costs of the website by going to Please Donate.

The following is a review of the new feature in Release 6.2.

Follow-up Testing after Chi-square Test of Independence

After a significant result from the Chi-square independence test, you can perform follow-up tests to pinpoint the source of the significant result. The approaches used are somewhat similar to the use of contrasts after ANOVA. The following array function is now available to assist in post-hoc testing.

POST_CHISQ(R1, srow, scol, lab): returns a column array with the chi-square statistic, p-value and Cramer’s V for the post-hoc test on the contingency table in R1 based on the arguments srow and scol. If lab = TRUE (default FALSE) then a column of labels is appended to the output.

Here, R1 defines a contingency table without headings or totals and srows and scols describe which rows and columns to delete from R1 or combine. E.g. the array formula =POST_CHISQ(A1:C4, “1,2”,”-3″) performs a chi-square test of independence on the 4 × 3 contingency table in A1:C4 with the third column dropped and the first two rows merged (resulting in a 3 × 2 contingency table).

The array formula =POST_CHISQ(A1:C4, “-1,3,4”) performs a chi-square test on the contingency table in A1:C4 with the first row dropped and the last two rows merged (resulting in a 2 × 3 contingency table).

POST_CHIMAX(R1, srow, scol, lab): performs the same test as POST_CHISQ except that now the maximum likelihood version of the chi-square test is used instead of the usual chi-square test.

COMPACT_TABLE(R1, srow, scol, head): outputs an array containing the contingency table that results from the contingency table in R1 based on srow and scol, as described above. If head = TRUE (default) then it is assumed that R1 includes row and column headings and these are also used in the output.

Another approach to post-hoc testing is to determine which cells are playing the biggest and smallest role in the independence test. This is done by calculating the standard residuals of each cell (similar to a z-score). Cells which have a standard residual whose absolute value is larger than 1.96 can be viewed as significant (for alpha = .05). A similar, but more refined approach uses adjusted residuals.

StdRes(R1, head): returns an array of the same size and shape as the contingency table in R1; each cell in the output contains the standard residual for the corresponding cell in R1; If head = TRUE (default FALSE), then both R1 and the output contain row and column headings.

AdjRes(R1, head): just like StdRes except that the adjusted residuals are returned instead of the standard residuals.

StdResTest(R1, head): just like StdRes except that the p-values for the standard residuals are returned instead of the standard residuals.

AdjResTest(R1, head): just like AdjRes except that the p-values for the adjusted residuals are returned instead of the adjusted residuals.

M-estimators

As we have seen the median is a more robust measure of central tendency than the mean since it is less vulnerable to outliers. In this release, we add support for two other robust estimators: Tukey’s biweight estimator and Huber’s estimator.

BIWEIGHT(R1, iter, prec) = Tukey’s biweight estimator for the data in R1 based on a maximum of iter iterations with precision prec

HUBER(R1, iter, prec) = Huber’s estimator for the data in R1 based on a maximum of iter iterations with precision prec

The default for iter is 50 and the default for prec is 0.00000001.

Spline Fitting and Interpolation

The Real Statistics Resource Pack supports three forms of interpolation: linear, harmonic and log. With this release spline interpolation is also supported. This is a curve fitting capability that can also be useful when there is missing  data, especially in time series.

Specifically, the following new array function is now available. Here, Rx and Ry are column arrays with the same number of elements that specify (x, y) values. The objective is to fit a smooth curve that passes through the (x, y) values in Rx and Ry. Rx0 is a column array (or cell or scalar) and we seek the corresponding y value for every x value in the column array Rx0.

SPLINE(Rx0, Rx, Ry): returns a column array with the same number of elements as Rx0; for each x value in Rx0, the output array from this function contains the y value corresponding to x such that (x, y) lies on the spline curve.

Handling Missing Time Series Data

The following new array function is now available for imputing missing data in a time series. Here, R1 is a column array containing a time series; any non-numeric elements in R1, including blanks, are considered to represent a missing data element.

TSImputed(R1, itype, k, stype, per): returns a time series,in the form of a column array of the same size as R1 where any missing elements are replaced by an imputed value.

itype = the numeric label or text label corresponding to one of the seven imputation types listed below (default is 0 for the linear interpolation type).

Numeric label Text label Imputation type
0 linear linear interpolation
1 spline spline interpolation
2 prior use prior value
3 next use next value
-1 sma simple moving average
-2 wma weighted moving average
-3 ema exponential moving average

k (default 2) is an argument that is used with the sma, wma and ema options and specifies the number of elements from R1 used to impute a missing value. per is the seasonality period (default 4). stype is the seasonality type with values given in the following table (default 0 for none).

Numeric label Text label Seasonality type
0 none no seasonality
1 seas seasonal imputation
2 split split seasons

The Two-way MANOVA

A new Two-way MANOVA data analysis tool has been added to support MANOVA and Repeated Measures MANOVA with two factors. This can be considered to be an extension of the existing one-way MANOVA data analysis tool to support two factors and their interaction or as an extension to the existing two-way ANOVA data analysis tool to support multiple dependent variables.

In addition to the new data analysis tool, the following array functions have been added. In each of the following, R1 is a cell range containing the data being analysed. The first column contains identifiers for one of the factors, the second column contains identifiers for the other factor and the remaining columns contain the values for the dependent variables. Note that R1 uses the same format as for one-way MANOVA except that there are now two columns for the factor identifiers instead of one.

Extract2Cov(R1, s1, s2) = covariance matrix for all the elements in R1 that match s1 and s2 in the first two columns; if s1 = “” then the pooled covariance matrix is returned.

COV2Pooled(R1) = pooled covariance matrix based on the data in R1.

Extract2Rows(R1, s1, s2, head) = an array containing all the rows in R1 that match s1 and s2 in the first two columns; if head = TRUE (default) then it is assumed that the first row in R1 contains column headings (which are included in the output)

MANOVA2_Tot(R1) = the Total (T) matrix based on the data in R1

MANOVA2_Res(R1) = the Residual (E) matrix (aka error matrix or within subjects matrix) based on the data in R1

MANOVA2_Row(R1) = the H (hypothesis) matrix for the first factor (which we call the row factor or factor A) based on the data in R1

MANOV2_Col(R1) = the H matrix for the second factor (which we call the column factor or factor B) based on the data in R1

MANOVA2_Int(R1) = the H matrix for the interaction between the two factors (which we call the interaction factor or factor AB) based on the data in R1

In addition, there are the following new array functions which return a 6 × 1 column array containing the following values: the appropriate test statistic (as described below), df1, df2, F-stat, p-value and partial eta-square effect size. If lab = TRUE (default FALSE) then an extra column of labels is appended to the output.

    • MANOVA2_WilksRow(R1, lab): uses Wilks test statistic on the column 1 factor
    • MANOVA2_WilksCol(R1, lab): uses Wilks test statistic on the column 2 factor
    • MANOVA2_WilksInt(R1, lab): uses Wilks test statistic on the interaction between the two factors
    • MANOVA2_PillaiRow(R1, lab): uses Pillai test statistic on the column 1 factor
    • MANOVA2_PillaiCol(R1, lab): uses Pillai test statistic on the column 2 factor
    • MANOVA2_PillaiInt(R1, lab): uses Pillai test statistic on the interaction between the two factors
    • MANOVA2_HotelRow(R1, lab): uses Hotelling test statistic on the column 1 factor
    • MANOVA2_HotelCol(R1, lab): uses Hotelling test statistic on the column 2 factor
    • MANOVA2_HotelInt(R1, lab): uses Hotelling test statistic on the interaction between the two factors

The following new array function returns a 5 × 1 column array containing the following values: M-statistic df1, df2, F-stat and p-value. If lab = TRUE (default FALSE) then an extra column of labels is appended to the output.

MANOVA2_BOX(R1, lab): performs a Box M test for equality of the covariance matrices that correspond to the interactions between the two factors in R1.

There are the following non-array functions. If b = TRUE (default) the largest eigenvalue, lambda, for the appropriate H covariance matrix is returned; otherwise 1/(1+lambda) is returned.

    • MANOVA2_RoyRow(R1, b) = Roy Root for the column 1 factor
    • MANOVA2_RoyCol(R1, b) = Roy Root for the column 2 factor
    • MANOVA2_RoyInt(R1, b) = Roy Root for the interaction factor

Except for Extract2Rows, it is assumed that R1 does not contain column headings for any of these functions.

Multivariate Outliers

The existing MOUTLIERS array function, which calculates the Mahalanobis distance squared and p-value for all the vectors in a multivariate sample, has been enhanced with the addition of a second argument. The function is now defined as follows:

MOUTLIERS(R1, alpha): returns all the Mahalanobis distance squared and p-value only for those vectors in sample R1 with p-value < alpha (presumed outliers); if alpha = 0 (default) then the output contains the Mahalanobis distance squared and p-value for all vectors in R1.

Gini Coefficient

The following new functions have been added to calculate the Gini coefficient:

GINI(R1) = the Gini coefficient for the data in R1

GiniTest(R1, lab, iter, alpha): returns a column array with the following elements: Gini coefficient for the data in R1, as for GINI(R1), the mean Gini coefficient based on iter many bootstraps (default 2000), the standard error of these bootstraps as well as the lower and upper limits of a 1–alpha confidence interval based on the bootstraps (alpha defaults to .05).

There is also an unbiased version of the Gini test function, defined as follows. For this function the 1–alpha confidence interval is not symmetric around the estimate of the population Gini coefficient. As a result the end points of the confidence interval are not at 1–alpha/2 and 1+alpha/2. Instead these end point percentages are specified in the output.

GiniUnbiased(R1, lab, iter, alpha): returns a column array with the same elements as GiniTest except that the percentages corresponding to the end points of the confidence interval are included in the output.

If lab = TRUE (default FALSE) then an extra column of labels is appended to the output of the two array functions described above.

SARIMA data analysis tool

The SARIMA data analysis tool has been enhanced with the inclusion of standard errors t statistics and p-values for the coefficients. These new values are based on the following new functions:

SARIMA_COEFF(R1, ar, ma, diff, per, sar, sma, sdiff, con, lab): returns an array with two columns, the first column of which contains the SARIMA coefficients (in the order constant term, phi coefficients, theta coefficients, Phi coefficients, Theta coefficients) and the second column contains the corresponding standard errors. If lab = TRUE (default FALSE) then a column of labels is appended to the output.

SARIMA_PARAM(R1, ar, ma, diff, per, sar, sma, sdiff, con): returns an array with four columns, the first column of which contains the SARIMA coefficients (in the order constant term, phi coefficients, theta coefficients, Phi coefficients, Theta coefficients) and the remaining columns contain the corresponding standard errors, t statistics and p-values.

Here, the parameters are ar = p, ma = q, diff = d, per = m, sar = P, sma = Q, sdiff = D for a (p, d, q) × (P, D, Q)m SARIMA model. con = TRUE (default) if a constant term is included in the model.

Big thanks to Milos Cipovic who developed this enhancement.

ARIMAX

A new ARIMAX data analysis tool has been added to provide ARIMA with exogenous variables support.

New statistical power and minimum sample size capabilities

The Statistical Power and Sample Size data analysis tool has been updated to support the one and paired sample Hotelling’s T-square test, the two independent sample Hotelling’s T-square test, as well as one-way MANOVA.

In addition, the following six functions have been added:

Hotel1_POWER(d, n, k, ttype, alpha, iter, prec) = the statistical power for a one-sample or paired Hotelling’s T-square test where the sample size is n, the number of dependent variables is k and the effect size is d, where d = the Mahalanobis distance if ttype = 1 and d = the noncentrality parameter if ttype = 0.

Hotel2_POWER(d, n1, n2, k, ttype, alpha, iter, prec) = the statistical power for a two-sample Hotelling’s T-square test where the sample sizes are n1 and n2 and the other arguments are as for Hotel1_POWER.

MANOVA_POWER(f n, k, g, ttype, alpha, iter, prec) = the statistical power for one-way MANOVA where the sample size is n, the number of dependent variables is k , the number of groups is g and the effect size is f, where f = the partial eta-square effect size if ttype = 1, f = eta-square if ttype = 2 and f = Pillai’s V if ttype = 3.

Hotel1_SIZE(d, k, pow, alpha, iter, prec) = the minimum sample size required to obtain the statistical power pow for a one-sample or paired Hotelling’s T-square test where the number of dependent variables is k and the Mahalanobis effect size is d.

Hotel2_SIZE(d, k, pow, alpha, iter, prec) = the minimum size for sample 1 required to obtain statistical power of pow for a two-sample Hotelling’s T-square test where the number of dependent variables is k and the Mahalanobis effect size is d. nratio = the size of sample 2 divided by the size of sample 1 (default 1); if nratio is negative then the size of sample 2 is assumed to be –nratio.

MANOVA_SIZE(f, k, g, pow, ttype, alpha, iter, prec) = the minimum sample size to obtain statistical power of pow for one-way MANOVA where f, k, g and ttype are as for MANOVA_POWER.

alpha is the significance level (default .05), iter = the maximum number of iterations used in calculating the answer (default 1000) up to a precision of prec (default 0.000000001), the default for pow is .80.

In addition there is the following function that calculates the statistical power of Hotelling’s T-square test when the raw data is known. Here, R1 and R2 contain the data and d = 0 for the one sample test, d = 1 for the paired sample test, d = 2 for the two-sample test with equal covariance matrices and d = 3 for the two-sample test with unequal covariance matrices. Thus, R1, R2 and d are as for the Hotelling function.

Hotelling_POWER(R1, R2, d, alpha, iter, prec) = statistical power for the appropriate Hotelling’s T-square test.

Although not yet included in the Power and Sample Size data analysis tool, the following functions calculate the power and sample size for two-way MANOVA.

MANOVA2Row_POWER, MANOVA2Col_POWER, MANOVA2Int_POWER, MANOVA2Row_SIZE,                MANOVA2Col_SIZE, MANOVA2Int_SIZE

These functions take the same arguments as MANOVA_POWER and MANOVA_SIZE, except that there are now two group arguments, g1 (# of groups for the row factor) and g2 (# of groups for the column factor).

Multiple Regression Enhancements

The following functions have been added:

RegCV(Rx, Ry, con) = cross-validation for the OLS regression based on the X data in Rx and Y data in Ry where con = TRUE (default) if the regression includes a constant term.

PredRSquare(Rx, Ry, con) = predictive R-square for the OLS regression based on the X data in Rx and Y data in Ry where con = TRUE (default) if the regression includes a constant term.

Minor Enhancements

  • The BOX function, which is used to conduct Box’s M test for the equality of covariance matrices, has been enhanced by optionally adding a column of labels to the output. The function now takes the form BOX(R1, k, lab).
  • The partial eta-square effect size has now been added to the output from the following functions: MANOVA_Wilks, MANOVA_Pillai and MANOVA_Hotel.

In addition:

  • The QSORT function has been enhanced with the addition of two arguments. The function now takes the form QSORT(R1, ascend, nrows, ncols). This function sorts the data elements in R1 and outputs an nrows × ncols array with these sorted values. If nrows = 0 (the default), then the output fills the highlighted array (as was the case previously), while if nrows < 0 then the output is the same size and shape as R1.
  • The same sort of enhancement is made for the following functions: RANDOMIZE(R1, nrows, ncols), SHUFFLE(R1, filler, nrows, ncols) and REVERSE(R1, filler, nrows, ncols).

Bug Fixes

  • Fixed a bug in the BOX, BOXM, BOXF and BOXTest functions when the second argument, k, is zero. This bug caused the incorrect value to be returned. Note that the case where the k argument was specified did return the correct value.
  • Fixed an error whereby an erroneous ordered chi-square test was reported when the Ordered rows and columns option was not selected in the Chi-square Test of Independence data analysis tool.
  • Corrected the output from the COV_Pooled array function when the covariance matrix was derived from a sample with groups of different sizes.