I am pleased to announce Release 7.5 of the Real Statistics Resource Pack. The new release is now available for free download at Download Resource Pack for Excel 2010, 2013, 2016, 2019 and 365 in Windows environments. I expect to release the Mac version by tomorrow.
I recognize that I haven’t yet finished updating the Real Statistics website and examples workbooks for compatibility with the previous release. Over the course of the next few days, the website and examples workbooks will be updated for compatibility with both the 7.4 and 7.5 releases.
If you are getting value from the Real Statistics website or software, I would appreciate your donation to help offset the costs of the website by going to Please Donate.
The following is an overview of the new features in Release 7.5:
Lp norm enhancements
In Release 7.4, Real Statistics added support for the Minkowski distance and Lp estimates of central tendency. This capability has been extended to weighted versions of Minkowski distance and Lp estimates of central tendency. In particular, the LpEST and LpNORM functions now take the following form:
LpEST(R1, p, Rw, iter, guess, check) = Lp,W(X) where X = the data in the column array R1 and W = the weights in the column array Rw.
Rw has the same number of rows as R1. If Rw is omitted, then the unweighted Lp estimate is computed.
Here, 1 ≤ p ≤ 2, iter = the number of iterations (default 100) used to find the value of y that minimizes ||X–y||p (or the weighted version if Rw is not omitted) and guess is an optional initial guess of this value. If check = TRUE (default FALSE), then this function returns a 2 × 1 array whose second value should be near zero as a check that the iteration has converged.
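The idea behind LpEST can be sketched in Python. This is an illustration only (the add-in's own iteration may differ): a standard way to minimize the weighted sum Σ wᵢ|xᵢ − y|ᵖ for 1 ≤ p ≤ 2 is iteratively reweighted averaging, and the function name `lp_est` is made up for the example.

```python
# Sketch of a weighted Lp estimate of central tendency (1 <= p <= 2),
# illustrating what LpEST computes; iteratively reweighted averaging,
# not necessarily the add-in's own algorithm.

def lp_est(x, p, w=None, iters=100, eps=1e-12):
    """Return y minimizing sum(w_i * |x_i - y|**p)."""
    if w is None:
        w = [1.0] * len(x)                  # unweighted case (Rw omitted)
    y = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)  # start at weighted mean
    for _ in range(iters):
        # IRLS weights w_i * |x_i - y|^(p-2); eps guards division by zero
        u = [wi * max(abs(xi - y), eps) ** (p - 2) for wi, xi in zip(w, x)]
        y_new = sum(ui * xi for ui, xi in zip(u, x)) / sum(u)
        if abs(y_new - y) < 1e-10:          # converged
            break
        y = y_new
    return y

data = [1.0, 2.0, 3.0, 10.0]
print(lp_est(data, 2))   # p = 2 gives the (weighted) mean: 4.0
print(lp_est(data, 1))   # p = 1 approaches the median
```

Note that at p = 2 the iteration stabilizes immediately at the weighted mean, while at p = 1 it settles near the weighted median, which matches the description of Lp estimates as interpolating between those two statistics.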
LpNORM(R1, R2, p, Rw) = the weighted Lp (Minkowski) distance between the vectors in the column arrays R1 and R2 based on the weights in the column array Rw
If Rw is omitted, the unweighted distance is computed. If R2 is a scalar then R2 is treated as a column array all of whose values are this scalar.
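The distance that LpNORM computes can be written out directly; here is a minimal Python sketch (the function name `lp_norm` is illustrative), including the scalar-R2 and omitted-Rw conventions just described:

```python
# Sketch of the weighted Minkowski (Lp) distance computed by LpNORM:
# d(x, y) = (sum_i w_i * |x_i - y_i|**p) ** (1/p)

def lp_norm(x, y, p=2, w=None):
    if isinstance(y, (int, float)):     # scalar R2: treat as a constant vector
        y = [float(y)] * len(x)
    if w is None:                       # Rw omitted: unweighted distance
        w = [1.0] * len(x)
    return sum(wi * abs(xi - yi) ** p for wi, xi, yi in zip(w, x, y)) ** (1 / p)

print(lp_norm([3.0, 4.0], [0.0, 0.0]))      # Euclidean (p = 2): 5.0
print(lp_norm([3.0, 4.0], 0, p=1))          # L1 with scalar second argument: 7.0
```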
Weighted mean and median
The MED function has been added to support the weighted median and the MEAN function has been enhanced to support the weighted mean. These functions take the form:
MEAN(R1, Rw) = weighted mean of the data in array R1 based on the weights in Rw
MED(R1, Rw) = weighted median of the data in array R1 based on the weights in Rw
If Rw is omitted, then the unweighted statistic is returned; otherwise, Rw must have the same size and shape as R1.
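As a quick illustration of these two statistics, here is a Python sketch. The weighted median below is taken as the first data point at which the cumulative weight reaches half the total weight; the add-in's tie-handling may differ, and the function names are made up for the example.

```python
# Sketch of the weighted mean and weighted median, as in MEAN(R1, Rw)
# and MED(R1, Rw). Illustrative only, not the add-in's internals.

def weighted_mean(x, w):
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def weighted_median(x, w):
    pairs = sorted(zip(x, w))           # sort the data, carrying weights along
    half = sum(w) / 2.0
    cum = 0.0
    for xi, wi in pairs:
        cum += wi
        if cum >= half:                 # first point reaching half the weight
            return xi

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 5.0]
print(weighted_mean(x, w))    # 3.25
print(weighted_median(x, w))  # 4.0: the last point carries most of the weight
```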
Cluster Analysis enhancements
The existing cluster functions are based on Euclidean distance, i.e. the Minkowski distance where p = 2. In this release, Minkowski distances where p is not necessarily 2 are also supported, as are weighted distances. For data that consist of n-tuples, a fixed column array of weights of size n is used.
The following are the new versions of the cluster analysis functions where p defaults to 2 and Rw is an optional n × 1 column array of weights.
CLUST(R1, k, R2, p, Rw, iter) = m × 1 column array of cluster numbers 1, 2, …, k calculated by the k-means algorithm for the data consisting of n-tuples in the m × n array R1 where R2 is an m × 1 column array containing the initial cluster number assignments. If R2 is omitted then the k-means++ algorithm is used to calculate the initial cluster number assignments. iter = the maximum number of iterations of the algorithm that are performed (default 200).
ClustAnal(R1, k, nreps, p, Rw, iter) = the m × 1 column array of cluster numbers produced by CLUST(R1, k, R2, p, Rw, iter) with the lowest error after nreps repetitions of the k-means++ algorithm. If nreps = 0 then the m × 1 column array of initial cluster numbers based on the k-means++ algorithm is returned.
CLUSTERS(R1, R2, p, Rw) = m × 1 column array of cluster numbers 1, 2, …, k corresponding to the centroids described in the n × k array R2 which are closest (based on the weighted Minkowski distance defined by p and Rw) to the respective data element in the m × n range R1.
The following new non-array functions have also been added:
CLUSTErr(R1, R2, k, p, Rw) = error statistic for the data in the m × n array R1 based on the cluster assignment in the m × 1 column array R2 based on k clusters 1, 2, …, k. If k = 0 or is omitted then k is set to the largest value in R2.
CENTROIDErr(R1, R2, p, Rw) = error statistic for the data in the m × n array R1 based on the centroids specified in the n × k array R2
CLUST_Converge(R1, R2, p, Rw) = TRUE if the m × 1 column array R2 of cluster numbers 1, 2, …, k calculated by the k-means algorithm for the m n-tuple data elements in the m × n range R1 has converged (i.e. one additional iteration of the algorithm does not result in any changes to the cluster assignments).
The CLUSTErr and CENTROIDErr functions replace the existing SSE_CLUSTERS and SSE_CENTROIDS functions. The error statistic is the sum of the pth powers of the Minkowski distances between each of the data elements and its closest centroid. When p = 2 this is the SSE statistic previously used.
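The assignment step behind CLUSTERS and the error statistic just described can be sketched together in Python. This is an illustration of the technique, not the add-in's code, and the function names are made up:

```python
# Sketch of nearest-centroid assignment (as in CLUSTERS) plus the error
# statistic: the sum of the p-th powers of the weighted Minkowski distances
# from each point to its closest centroid (the SSE when p = 2).

def minkowski_p(x, c, p, w):
    # p-th power of the weighted Minkowski distance (no final root needed)
    return sum(wi * abs(xi - ci) ** p for wi, xi, ci in zip(w, x, c))

def assign_and_error(data, centroids, p=2, w=None):
    if w is None:
        w = [1.0] * len(data[0])
    labels, err = [], 0.0
    for x in data:
        dists = [minkowski_p(x, c, p, w) for c in centroids]
        k = dists.index(min(dists))     # index of the closest centroid
        labels.append(k + 1)            # 1-based cluster numbers, as in CLUSTERS
        err += dists[k]                 # accumulate the error statistic
    return labels, err

data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
centroids = [[0.0, 0.5], [5.0, 5.0]]
labels, err = assign_and_error(data, centroids)
print(labels, err)   # [1, 1, 2] with error 0.25 + 0.25 + 0 = 0.5
```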
Eigenvalues and eigenvectors
In Release 7.4, a number of new functions for calculating eigenvalues and eigenvectors were announced without giving any details. In the end, I decided to use a different approach. The following new functions have been added that provide improved accuracy over the existing eVALUES, eVAL, eVECTORS and eVECT functions. In fact, these legacy functions can still be used, although they have been reimplemented using the new eigVALSym and eigVECTSym functions described below.
eigVAL(R1, order, check, iter, prec): returns a 3 × n array, where n = the number of rows/columns in the square array R1. The first two rows of the output consist of the real and imaginary parts of the n eigenvalues of the square matrix A corresponding to the data in R1. The third row of the output consists of the values det(A−λI) for each eigenvalue λ.
eigVECT(R1, order, check, iter, prec): returns an (n+4) × n array, where n = the number of rows/columns in the square array R1. The first two rows of the output consist of the real and imaginary parts of the n eigenvalues of the square matrix A corresponding to the data in R1. Below each real eigenvalue λ in the first two rows (i.e. in the columns where the second row is zero) is an n × 1 unit eigenvector corresponding to λ. The second-to-last row of the output contains the values det(A−λI). In the last row of the output, below each real eigenvalue λ and eigenvector X, is the value max {bᵢ: i = 1 to n} where B = AX − λX.
If order = TRUE (default) then the eigenvalues are arranged in descending order of the absolute values of their real parts; if order = FALSE then the eigenvalues are arranged in descending order of their real parts.
If check = TRUE (default), then the output is as described above; if check = FALSE then only the first two rows of the output are returned for eigVAL, and the last two rows of the output are not returned for eigVECT.
If iter > 0 then an additional binary divide-and-conquer algorithm is applied to the real eigenvalues to improve their precision (i.e. potentially reduce the values in the last row of eigVAL or the second-to-last row of eigVECT). iter = the maximum number of iterations of this algorithm (default 200) and prec is the desired precision (default .00000001).
The following related array functions have also been added. These support symmetric square matrices.
eigVALSym(R1, order, check, iter, prec)
eigVECTSym(R1, order, check, iter, prec)
The output is the same as for eigVAL and eigVECT except that the second row of the output (containing the imaginary parts of the eigenvalues) is omitted. Symmetric matrices can only have real eigenvalues and eigenvectors, and so the imaginary part of each eigenvalue is zero.
Actually, eigVALSym and eigVECTSym can be used even with nonsymmetric matrices. If such a matrix has a non-real eigenvalue, then the first row of the output will contain the value “imag” for each non-real eigenvalue. Since such eigenvalues occur in pairs (λ = u±vi), you should expect an even number of such entries.
The eigVAL and eigVECT functions are now used in the Matrix Operations data analysis tool.
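The two accuracy checks that eigVAL and eigVECT report can be illustrated with NumPy. To be clear, the add-in computes eigenpairs with its own algorithms; NumPy is used here only to show what the det(A−λI) row and the max residual row are measuring, for a small symmetric matrix:

```python
# Illustration of the checks reported by eigVAL/eigVECT: det(A - lambda*I)
# should be near zero for each eigenvalue, and AX - lambda*X should be near
# the zero vector for each eigenpair. NumPy stands in for the add-in here.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric, so all eigenvalues are real

vals, vecs = np.linalg.eig(A)       # eigenvalues 3 and 1
det_checks = [float(np.linalg.det(A - lam * np.eye(2))) for lam in vals]
resids = [float(np.max(np.abs(A @ x - lam * x)))
          for lam, x in zip(vals, vecs.T)]
print(vals)
print(det_checks, resids)           # both lists should be near zero
```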
LU factorization
The following array functions have been added to support the LU decomposition (aka LU factorization) of a square matrix. Actually, these functions support the LUP decomposition where the P represents the permutation matrix.
LUPFactorP(R1): outputs the n × n permutation matrix of the LUP decomposition of the n × n matrix in array R1. This matrix describes the permuted order of the rows in R1 required to obtain the LU decomposition. This matrix is characterized by having only zero and one values and exactly one non-zero value in every row and column.
LUPFactorL(R1): outputs the n × n lower triangular matrix of the LUP decomposition of the n × n matrix in array R1.
LUPFactorU(R1): outputs the n × n upper triangular matrix of the LUP decomposition of the n × n matrix in array R1.
LUPFactor(R1): outputs a (2n+1) × n array consisting of the components of the LUP decomposition of the n × n matrix in array R1. The first n rows of the output consist of a lower triangular matrix L, the next n rows consist of an upper triangular matrix U, and the last row defines the permutation of the rows of R1 required to create the matrix P such that A = PᵀLU where A is the matrix in R1.
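The decomposition these functions produce can be sketched as LU with partial pivoting. The Python below is an illustration of the standard algorithm, not the add-in's implementation; it builds P so that PA = LU, which is equivalent to A = PᵀLU since P is a permutation matrix.

```python
# Sketch of the LUP decomposition: partial pivoting brings the largest
# remaining entry in each column to the pivot position, yielding P A = L U.

def lup(A):
    n = len(A)
    U = [row[:] for row in A]               # working copy; becomes U
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))                   # tracks the row permutation
    for j in range(n):
        # partial pivoting: move the largest |entry| in column j up to row j
        piv = max(range(j, n), key=lambda i: abs(U[i][j]))
        U[j], U[piv] = U[piv], U[j]
        L[j], L[piv] = L[piv], L[j]
        perm[j], perm[piv] = perm[piv], perm[j]
        L[j][j] = 1.0
        for i in range(j + 1, n):           # eliminate entries below the pivot
            m = U[i][j] / U[j][j]
            L[i][j] = m
            U[i] = [uij - m * ujj for uij, ujj in zip(U[i], U[j])]
    # permutation matrix: a single 1 in each row and column, as LUPFactorP returns
    P = [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]
    return P, L, U

A = [[0.0, 2.0], [1.0, 1.0]]    # needs a row swap: no pivot in position (1,1)
P, L, U = lup(A)                # now P @ A == L @ U
```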
In addition, an LU factorization option has been added to the Matrix Operations data analysis tool.
Finally, the following functions have also been added:
MPERM(R1): outputs the permutation matrix defined by the permutations listed in the row array R1. E.g. if R1 contains the values 4, 1, 3, 2 then the output will be a 4 × 4 matrix whose only non-zero entries are ones in the 4th column of row 1, the 1st column of row 2, the 3rd column of row 3 and the 2nd column of row 4.
LUDet(R1) = the determinant of the square matrix in R1 using the LU decomposition
LUInverse(R1): outputs the inverse of the square matrix in R1 using the LU decomposition
LUSolve(R1, R2): outputs the solution to the series of linear equations AX = C where R1 contains matrix A and R2 contains the column array C; an error value is returned when there is no solution or there is no unique solution
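To see why an LU factorization gives determinants and solutions cheaply (the idea behind LUDet and LUSolve), here is a small worked example in Python. The factors for A = [[4, 3], [6, 3]] are written out directly under the assumption PA = LU with one row swap; this is a sketch of the standard technique, not the add-in's code.

```python
# Worked example: determinant and linear solve from an LU factorization,
# as in LUDet and LUSolve. A = [[4, 3], [6, 3]]; one row swap gives P A = L U.

L = [[1.0, 0.0],
     [4.0 / 6.0, 1.0]]
U = [[6.0, 3.0],
     [0.0, 1.0]]
sign = -1.0                  # one row swap flips the determinant's sign

# det(A) = sign * product of U's diagonal entries
det = sign * U[0][0] * U[1][1]

# Solve A x = c: permute c, forward-substitute L y = Pc, back-substitute U x = y
c = [1.0, 2.0]
pc = [c[1], c[0]]            # apply the same row swap to the right-hand side
y0 = pc[0]
y1 = pc[1] - L[1][0] * y0    # forward substitution
x1 = y1 / U[1][1]            # back substitution
x0 = (y0 - U[0][1] * x1) / U[0][0]
print(det, [x0, x1])         # -6.0 and the solution [0.5, -1/3]
```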
Solving a system of linear equations
Real Statistics already provides capabilities for solving a system of linear equations AX = C, especially when there is a unique solution (e.g. the LUSolve function described above). In fact, the standard Excel formula =MMULT(MINVERSE(A),C) can be used.
The following array function has been added which returns all the solutions to AX = C even when there isn't a unique solution. Here R1 contains an augmented m × n matrix (i.e. the A matrix is found in the first n−1 columns of R1 and the last column of R1 contains the C vector).
LinEqSolve(R1, prec): outputs an n × k array with k ≤ n where if X1, …, Xk are the columns in the output, then all the solutions to AX = C take the form X = c1X1 + c2X2 + … + ck-1Xk-1 + Xk for any choice of constants c1, …, ck-1.
Values less than or equal to prec (default 0.0001) are treated as if they were zero.
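The structure of this solution family (a particular solution plus free multiples of null-space vectors) can be demonstrated with NumPy. This sketch uses least squares and the SVD purely for illustration; the add-in's method may differ.

```python
# Sketch of the solution family LinEqSolve describes: every solution of
# A x = c is one particular solution plus any combination of vectors from
# the null space of A.
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])     # rank 2, so a one-dimensional null space
c = np.array([2.0, 3.0])

x_part, *_ = np.linalg.lstsq(A, c, rcond=None)   # one particular solution

# null space of A: right-singular vectors whose singular values are ~ 0
_, s, Vt = np.linalg.svd(A)
tol = 1e-10
null = Vt[len(s[s > tol]):]         # rows of Vt beyond the rank span null(A)

# any x = x_part + t * null_vector solves A x = c, for every constant t
x = x_part + 2.5 * null[0]
print(np.allclose(A @ x, c))        # True
```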
Creating an orthonormal basis using Gram-Schmidt
Suppose the columns of the m × k array R1 consist of vectors that are linearly independent. We know that we can use the Gram-Schmidt algorithm to find an orthonormal basis for the span of these vectors. In fact, for n such that m ≥ n ≥ k we can find an orthonormal basis with n vectors for a span that includes the span of the original k vectors.
This functionality has been added to Real Statistics via the following function:
GRAM(R1, n, prec): outputs an m × n array whose columns form an orthonormal basis whose span includes the span of the columns in R1. In executing the algorithm, values less than or equal to prec are considered to be equivalent to zero (default 0.0000001).
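The core of the algorithm can be sketched in a few lines of Python. This covers only the orthonormalization of the given vectors (GRAM can additionally extend the basis to n vectors); vectors whose residual norm falls below the tolerance are treated as dependent and skipped, mirroring the prec parameter.

```python
# Sketch of classical Gram-Schmidt: subtract each vector's projections onto
# the basis built so far, then normalize what remains.
import math

def gram_schmidt(vectors, prec=1e-7):
    basis = []
    for v in vectors:
        w = v[:]
        for b in basis:                      # remove components along the basis
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > prec:                      # below prec: treated as zero
            basis.append([wi / norm for wi in w])
    return basis

B = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
print(B)    # two unit-length, mutually orthogonal vectors
```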
SVD decomposition and pseudo-inverse of a matrix
The SVD_U function now takes the form SVD_U(R1, iter, prec) where iter (default 200) and prec (default 0.00000001) are as for the eigVECTSym function. SVD_D and SVD_V take the same arguments.
A pseudo-inverse of a matrix is an extension of the inverse of a square matrix to non-square matrices. The following array function has been added:
PseudoInv(R1, iter, prec): outputs an array that is the pseudo-inverse of the matrix in array R1.
This function uses the SVD to calculate the pseudo-inverse. iter (default 200) and prec (default 0.00000001) are as for the SVD_U function.
The pseudo-inverse of an m × n matrix A can also be defined as (AᵀA)⁻¹Aᵀ when m > n and as Aᵀ(AAᵀ)⁻¹ when m < n. Real Statistics now supports this approach via the formula PseudoInv(R1,0).
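For a full-rank matrix the two routes agree, which can be checked with NumPy (used here only for illustration; PseudoInv has its own implementation):

```python
# The SVD-based pseudo-inverse and the normal-equations formula
# (A^T A)^(-1) A^T coincide for a tall matrix of full column rank.
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])       # 3 x 2, full column rank (m > n)

pinv_svd = np.linalg.pinv(A)                     # SVD-based pseudo-inverse
pinv_formula = np.linalg.inv(A.T @ A) @ A.T      # (A^T A)^(-1) A^T
print(np.allclose(pinv_svd, pinv_formula))       # True
```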
Convert a number to a letter grade
The following function, which is the complement to GRADEVAL, has been added:
VALGRADE(grade, pm, skinny) = the letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F) corresponding to the numeric score specified by grade (which usually takes a value between 0 and 4.5). pm (default 1/3) determines how the plus and minus symbols are assigned to the letter grade. If skinny = TRUE (default FALSE) then the A+ and D- grades are not used (they are replaced by A and D).
E.g. suppose that pm = .2, then VALGRADE(grade) = “B” for 2.8 ≤ grade < 3.2, VALGRADE(grade) = “B+” for 3.2 ≤ grade < 3.5 and VALGRADE(grade) = “B-” for 2.5 ≤ grade < 2.8.
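The banding implied by this example can be sketched in Python. The scheme below is inferred from the description above (each letter centered at an integer point, with pm setting the plus/minus boundaries and 0.5 the band edges); treat it as an illustration, not the add-in's exact code.

```python
# Sketch of VALGRADE-style banding, inferred from the example above:
# letters sit at integer points (A = 4, B = 3, C = 2, D = 1); the plain
# grade covers [n - pm, n + pm), "+" up to n + 0.5, "-" down to n - 0.5.

def val_grade(grade, pm=1/3, skinny=False):
    if grade < 0.5:
        return "F"
    n = min(4, int(grade + 0.5))          # nearest letter point, capped at A
    letter = {4: "A", 3: "B", 2: "C", 1: "D"}[n]
    if grade >= n + pm:
        letter += "+"
    elif grade < n - pm:
        letter += "-"
    if skinny:                            # skinny: A+ and D- are not used
        letter = {"A+": "A", "D-": "D"}.get(letter, letter)
    return letter

print(val_grade(3.0, pm=0.2))    # B
print(val_grade(3.3, pm=0.2))    # B+
print(val_grade(2.6, pm=0.2))    # B-
```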