Real Statistics Descriptive Statistics and Reformatting Functions

The following is a summary of worksheet functions provided in the Real Statistics Resource Pack that pertain to descriptive statistics and reformatting of data. Some other miscellaneous functions are also included.

These functions are organized into the following categories:

  • Descriptive statistics
  • Divergence
  • Frequency tables
  • Descriptive statistics for frequency tables
  • Box-Cox transformation
  • Counts
  • Sorting and eliminating duplicates
  • Sampling and reshaping
  • Conditional functions
  • Lambda functions
  • Colors
  • Text strings
  • Categorical coding
  • Miscellaneous functions

Descriptive statistics

IQR(R1, bexc) interquartile range of data in R1; if bexc = TRUE then use QUARTILE while if bexc = FALSE then use QUARTILE.EXC or QUARTILE_EXC
PERCENTILE_EXC(R1, p, m) pth percentile from the data in R1 where = Hyndman-Fan method (default of 6 is equivalent to PERCENTILE.EXC)
QUARTILE_EXC(R1, quart, m) quartth quartile from the data in R1 where = Hyndman-Fan method (default of 6 is equivalent to QUARTILE.EXC)
HD_QUANTILE(R1, quart, m) pth Harrell-Davis quantile for the data in the column array R1
MAD(R1, harrell) median absolute deviation of the data in R1; if harrell = TRUE (default FALSE) then the Harrell-Davis version of the MAD is returned
DoubleMAD(R1, lower, harrell) lower MAD of the data in R1 if lower = TRUE (default); otherwise the upper MAD is returned; if harrell = TRUE (default FALSE) the Harrell-Davis MAD is used.
RNG(R1) range of data in R1, i.e. MAX(R1) – MIN(R1)
SKEWP(R1) population skewness for the data in R1
KURTP(R1, excess) population kurtosis for the data in R1; if excess = TRUE (default) then 3 is subtracted
STDERR(R1) standard error of the data in R1; equivalent to STDEV(R1)/SQRT(COUNT(R1))
RANK_AVG(x, R1, order, num_digits) rank of x in R1 taking ties into account; order = 0 (default) indicates decreasing order and a non-zero value indicates increasing order; equivalent to Excel’s RANK.AVG(x, R1, order) except that the data is first rounded off to num_digit decimal places.
SMALLExact(R1, k) SMALL(R1, k) when k is an integer; otherwise it is the interpolation between SMALL(R1, m) and SMALL(R1, m+1) when m < k < m+1 where m is an integer.
Ties_Correction(R1, R2, type) ties correction value for the data in R1 and optionally R2, where type = 0: one sample, type = 1: paired sample, type = 2: independent samples
WINMEAN(R1, p) Winsorized mean of the data in R1 replacing the lowest and highest p/2 % of the data values.
TRIM_MEAN(R1, p, p1) equivalent to AVERAGE(TRIMDATA(R1, p, p1))
BIWEIGHT(R1, iterprecc, pure) Tukey’s biweight estimator of the data in R1 based on a maximum of iter iterations (default 50) and the selected value for c (default 4.685)
HUBER(R1, iterprecc, pure) Huber’s estimator of the data in R1 based on a maximum of iter iterations (default 50) and the selected value for c (default 1.339)
STANDARD(x, R1, type, bexc) =STANDARDIZE(x, AVERAGE(R1), STDEV(R1) if type = 0 (default); =(x−Q3)/IPR if type = 1 and x > Q3; =(x−Q1)/IPR if type = 1 and x < Q1; = 0 otherwise. IPR = IPR(R1,bexc), Q1 = QUARTILE(R1,1) and Q3 = QUARTILE(R1,3) if bexc = FALSE (default) and Q1 = QUARTILE.EXC(R1,1) and Q3 = QUARTILE.EXC(R1,3) if bexc = FALSE.
AUC_LOWER(aucn1, n2, α) lower limit of the 1–α  confidence interval for the area under the curve = auc for samples of size n1 and n2
AUC_UPPER(aucn1, n2, α) upper limit of the 1–α  confidence interval for the area under the curve = auc for samples of size n1 and n2

Array functions

There are also the following array functions:

SUMCOL(R1) returns a row array with the sums of each column in R1
MEANCOL(R1) returns a row array with the means of each column in R1
VARCOL(R1) returns a row array with sample variances of each column in R1
STDEVCOL(R1) returns a row array with sample standard deviations of each column in R1
STDCOL(R1) returns an array that has the same dimensions as R1 but with a standardization of the values in each column
SUMROW(R1) returns a column array with sums of each row in R1
MEANROW(R1) returns a column array with means of each row in R1
VARROW(R1) returns a column array with sample variances of each row in R1
STDEVROW(R1) returns a column array with sample standard deviations of each row in R1
RANKS(R1, order, num_digits) returns a column array with RANK_AVG(x, R1, order, num_digits) for each numeric element x in R1. 
RANKCOL(R1, order, num_digits) returns an array of the same size and shape as R1, each column of which contains the ranks for that column (the data in each column is ranked separately).
TRIMDATA(R1, p) returns a column array equivalent to R1 after removing the lowest and highest p/2 of the data values.
WINSORIZE(R1, p) returns a column array that is the Winsorized version of R1 replacing the lowest and highest p/2 of the data values.
XSTANDARDIZE(R1) returns an array that is equivalent to the array formula =STANDARDIZE(R1,AVERAGE(R1),STDEV.S(R1))

TRIMDATA, WINSORIZE, and WINMEAN can optionally take a third argument, p1, When p1 is present then the lowest p % and highest p1 % of the data values in R1 are removed/replaced.

Weighted statistics

In the following R1 contains data and R2 contains a column array of weights.

MEAN(R1, R2) weighted average of elements in R1; equivalent to AVERAGE(R1) if R2 is omitted
MED(R1, R2, p) weighted percentile of elements in R1 based on p (default .5); equivalent to weighted median if p is omitted, equivalent to MEDIAN(R1) if R2 and p are omitted
WVAR(R1, R2, rel) weighted variance of elements in R1; equivalent to VAR.S(R1) if R2 is omitted
WSTDEV(R1, R2, rel) weighted standard deviation of elements in R1; equivalent to STDEV.S(R1) if R2 is omitted
WCOV(R1, R2, rel) weighted covariance matrix for data in R1; equivalent to COV(R1) if R2 is omitted and rel = TRUE.

If rel (default TRUE), then reliability weights are used; otherwise frequency weights are used.

Divergence

KL_DIVERGE(R1, R2) Kullback-Leibler Divergence for R1 || R2
JS_DIVERGE(R1, R2) Jensen-Shannon Divergence for R1 and R2
CS_DIVERGE(R1, R2, R3) Divergence for credit scoring for up to three column arrays; e.g. R1 contains data, and R2 & R3 contain frequencies

Frequency table

The following are array functions:

FREQTABLE(R1) returns an n × 3 array that contains the frequency table for the data in R1, where n = the number of unique values in R1 (i.e. the number of data elements in R1 without duplicates)
FREQTABLE(R1, bsize, bmax) returns the frequency table for the data in R1, assuming equally sized bins of size bsize; bmax = largest bin value, if omitted, maximum value in R1 is used
FREQ2RAW(R1, R2) returns a column array with the equivalent raw data based on a frequency table described by R1 (data items) and R2 (frequencies)
FREQ2RAW(R1) equivalent to FREQ2RAW(R0, R2) where R2 is the last column in R1 and R2 are the other column in R1
FREQ_REFORMAT(R1) returns a two-column array with data equivalent to R1 but suitable for creating a histogram with a scatter plot

Frequency table descriptive statistics

For the following functions, R1 and R2 define a frequency table where R1 contains data items and R2 contains the corresponding frequencies.

MEDIAN_FREQ(R1, R2) median of the data in the frequency table
MAD_FREQ(R1, R2) median absolute deviation (MAD) of the data in the frequency table
IQR_FREQ(R1, R2, bexc) inter-quartile range (IQR) of the data in the frequency table
SMALL_FREQ(R1, R2, k) kth smallest element of the data in the frequency table
PERCENTILE_FREQ (R1, R2, p, bexc) pth percentile of the data in the frequency table
QUARTILE_FREQ (R1, R2, q, bexc) qth percentile of the data in the frequency table where q = 1, 2, 3 or 4.

If bexc = TRUE, then the exclusive version of percentiles is used; otherwise (default) the inclusive version is used.

Box-Cox Transformation

BOXCOX(R1, λ) outputs an array containing the Box-Cox transformation of the data in R1 using the given λ value. If λ is omitted, then the transformation that best normalizes the data in R1 is used based on maximizing the log-likelihood function.
BOXCOXLL(R1, λ)  log-likelihood function of the Box-Cox transformation of the data in R1 using the given lambda value
BOXCOXLambda(R1) lambda value that maximizes the log-likelihood function of the Box-Cox transformation of the data in R1

Counts

COUNTU(R1) count of the number of unique numeric values in R1
COUNTAU(R1) count of the number of unique non-empty cells in R1
COUNTCOL(R1) array function which returns a row array with the number of numeric values in each column of R1
COUNTROW(R1) array function which returns a column array with the number of numeric values in each row of R1
CountRowsUnique(R1, head, ncols) count of the number of unique rows in R1; the last ncols columns of R1 are not considered when determining uniqueness (ncols = 0).
CountFullRows(R1, blank, head) number of rows in R1 that don’t have any blank entries if blank = TRUE (default) and the number of rows in R1 that don’t have any non-numeric entries if blank = FALSE
CountPairs(R1, R2, blank) number of pairs of entries from R1 and R2 for which neither pair is blank if blank = TRUE (default) and number of pairs from R1 and R2 for which neither pair is non-numeric if blank = FALSE

If head = TRUE (default FALSE) then the first row of R1 (presumably a heading) is not counted.

Sorting and eliminating duplicates

The following are array functions

QSORT(R1, b) fills highlighted range with data from R1 in sorted order (by columns); b is an optional parameter (default = TRUE); if b is TRUE then sort is in ascending order and if b is FALSE (or 0) sort is in descending order.
QSORTRows(R1, col, ascend, head) fills highlighted range with rows from R1 in sorted order based on the elements in the colth column of R1; if ascend is TRUE (default) then sort is in ascending order and if ascend is FALSE (or 0) sort is in descending order; if head = TRUE (default is FALSE) then first row is presumed to contain column headings and is not sorted
QSORT2Rows(R1, col1, col2, ascend, head) fills highlighted range with rows from R1 in sorted order using two sort keys col1 and col2; ascend and head are as for QSORTRows.
QSORT2RowsMixed(R1, col1, col2, ascend, head) sorts R1 (containing only numeric data) as for QSORT2Rows, except that if ascend is TRUE (default) then primary sort on col1 is in ascending order and secondary sort on col2 is in descending order, while if ascend is FALSE then primary sort is descending and secondary sort is ascending 
SortUnique(R1, s) fills highlighted range with unique elements from R1 in ascending sorted order; output is a single column
SortsUnique(R1) returns a column array with unique elements from R1 in ascending sorted order
ExtractUnique(R1, s) fills highlighted range with unique elements from R1 in the order they appear first in R1
ExtractsUnique(R1) returns a column array with unique elements from R1 in the order they appear first in R1; output is a single column
NODUPES(R1, s, b) fills highlighted range with data from R1 eliminating any duplicates (by columns); if b = TRUE then the data in R1 is sorted first (by column) and if b = FALSE then it is assumed that R1 is in sorted order.
SortRows(R1, k, head) returns an array of the same size and shape as R1 which sorts the rows in R1 based on the sort keys consisting of the first k columns of R1 (default 1). If head = TRUE (default FALSE) then the first row of R1 is assumed to contain column headings and is retained as the heading of the output
SortRowsUnique(R1, s) fills highlighted range with unique rows from R1 in ascending sorted order; output has the same number of columns as R1.
SortRowsCount(R1, s) fills highlighted range with unique rows from R1 in ascending sorted order plus a count of the number of occurrences of that row in R1; output has one more column than R1.
SortRowsSum(R1, s) fills highlighted range with unique rows from R1 excluding the last column in ascending sorted order plus a sum of the values in the last column of R1 for matching occurrences; output has the same number of columns as R1.
SortsRowsUnique(R1) returns an array with the data in R1 in sorted order, eliminating any duplicate rows. 
SortsRowsCount(R1) returns an array with the data in R1 in sorted order, eliminating any duplicate rows. The last element of each row in the output contains the count of the number of rows in R1 which matches that row in the output; the output should have one more column than R1,
SortsRowsSum(R1) returns an array with the data in R1 in sorted order, excluding the elements in the last column and eliminating any duplicate rows.  The last element in each row of the output contains the sum of the values in the last cell of each of the rows in R1 that matches that row in the output; the output should have the same number of columns as R1

The string s is used as a filler in case the output range has more rows/cells than needed. This argument is optional and defaults to the error value #N/A.

Sampling and reshaping

The following are array functions:

REVERSE(R1, s) fills highlighted range with data from R1 in reverse order (by columns)
REVERSES(R1) returns an array of the same size and shape as R1 with the elements in reverse order
RESHAPE(R1, s) fills highlighted range with data from R1 (by columns)
RESHAPE(R1, s, nrows, ncols) fills an nrows × ncols range with data from R1 (by columns)
RESHAPES(R1, nrows, ncols, bycol) returns an nrows × ncols array with the elements from R1. If nrows (ncols) is missing then nrows (ncols) is set to the smallest value such that all the elements in R1 are output. If bycol = TRUE (default), then elements are returned in column order; otherwise, they are in row order.
DELROW(R1, k) returns an array identical to R1 except that the kth row has been removed
DELCOL(R1, k) returns an array identical to R1 except that the kth column has been removed
SHUFFLE(R1, s, nrows, ncols) fills highlighted range with a permutation of the data from R1 (sampling without replacement); alternatively, you can specify the shape of the output via nrows × ncols
SHUFFLES(R1) returns an array of the same size and shape as R1 with a permutation of the data from R1 (sampling w/o replacement)
RANDOMIZE(R1, s, nrows, ncols) fills highlighted range with a random selection of data from R1 (sampling with replacement); alternatively, you can specify the shape of the output via nrows × ncols
RANDOMIZES(R1) returns an array of the same size and shape as R1 with a random selection of data from R1 (sampling with replacement)
SHUFFLERows(R1) returns an array with the rows of R1 in random order
RANDOMIZRows(R1) returns an array of the same size and shape as R1 with a random selection of the rows from R1
ExtractRandRows(R1, col) returns an array such that for every value in column col of R1, one row with this value is selected at random from R1 and included in the output
DELBLANK(R1, s) fills highlighted range with data from R1 (by columns) omitting any empty cells
DELNonNum(R1, s) fills highlighted range with data from R1 (by columns) omitting any non-numeric cells
DELROWBLANK(R1, head, s) fills highlighted range with data from R1 (by columns) omitting any row with one or more empty cells; if head is TRUE then the first row of R1 (presumably containing column headings) is always copied (even if it contains a blank cell); the second argument is optional defaults to head = FALSE.
DELROWNonNum(R1, head, s) fills highlighted range with data from R1 (by columns) omitting any row with one or more non-numeric cells; if head is TRUE then the first row of R1 (presumably containing column headings) is always copied (even if it contains a non-numeric cell); the second argument is optional and defaults to head = FALSE.
DELROWS(R1, head, blank) returns an array with the data in R1 omitting any row that has one or more blank elements if blank = TRUE or one or more non-numeric elements if blank = FALSE (default); if head = TRUE, then the first row is always included in the output; otherwise (default), the first row is treated like any other row
DELErr(R1) returns an array of the same size and shape as R1 with data from R1 except that any elements with an error value are replaced by the empty value “”
SUBMATRIX(R1, row1, col1, nrows, ncols) returns an nrows × ncols subarray of R1 (nrows defaults to m – row1 + 1 and ncols defaults to n – col1 + 1) starting at the cell in the row1th row (default 1) and col1th column (default 1) of the m × n array R1.
BCOMPACT(R1, head) returns an m × (k+1) array which is a summary table for the data in the h × k array R1 whose kth column can only contain the values 0 or 1. The first k-1 columns of the output consist of the unique values in the first k-1 values of R1. The kth column consists of the counts of the matching rows from R1 that contain 1 in the kth column and the k+1th column consists of the counts of the matching rows from R1 that contain 0 in the kth column.
COMPACT(R1, head) returns an m × n array which is a summary table for the data in the h × k array R1. The first k-1 columns of the output consist of the unique values in the first k-1 values of R1. The remaining columns consist of the counts of rows in R1 that match the non-negative integer in the kth column of R1.
SelectCols(R1, ss, c) fills highlighted range with the columns from R1 defined by the string ss, where ss contains a list of column numbers of the columns to be selected from R1; if c > 0 then also sort the rows in the output based on the elements of the cth column in the output
IndexedValues(R1, type) returns a consolidation of unique rows in R1 omitting the last 2 columns plus one column for each unique value in the second-to-last column of R1 which contains a value based on type: 0 for “sum” (default), 1 for “use last match”, 2 for “use last match” or “dupe” if there are multiple matches
IndexedCounts(R1) returns a consolidation of unique rows in R1 omitting the last column plus one column for unique value in the last column of R1 containing counts

The string s is used as a filler in case the output range has more cells/rows than needed. This argument is optional and defaults to the error value #N/A.

Conditional Functions

COUNT_IF(R1, s) = # of times s appears in column array R1

COUNTS_IF(R1): returns an k × 2 array where the first column consists of all the unique values in column array R1 in sorted order and the second column consists of the values =COUNT_IF(R1, s) where s is the corresponding value in the first column

SUM_IF(R1, s, R2) = the sum of all values in column array R2 whose corresponding row in the column array R1 contains s.

SUMS_IF(R1, R2): returns an k × d+1 array where the first column consists of all the unique values in the m × d+1 array R1 in sorted order and the next d columns consist of the values =SUM_IF(R1, s, R2j) where s is the corresponding value in the first column and R2j is the jth column in R2.

AVG_IF(R1, s, R2): returns a 1 × d row array (or numerical element if d = 1), whose jth element is equal to the average of all values in the jth column of R2 whose corresponding row in R1 contains s.

AVGS_IF(R1, R2): returns an k × d+1 array where the first column consists of all the unique values in the m × d+1 array R1 in sorted order and the next d columns consist of the values =AVG_IF(R1, s, R2) where s is the corresponding value in the first column.

COV_IF(R1, s, R2): returns a d × d row array with the sample covariance matrix for the data in the matrix consisting of the rows in R2 where the corresponding element in R1 is s.

Click here for additional information about these worksheet functions.

Lambda Functions

Real Statistics provides the following functions that mimic Excel’s new lambda functions. Click here for more information about these functions.

  • LAMB(expression, parameter_list): returns a text string that represents the function defined by expression on the specified list of parameters
  • EVALS(expressionvalue_list): returns the value of the lambda function represented by expression at the specified list of values
  • EVALArray(expression, R0): returns the value of the lambda function represented by expression at the specified list of values in the column array R0
  • EVALFunc(lambda-expression, value_list): returns the value of the numeric lambda function represented by lambda-expression at the specified list of values (in set brackets)
  • BYCOLS(R1, expression, col): returns a row array each of whose elements is the value obtained by applying the function defined by expression to each of the columns in R1.
  • BYROWS(R1, expression, col): returns a column array each of whose elements is the value obtained by applying the function defined by expression to each of the rows in R1.
  • REDUCES(init, R1, expressionvalue, accumulator) = the value of the accumulator after processing the value of all the elements in R1. accumulator is initialized with the value init and for each element in R1, the accumulator is updated using expression.
  • SCANS(init, R1, expressionvalue, accumulator): returns an array of the same size and shape as R1 where each element in the output is the then-current value of the accumulator using the same approach as for REDUCES.
  • MAKESARRAY(nrowncolsexpressionrow, col): returns an nrows × ncols array each of whose values is defined by expression based on the row and col values.
  • MAPS(R1, expression, element): returns an array of the same size and shape as R1 each of whose elements is defined by expression based on the corresponding element in R1.

Data Table Functions

Related to these functions are the following Data Table functions:

  • EVALX(Rd, Rx, Rf, R1): returns an array of the same size and shape as the data range Rd with the values f(a) for all the values a in Rd based on the function f(x) defined by the expression in Rf with the variable Rx with an optional intermediate function defined by R1.
  • EVALXY(Rx, Ry, Rf, R1): fills the highlighted range with the values f(a,b) for all the values a in the column range to the immediate left of the highlighted range and all the values b in the row range immediately above the highlighted range, based on the function f(x,y) defined by the expression in Rf with the variables Rx and Ry with an optional intermediate function defined by R1.
  • EVAL2(x, y, Rx, Ry, Rf, R1): returns f(x,y) for the specified values of x and y., where Rx, Ry, Rf, and R1 are as in EVALXY.

Colors

FillRGB(R1) RGB background color value of the cells in range R1 (if the background colors of the cells differ then zero is returned)
FontRGB(R1) RGB font color value of the cells in range R1 (if the font colors of the cells differ then zero is returned).
RedRGB(n) red component (0 to 255) of the RGB value n
GreenRGB(n) green component (0 to 255) of the RGB value n
BlueRGB(n) blue component (0 to 255) of the RGB value n
RGBCode(text RGB value of the color described by designated color in text
FillColor(R1, approx) fill color of R1 as text. If approx = FALSE (default) an exact match is found; otherwise “unknown” is returned. If approx = TRUE then the closest of the named colors is returned.
FontColor(R1, approx) font color of R1 as text. If approx = FALSE (default) an exact match is found; otherwise “unknown” is returned. If approx = TRUE then the closest of the named colors is returned.
IsBold(R1) TRUE if all the cells in range R1 are bold and FALSE otherwise.
RGBDistSq(rgb1, rgb2) squared distance between the colors with RGB codes rgb1 and rgb2
ColorDistSq(color1, color2) squared distance between the named colors color1 and color2
Convert2Grey(rgb1) code of the greyscale equivalent to the usual RGB code rgb1

In addition, the following function is supported:

CodeImage(R1, height, width, onerow, greyscale): when onerow = FALSE (default), then this function returns a height × width array with the RGB codes of each cell in R1 if greyscale = FALSE (default) or greyscale codes, 0-255, if greyscale = TRUE.

When onerow = TRUE, then this function returns a row array with height × width columns with RGB or greyscale codes, as described above.

Text string

TEXT_JOIN (delim, ignore_empty, txt) returns a text string that concatenates all the text elements in txt inserting the delimiter delim between each of the text elements If ignore_empty = TRUE then any empty cells in txt are ignored
TEXT_REVERSE(s) returns a string with the characters in string s in reverse order
Array2String(R1) returns a string that represents the array R1
CharCount (s, c) returns the number of times the character c appears in string s

Categorical Coding

CATCODE(R1) array function that returns an array of the same size and shape as R1 with simple coding of values in R1
TAGCODE (R1, code, last) array function that returns an m × (k–1) array with dummy coding of values in column array R1 where R1 has m rows and k unique values; if code = TRUE (default) 0,1 coding is used; otherwise, 0,1,-1 coding is used. If last = TRUE (default) then the last new element in R1 is the reference; otherwise, the first element in R1 is the reference category.

Other functions

FText(R1) formula in cell R1 as text
VER() current version of the Real Statistics Resource Pack (e.g. 9.2 Excel 2016/2019/2021/365 Mac)
ExcelVer() Excel version (e.g. Excel 2010 is 14.0)
SheetNames() returns a column array with a list of all the sheet names for the current workbook
INDEX_MAX(R1, excl) relative position of the largest element in the column array or range R1 excluding the excl element (if argument is present)
INDEX_MIN(R1, excl) relative position of the smallest element in the column array or range R1 excluding the excl element (if present)
FirstNonNum(R1, ntype) address of the first empty or non-numeric element in R1 based on ntype. If R1 contains no non-numeric elements then the value “None” is returned.
RANDOM(a, b, inter, seed) non-volatile random number between a and b; if inter = FALSE (default) then a decimal number is returned, if inter = FALSE an integer is returned; if seed = 0 (default), then no seed is used, otherwise seed is used as a seed value.
RANDX(n, seed) column array of size n consisting of non-volatile random numbers between 0 and 1. seed is as for RANDOM.
XGAMMA(x) gamma function at x, even when x is a non-integer negative number
LowerGamma(x, a) lower incomplete gamma function at x, a
UpperGamma(x, a) upper incomplete gamma function at x, a

Leave a Comment