Sorting and Removing Duplicates by Rows

In most of the statistical analyses that we perform, each row in a range will contain data for a single subject in a sample and the columns correspond to the variables under study. It will, therefore, be useful to sort data by rows based on a sort key (which corresponds to the heading of one of the columns).

Sorting by Rows

Real Statistics Functions: The following array functions in the Real Statistics Resource Pack are used to sort rows in an array or cell range.

SORTRows(R1, k, head) – returns an array of the same size and shape as R1 which sorts the rows in R1 based on the sort keys consisting of the first k columns of R1 (default 1). If head = TRUE (default FALSE) then the first row of R1 is assumed to contain column headings and is retained as the heading of the output.

QSORTRows(R1, col1, ascend, head) – returns an array with the data in R1 in sorted order based on the elements in column  (the “sort key”) of R1. If ascend = TRUE (default), the sort is in ascending order; if ascend = FALSE, the sort is in descending order.

QSORT2Rows(R1, col1, col2, ascend, head) – returns an array with the data in R1 in sorted order based on the elements in column col1 of R1 (the “primary key”). If two rows have the same value in column col1, then these rows are further sorted based on the value in column col2 (“the secondary key”). If ascend = TRUE (default), the sort is in ascending order; if ascend = FALSE, the sort is in descending order.

QSORT2RowsMixed(R1, col1, col2, ascend, head) – returns an array with the data in R1 as for QSORT2Rows, except that if ascend = TRUE (default), then sorting based on the primary key is in ascending order while the sorting based on the secondary key is in descending order, while if ascend = FALSE, then sorting based on the primary key is in descending order while the sorting based on the secondary key is in ascending order.

If col1 is omitted it defaults to 1, i.e. the sort is based on the elements in the first column of R1. If col2 is omitted it defaults to 2, i.e. the sort is based on the elements in the second column of R1. If head = TRUE, then the first row contains a heading that is not to be sorted; the default is FALSE. 

When one of these functions is used in a highlighted range and that range has more cells than R1, the extra cells are filled in with the error value #N/A.

The array or range R1 in QSORT2RowsMixed can only contain numeric data, while R1 in the other two functions can contain any alphanumeric values.

Eliminating Duplicates

The following array functions contained in the Real Statistics Resource Pack are used to sort rows eliminating any duplicates.

SortRowsUnique(R1, s) – returns an array with the data in R1 in sorted order, eliminating any duplicate rows; in general, the output should have the same number of columns as R1.

SortRowsCount(R1, s) – returns an array with the data in R1 in sorted order, eliminating any duplicate rows. An additional column is appended to the output containing counts; in particular for each row in the output, the last column contains a count of the rows in R1 that match that row in the output.

SortRowsSum(R1, s) – returns an array with the data in R1 in sorted order, excluding the elements in the last column and eliminating any duplicate rows. For each row in the output, the last column contains the sum of the values in the last column of those rows in R1 that match that row in the output; the output should have the same number of columns as R1.

The string s is used as a filler when the output has more elements than are needed. This argument is optional and defaults to the error value #N/A.

Use with Dynamic Arrays

The following additional functions are useful with dynamic arrays (see Dynamic Array Formulas).

SortsRowsUnique(R1): returns an array with the data in R1 in sorted order, eliminating any duplicate rows; the output should have the same number of columns as R1.

SortsRowsCount(R1): returns an array with the data in R1 in sorted order, eliminating any duplicate rows. An additional column is appended to the output containing counts; in particular for each row in the output, the last column contains a count of the rows in R1 that match that row in the output.

SortsRowsSum(R1): returns an array with the data in R1 in sorted order, excluding the elements in the last column and eliminating any duplicate rows. For each row in the output, the last column contains the sum of the values in the last column of those rows in R1 that match that row in the output; the output should have the same number of columns as R1.

The following function is also available from the Real Statistics Resource Pack.

CountRowsUnique(R1, head, ncols) = a count of the number of unique rows in R1. If head = TRUE (default FALSE) then the first row of R1 (presumably a heading) is not counted; the last ncols columns of R1 are not considered when determining uniqueness (ncols = 0).

Example 1: Sort the rows in range T3:W14 in Figure 1 based on the elements in its 4th column (i.e. column W). Also, sort range T3:W14 based on the first column.

The results are shown in Figure 1.

Sorting by rows

Figure 1 – Sorting by rows

Note that when there are ties, the original order is used. Note too that the output in range AD3:AG14 can also be obtained via the array formula =SORTRows(T3,W14).

Example 2: Apply the SortRowsUnique, SortRowsCount, and SortRowsSum functions to the data in range A3:D14 of Figure 2.

Sort unique by rows

Figure 2 – Sorting unique by rows

Note too that =CountRowsUnique(A3:C14) yields the value 8.

Data Analysis Tool

Real Statistics Data Analysis Tool: The Reformatting a Data Range by Rows data analysis tool supplied in the Real Statistics Resource Pack provides an easier-to-use version of the QSORTRows function.

For example, to sort the data in range T3:W14 of Figure 1 based on the elements in its 4th column, press Ctrl-m, and choose the Reformatting a Data Range by Rows option.

Fill in the dialog box that appears as shown in Figure 3 and click on the OK button.

Dialog box

Figure 3 – Dialog box for sorting a data range by rows

The output is displayed as shown in range Y3:AB14 of Figure 1.

Leave a Comment