As described elsewhere in this website, especially regarding regression (see ANOVA using Regression), it is common to create dummy (or tag) coding for categorical variables. We have employed both the usual coding (using 1 and 0) as well as the alternative coding (using 1, 0, -1).
On this webpage we show how to create the dummy coding using worksheetl functions found in the Real Statistics Resource Pack. We also demonstrate an even more straightforward coding, called the simple coding, where categories such as Democrat, Republican, Independent are coded as 0, 1, 2.
Worksheet Functions
Real Statistics Functions: The following array functions are provided by the Real Statistics Resource Pack.
CATCODE(R1): If R1 is an m × n array then this function returns an m × n array with the simple coding of the corresponding cell in R1, i.e. a number from 0 to k–1 where k = the number of unique values in R1 (the numbering is in the order of the first appearance of that unique value in R1).
TAGCODE (R1, b, last): if R1 is an m × 1 column range with k unique values, then this function returns an m × (k–1) range with the dummy coding of the values in R1 using the 1/0 coding when b = TRUE (the default) and the 1/0/-1 coding when b = FALSE. If last = TRUE (default), reference code is last unique element in R1; otherwise, the first element in R1 is used.
Note that empty cells are not coded (alternatively you can think of them as being coded as empty).
Example
Example 1: Find the three different types of coding for the Season variable based on the data in range B4:B13 of Figure 1
Figure 1 – Categorical Coding
The simple coding of the Season column is shown in the range F4:F13, which is calculated using the array formula =CATCODE(B4:B13). The E and GÂ columns are copied from columns A and C since we are not coding them.
The usual dummy coding is calculated using the array formula =TAGCODE(B4:B13). Since there are 4 values for the Season variable, there are 4–1 = 3 dummy variables with values shown in range I4:K13.
The alternative dummy coding is shown in range M4:O13Â using the array formula =TAGCODE(B4:B13,FALSE).
Data Analysis Tool
Real Statistics Data Analysis Tool: The Real Statistics Extract Columns data analysis tool also provides a way to perform any one of the codings described above. For more information about this, see Categorical Coding for Regression.
Examples Workbook
Click here to download the Excel workbook with the examples described on this webpage.
References
Howell, D. C. (2010) Statistical methods for psychology (7th ed.). Wadsworth, Cengage Learning.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf
Hi,
I am trying to use the CATCODE Function to create categorical variable values, but every time I attempt it with my data I receive a single entry of zero.
How should I be inputting the values for this function? What am I doing wrong?
Thank you in advance for your help!
CATCODE is an array function and so you need to use it in a different way. See the following webpage for how to use an array function:
Array Formulas and Functions
Charles
Hallo Charles
I had sent you an email asking for remote support
i ma having difficulties in running a multinomial logistic regression with Real Statistics
I just checked my emails and I don’t see an email from you in the past two weeks. Can you please resend your email_
Charles
Hey Charles,
Real statistics is a great portal for learning analytics.
I had a doubt can we use simple coding of categorical variables to convert it into numeric data and apply multiple regression on it.
You can use simple coding provided there is order to the categories (in which case the data is not categorical). If there is no order then you need to use dummy coding.
Charles
The example of how to use tagcode in the last two paragraphs doesn’t look right to me. Shouldn’t the formulas be =TAGCODE(B4:B13) and =TAGCODE(B4:B13,false)?
Alan,
You are correct. I have changed the webpage to correct this error. Thanks very much for catching this error.
Charles
This is not working. Could you explain more of what is being called and returned in the tagcode function? Am I to enter that funtion in one cell and it populates to the whole range?
James,
Suppose you have the following data in range A1:A6
A
2
A
3
A
2
Highlight the range C1:D6 (i.e. any 6 x 2 range — 6 because the original data has 6 rows and 2 (= 3 – 1) because the original data has 3 distinct values (A, 2, 3). If you now enter the formula =TAGCODE(A1:A6) in the highlighted range and press Ctrl-Shft-Enter the output is as follows:
1 0
0 1
1 0
0 0
1 0
0 1
Charles
Wonderful site and wonderful work! Statistics for the masses . . .
Small correction re: categorical variables:
“The E and F columns are ” should read “The E and G columns are “.
Roger,
Thanks for catching this typing mistake. I have now change the webpage as you have sugested.
Charles