Introduction
When data is organized in the form of a contingency table (see Independence Testing) where the two categorical independent variables (corresponding to the row and columns) are ordered, then we can calculate a polychoric correlation coefficient. This coefficient is an approximation to what Pearson’s correlation coefficient would be if we had continuous data. For a 2 × 2 contingency table, the polychoric correlation coefficient is called the tetrachoric correlation coefficient.
You can think of the tetrachoric correlation between two dichotomous variables x and y as Pearson’s correlation between x′ and y′, where x′ and y′ are “latent” variables that are normally distributed (more precisely, jointly bivariate normal), with x = 1 when x′ > c and y = 1 when y′ > d for some unknown constants c and d (and x = 0, y = 0 otherwise).
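The latent-variable picture can be illustrated with a small simulation. This is only a sketch under stated assumptions: the latent correlation of 0.6, the cut points c = d = 0, the sample size and the helper names `pearson` and `simulate` are all hypothetical choices made for illustration, not values from the example on this page. It draws bivariate normal pairs (x′, y′), dichotomizes them, and compares Pearson’s correlation on the latent values with Pearson’s correlation on the 0/1 values.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def simulate(rho, c, d, n=20000, seed=0):
    """Return (latent correlation, correlation of the dichotomized data)."""
    rng = random.Random(seed)
    latent_x, latent_y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        latent_x.append(z1)
        # Standard construction of a second normal with correlation rho to z1
        latent_y.append(rho * z1 + sqrt(1 - rho * rho) * z2)
    x = [1 if v > c else 0 for v in latent_x]  # x = 1 when x' > c
    y = [1 if v > d else 0 for v in latent_y]  # y = 1 when y' > d
    return pearson(latent_x, latent_y), pearson(x, y)


# Hypothetical settings: latent correlation 0.6, cut points c = d = 0
latent_r, binary_r = simulate(rho=0.6, c=0.0, d=0.0)
print(f"latent: {latent_r:.3f}  dichotomized: {binary_r:.3f}")
```

The correlation computed on the 0/1 data comes out noticeably smaller than the latent correlation, which is the gap the tetrachoric coefficient is meant to close.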
Example
Looking at the marginal totals on the left side of Figure 1, we see that cell O5 can take any integer value between 0 and 10. Pearson’s correlation coefficient for each of these options is shown in column T. Cell T3, for example, contains the formula =SQRT(CHI_STAT(O4:P5)/Q6) where 0 is inserted in cell O5, and similarly for the other cells in column T (see Chi-square Effect Size). Note that the values range from -.7071 to .7071 (the square root gives only the magnitude; the sign follows the direction of the association).
The corresponding tetrachoric correlation coefficients shown in column U are generally larger in absolute value than the corresponding Pearson’s correlation coefficients and range from -1 to +1. In Tetrachoric Correlation Estimation we show how to calculate these tetrachoric correlation coefficients (using method 1 estimates).
Figure 1 – Comparing Pearson’s and tetrachoric correlation
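The Pearson column can be reproduced in outline with the phi coefficient of a 2 × 2 table. The following is a minimal sketch, not the worksheet itself: the marginal totals (rows 20 and 10, columns 15 and 15) are hypothetical values chosen only because they give a lower-left cell ranging from 0 to 10 and a phi range of ±.7071; the actual totals in Figure 1 are not reproduced here. The function names are likewise illustrative.

```python
from math import sqrt

# Hypothetical marginal totals (not taken from Figure 1)
ROW_TOTALS = (20, 10)
COL_TOTALS = (15, 15)


def phi_coefficient(a, b, c, d):
    """Signed Pearson correlation (phi) for the 2 x 2 table [[a, b], [c, d]].

    Its square equals chi-square / n, so |phi| = sqrt(chi-square / n).
    """
    den = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / den


def tables_for_margins(rows, cols):
    """Yield every 2 x 2 table (a, b, c, d) consistent with the margins."""
    r1, r2 = rows
    c1, _ = cols
    # c is the lower-left cell; the bounds keep all four cells non-negative
    for c in range(max(0, c1 - r1), min(r2, c1) + 1):
        a = c1 - c
        yield a, r1 - a, c, r2 - c


phis = [phi_coefficient(*t) for t in tables_for_margins(ROW_TOTALS, COL_TOTALS)]
for table, phi in zip(tables_for_margins(ROW_TOTALS, COL_TOTALS), phis):
    print(table, round(phi, 4))
```

With these assumed margins the coefficient cannot reach ±1 because no feasible table has an empty diagonal, which is why Pearson’s correlation is bounded away from ±1 here.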
Example Resolution
For example, suppose we have a 2 × 2 contingency table with the marginal totals shown in Figure 1.
The situation is similar for polychoric correlations. Suppose we have an m × n contingency table. In this case, x can take values 0 to m–1 and y can take values 0 to n–1. We now have constants c_0, …, c_m and d_0, …, d_n, where c_0 = d_0 = –∞ and c_m = d_n = ∞, and latent variables x′ and y′ such that x = i when c_i < x′ ≤ c_{i+1} and y = j when d_j < y′ ≤ d_{j+1}.
If x′ ∼ N(μ, σ²), then P(x = i) = NORM.DIST(c_{i+1}, μ, σ, TRUE) – NORM.DIST(c_i, μ, σ, TRUE). In fact, if we set u_i = (c_i – μ)/σ, then P(x = i) = NORM.S.DIST(u_{i+1}, TRUE) – NORM.S.DIST(u_i, TRUE), which is the area under the standard normal curve between u_i and u_{i+1}. The same is true for y′. Henceforth, we will assume that x′ and y′ follow standard normal distributions and use the constants c_i instead of u_i, and similarly for d_j.
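The category probabilities implied by a set of cut points can be sketched in Python with the standard library’s `statistics.NormalDist`, which plays the role of NORM.S.DIST here. The function name `category_probs` and the example cut points are assumptions made for illustration only.

```python
from math import inf
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def std_cdf(t):
    """Standard normal cdf, with the limits at -inf and +inf made explicit."""
    if t == -inf:
        return 0.0
    if t == inf:
        return 1.0
    return STD_NORMAL.cdf(t)


def category_probs(thresholds):
    """P(x = i) for cut points c_0 = -inf < c_1 < ... < c_m = +inf.

    Each probability is the area under the standard normal curve
    between consecutive cut points c_i and c_{i+1}.
    """
    return [std_cdf(hi) - std_cdf(lo)
            for lo, hi in zip(thresholds, thresholds[1:])]


# Hypothetical cut points for a three-category variable
probs = category_probs([-inf, -0.5, 1.0, inf])
print([round(p, 4) for p in probs])

# Standardizing: a cut point of 13 under N(10, 2^2) is u = (13 - 10) / 2 = 1.5
same = NormalDist(10, 2).cdf(13) - STD_NORMAL.cdf(1.5)
print(abs(same) < 1e-12)
```

The last two lines check the standardization step: working with u_i under the standard normal gives the same probability as working with c_i under N(μ, σ²).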
Suppose that the elements in the contingency table are h_ij. Then the marginal total for the ith row in the contingency table is a_i = Σ_j h_ij and the marginal total for the jth column is b_j = Σ_i h_ij, with a grand total of g = Σ_i a_i = Σ_j b_j.
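A natural estimate for c_i for 0 < i < m is c_i = NORM.S.INV((a_0 + … + a_{i–1})/g), i.e. the standard normal quantile of the cumulative proportion of observations in rows 0 through i–1. Similarly, a natural estimate for d_j for 0 < j < n is d_j = NORM.S.INV((b_0 + … + b_{j–1})/g).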
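As a rough sketch of this estimation step: since x = i exactly when c_i < x′ ≤ c_{i+1}, the standard normal cdf at c_i must equal the proportion of observations falling in categories below i, so the code below applies the inverse cdf to cumulative proportions. `NormalDist().inv_cdf` stands in for NORM.S.INV; the function name `estimate_thresholds` and the sample totals are hypothetical.

```python
from itertools import accumulate
from math import inf
from statistics import NormalDist


def estimate_thresholds(marginal_totals):
    """Estimate cut points c_0..c_m from the marginal totals a_0..a_{m-1}.

    c_0 = -inf and c_m = +inf by definition; each interior c_i is the
    standard normal quantile of the cumulative proportion below category i.
    Note: inv_cdf raises StatisticsError if a cumulative proportion is
    exactly 0 or 1, which happens when an end category is empty.
    """
    g = sum(marginal_totals)
    cumulative = list(accumulate(marginal_totals))[:-1]  # drop the total g
    interior = [NormalDist().inv_cdf(t / g) for t in cumulative]
    return [-inf] + interior + [inf]


# Hypothetical row totals (20, 10) and column totals (15, 15)
c = estimate_thresholds([20, 10])
d = estimate_thresholds([15, 15])
print(c)
print(d)
```

For a 2 × 2 table there is a single interior cut point per variable, so the cumulative proportion is just the first marginal total divided by g.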
Now, assuming bivariate normality, we also have
P(x = i, y = j) = BNORMSDIST(c_{i+1}, d_{j+1}, ρ, TRUE) – BNORMSDIST(c_i, d_{j+1}, ρ, TRUE) – BNORMSDIST(c_{i+1}, d_j, ρ, TRUE) + BNORMSDIST(c_i, d_j, ρ, TRUE)
where BNORMSDIST(x, y, ρ, TRUE) is the Real Statistics formula that computes the bivariate standard normal distribution (cdf) at (x, y) when the correlation coefficient is ρ (see Multivariate Normality Functions).
Note, however, that BNORMSDIST(x, y, ρ, TRUE) is not defined when x or y is equal to ±∞. In these cases, we use the following limiting values:
BNORMSDIST(–∞, y, ρ, TRUE) = BNORMSDIST(x, –∞, ρ, TRUE) = 0
BNORMSDIST(∞, y, ρ, TRUE) = NORM.S.DIST(y, TRUE)
BNORMSDIST(x, ∞, ρ, TRUE) = NORM.S.DIST(x, TRUE)
BNORMSDIST(∞, ∞, ρ, TRUE) = 1
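To make the cell probabilities and the infinite-limit conventions concrete, here is a self-contained Python sketch. It is not the Real Statistics implementation of BNORMSDIST: the function `bvn_cdf` is a hypothetical stand-in that evaluates the bivariate standard normal cdf by adding to Φ(x)Φ(y) the integral of the bivariate normal density over the correlation from 0 to ρ, approximated with Simpson’s rule. The cut points and ρ in the usage lines are made up for illustration.

```python
from math import exp, inf, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def bvn_cdf(x, y, rho, steps=200):
    """Bivariate standard normal cdf at (x, y); stand-in for BNORMSDIST."""
    if not -1 < rho < 1:
        raise ValueError("this sketch requires -1 < rho < 1")
    # Limiting values when either argument is infinite
    if x == -inf or y == -inf:
        return 0.0
    if x == inf and y == inf:
        return 1.0
    if x == inf:
        return PHI(y)
    if y == inf:
        return PHI(x)

    def density(r):
        # Bivariate standard normal density at (x, y) with correlation r
        q = 1 - r * r
        return exp(-(x * x - 2 * r * x * y + y * y) / (2 * q)) / (2 * pi * sqrt(q))

    # Simpson's rule on [0, rho]; steps must be even
    h = rho / steps
    total = density(0.0) + density(rho)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    return PHI(x) * PHI(y) + total * h / 3


def cell_prob(c, d, i, j, rho):
    """P(x = i, y = j): probability of the rectangle (c_i, c_{i+1}] x (d_j, d_{j+1}]."""
    return (bvn_cdf(c[i + 1], d[j + 1], rho) - bvn_cdf(c[i], d[j + 1], rho)
            - bvn_cdf(c[i + 1], d[j], rho) + bvn_cdf(c[i], d[j], rho))


# Hypothetical cut points for a 2 x 2 table and an assumed rho of 0.3
c = [-inf, 0.43, inf]
d = [-inf, 0.0, inf]
cells = [[cell_prob(c, d, i, j, 0.3) for j in range(2)] for i in range(2)]
print(cells)
```

Because the four corner terms telescope, the cell probabilities over the whole table add up to 1, and the row sums recover the marginal probabilities of x.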
Hello Charles,
Thanks for the free software!
I am confused at the polychoric correlation matrix, together with the Corr program. You mentioned that there are only two underlying latent variables. But, just what does that mean? The corr function will output a correlation matrix. A correlation matrix allows us to do Factor Analysis. What guarantees that we will end up with two factors in the process? Will the correlation matrix just have two eigenvalues with absolute value >1? Or will all variables load into just two factors?
If I am too far off, could you please give me a ref. to a proof of this?
Thanks.
The tetrachoric correlation refers to two latent variables. This is not surprising since the tetrachoric correlation is between two real variables.
A polychoric correlation matrix can have more than two variables, which in turn will refer to more than two latent variables.
The CORR function refers to pairwise Pearson’s correlation coefficients. With k variables, this will be a k × k matrix. Such a matrix will have k eigenvalues, so for k > 2 there will be more than 2 eigenvalues.
In general, polychoric correlations are not used much, but Pearson’s correlations are used a lot.
Charles
Thank you Charles, very helpful.